To push the last mile of LLM deployment, a growing body of work focuses on training and serving LLMs at low cost. In this blog, we summarize the most important works on efficient LLMs.
Table of contents
- Table of contents
- Background
- Survey
- Efficient LLM Serving
- Efficient LLM Training
- Conclusion
- Reference
Background
To catch up on the latest progress in efficient LLMs, I highly recommend the introductory course (TinyML and Efficient Deep Learning Computing) from Prof. Song Han's lab at MIT.
Transformer Math
Before we dive into efficient LLMs, let's review some of the math behind transformers. This section draws on two popular tech blogs: Transformer Math 101 and Transformer Inference Arithmetic.
Inference
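Two rules of thumb from Transformer Inference Arithmetic are worth keeping in mind, for a model with $P$ parameters, $n_{\text{layers}}$ layers, and hidden size $d_{\text{model}}$, assuming fp16/bf16 weights and no KV-cache sharing across heads:

$$
\text{FLOPs per generated token} \approx 2P, \qquad
\text{KV cache per token} \approx 2 \cdot 2 \cdot n_{\text{layers}} \cdot d_{\text{model}} \ \text{bytes}
$$

The first factor of 2 counts keys plus values; the second is the 2 bytes of an fp16 element. Because each decoding step streams all weights and the KV cache from memory while doing comparatively few FLOPs, small-batch decoding is memory-bandwidth bound.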
Training
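From Transformer Math 101, training a model with $P$ parameters on $D$ tokens costs approximately

$$
C \approx 6PD \ \text{FLOPs} \quad (2PD \ \text{forward} + 4PD \ \text{backward}),
$$

and mixed-precision training with Adam needs roughly 16-20 bytes of memory per parameter for weights, gradients, and optimizer states, before counting activations.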
Survey
- [arXiv 2023.02] Full Stack Optimization of Transformer Inference: a Survey, UC Berkeley
- A survey of full-stack optimizations for transformer inference, spanning hardware design and software techniques.
- Arithmetic intensity is a key metric for evaluating the efficiency of DNN models (see the formula after this list).
- [arXiv 2023.12] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, Carnegie Mellon University
- [TMLR 2024] Efficient Large Language Models: A Survey, The Ohio State University
- [arXiv 2024.04] A Survey on Efficient Inference for Large Language Models, Infinigence-AI and Tsinghua University
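For reference, arithmetic intensity measures how much compute a kernel performs per byte of memory traffic:

$$
\text{Arithmetic intensity} = \frac{\text{FLOPs}}{\text{bytes accessed}}
$$

Kernels whose intensity falls below the hardware's FLOPs-to-bandwidth ratio are memory-bound; batch-1 LLM decoding (roughly 2 FLOPs per 2-byte weight read) sits far below that ridge point on modern GPUs.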
Efficient LLM Serving
This section covers the LLM serving pipeline (a compute-bound prefill stage over the prompt, then a memory-bound decode stage that emits one token at a time), cost estimation, and metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT), and throughput. A minimal cost model is sketched below.
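As a starting point, here is a minimal sketch of a request-latency cost model. The 7B model size, 2 GB KV cache, 2 TB/s bandwidth, and 200 ms TTFT are illustrative assumptions, not measurements:

```python
# A minimal sketch of a serving cost model, assuming latency decomposes as
# TTFT + TPOT * (output_tokens - 1). All numbers are illustrative.

def request_latency(ttft_s, tpot_s, output_tokens):
    """End-to-end latency: time to first token, then per-token decode time."""
    return ttft_s + tpot_s * (output_tokens - 1)

def decode_tpot_floor(param_bytes, kv_cache_bytes, mem_bandwidth_bps):
    """Small-batch decode is memory-bound: each step must stream the weights
    and the KV cache from HBM at least once, bounding TPOT from below."""
    return (param_bytes + kv_cache_bytes) / mem_bandwidth_bps

# Illustrative: a 7B fp16 model (~14 GB weights, ~2 GB KV cache) on a GPU
# with ~2 TB/s of HBM bandwidth (hypothetical configuration).
tpot = decode_tpot_floor(14e9, 2e9, 2e12)       # -> 8 ms/token floor
print(f"TPOT floor: {tpot * 1e3:.1f} ms/token")
print(f"256-token request at 200 ms TTFT: {request_latency(0.2, tpot, 256):.2f} s")
```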
Serving LLMs over Heterogeneous Environments
As AWS has reported, about 90% of machine-learning demand in the cloud is for inference.
Key challenges: KV-cache management for long contexts, and the sparse computation of Mixture-of-Experts (MoE) architectures, sketched below.
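To make the MoE challenge concrete, here is a minimal NumPy sketch of top-k expert routing, the sparse computation pattern behind MoE layers. The layer sizes, the toy scaling "experts", and k=2 are illustrative assumptions:

```python
import numpy as np

def topk_moe_layer(x, w_gate, experts, k=2):
    """Minimal top-k MoE routing: each token is processed by only its
    k highest-scoring experts, so compute is sparse over experts."""
    logits = x @ w_gate                          # (tokens, n_experts) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts
    sel = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over the
    gates /= gates.sum(axis=-1, keepdims=True)             # selected experts only
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):                       # dispatch each token to k experts
            e = topk[t, j]
            y[t] += gates[t, j] * experts[e](x[t])
    return y

# Toy usage: 4 tokens, hidden size 8, four scaling "experts".
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_gate = rng.normal(size=(8, 4))
experts = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0, 4.0)]
print(topk_moe_layer(x, w_gate, experts, k=2).shape)  # (4, 8)
```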
- [ASPLOS 2023] STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining, University of Virginia
- [ACL 2023 Demo] Petals: Collaborative Inference and Fine-tuning of Large Models, HSE University
- An implementation of naive pipeline parallelism across heterogeneous devices.
- [arXiv 2024.04] Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, UC Berkeley
- A cost-efficient GPU allocation strategy for LLM serving that accounts for request size, request rate, and the latency service-level objective (SLO).
- [MLSys 2024] HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, National University of Singapore
- An I/O-aware parallel strategy for on-device LLM serving.
- [ICML 2024] HEXGEN: Generative Inference of Large Language Model over Heterogeneous Environment, The Hong Kong University of Science and Technology and ETH Zurich
- Without a careful model-partition strategy, naive pipeline and tensor parallelism run into out-of-memory errors (see the sketch after this list).
- An implementation that combines tensor model parallelism and pipeline parallelism.
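To see why partitioning matters, here is a minimal sketch that checks whether a given tensor-parallel/pipeline-parallel split of fp16 weights plus KV cache fits in each GPU's memory. The model and GPU numbers are illustrative assumptions, and activations are ignored:

```python
# Minimal feasibility check for a TP x PP partition, assuming fp16 weights,
# an even split of layers across pipeline stages, and an even split of each
# layer across tensor-parallel ranks. Activations are ignored for brevity.

def fits(per_gpu_mem_gb, n_layers, bytes_per_layer, pp, tp, kv_bytes_per_gpu=0.0):
    layers_per_stage = -(-n_layers // pp)            # ceil division
    weight_bytes = layers_per_stage * bytes_per_layer / tp
    return (weight_bytes + kv_bytes_per_gpu) / 1e9 <= per_gpu_mem_gb

# Illustrative 70B-class model: ~140 GB of fp16 weights over 80 layers.
BYTES_PER_LAYER = 140e9 / 80

# Naive PP=2 on 24 GB GPUs: each stage holds ~70 GB -> OOM.
print(fits(24, 80, BYTES_PER_LAYER, pp=2, tp=1))     # False
# PP=4 x TP=2 on the same GPUs: ~17.5 GB per GPU -> fits (before KV cache).
print(fits(24, 80, BYTES_PER_LAYER, pp=4, tp=2))     # True
```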
Datasets
- AzurePublicDataset: Microsoft Azure Traces
Efficient LLM Training
The standard LLM training pipeline contains three stages: pre-training, fine-tuning, and alignment.
- Pre-training learns general knowledge from a large-scale corpus (~20P tokens, where P is the number of parameters, per the Chinchilla scaling rule; a 7B model thus calls for roughly 140B tokens). A back-of-the-envelope cost estimate is sketched after this list.
- Fine-tuning improves task-specific capability, such as machine translation, on instruction datasets (~1%-5% of the pre-training data).
- Alignment tunes the trained model toward human preferences, making LLM outputs more responsible and consistent.
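Combining the $C \approx 6PD$ rule from the Transformer Math section with the ~20-tokens-per-parameter heuristic gives a quick pre-training cost estimate. In this sketch, the 40% MFU is an assumption for illustration; 312 TFLOP/s is an A100's bf16 peak:

```python
# Back-of-the-envelope pre-training cost, combining C ~= 6*P*D FLOPs with
# the ~20 tokens-per-parameter heuristic. MFU (model FLOPs utilization) is
# an assumed 40%; peak is an A100's 312 TFLOP/s in bf16.

def pretrain_gpu_hours(params, tokens_per_param=20, peak_flops=312e12, mfu=0.40):
    total_flops = 6 * params * (tokens_per_param * params)
    return total_flops / (peak_flops * mfu) / 3600

# A 7B-parameter model trained on ~140B tokens:
hours = pretrain_gpu_hours(7e9)
print(f"~{hours:,.0f} A100-hours (~{hours / 24:,.0f} A100-days)")
```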
Conclusion
Reference
- [GitHub Repo] HetServe-LLMs: An Overview of Efficiently Serving Large Language Models across Edge Devices, New York University and Shandong University.
- [GitHub Repo] Awesome LLM Systems Papers, University of Michigan.
- [GitHub Repo] Generative AI Roadmap, University of Pittsburgh.
- [ISCA 2024] Splitwise: Efficient generative LLM inference using phase splitting, University of Washington.