To push the last mile of LLM deployment, many works focus on training and serving LLMs at low cost. In this blog, we summarize the most important works on efficient LLMs.


Background

To catch up with the latest progress in efficient LLMs, I highly recommend the introductory course TinyML and Efficient Deep Learning Computing from Song Han's lab at MIT.

Transformer Math

Before we dive into efficient LLMs, let's review some of the math behind transformers. This section draws on two popular tech blogs: Transformer Math 101 and Transformer Inference Arithmetic.

Inference
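
A back-of-the-envelope sketch of two useful inference rules of thumb from Transformer Inference Arithmetic: a decode step costs roughly 2 FLOPs per parameter per generated token, and the KV cache grows by about 2 · n_layers · d_model elements per cached token. The 7B-class configuration below is an illustrative assumption, not any specific model.

```python
# Back-of-envelope inference estimates (illustrative sketch, not from the original post).
# Rules of thumb: decode FLOPs ~ 2 * params per token; KV cache ~ 2 * n_layers * d_model
# elements per cached token (keys + values), times bytes per element.

def decode_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass FLOPs to generate one token."""
    return 2.0 * n_params

def kv_cache_bytes(n_layers: int, d_model: int, seq_len: int,
                   batch_size: int = 1, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size: keys and values for every layer and position."""
    return 2 * n_layers * d_model * seq_len * batch_size * bytes_per_elem

if __name__ == "__main__":
    n_params, n_layers, d_model = 7e9, 32, 4096   # assumed 7B-class model
    print(f"decode FLOPs/token : {decode_flops_per_token(n_params):.2e}")
    print(f"KV cache @ 4k ctx  : {kv_cache_bytes(n_layers, d_model, 4096) / 2**30:.2f} GiB")
```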

Training
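
Similarly, a hedged sketch of the training-side rules of thumb from Transformer Math 101: total training compute is roughly 6 · P · D FLOPs (2 for the forward pass and 4 for the backward pass, per parameter per token), and mixed-precision AdamW keeps roughly 16-18 bytes of state per parameter (weights, gradients, and optimizer moments), excluding activations. The 7B model and ~20 tokens/parameter budget are assumptions for illustration.

```python
# Training rules of thumb (illustrative sketch, not from the original post):
#   * total training compute       ~ 6 * P * D FLOPs (2 forward + 4 backward per token)
#   * mixed-precision AdamW state  ~ 16-18 bytes/parameter, excluding activations
#     (2 B fp16 weights + 2-4 B gradients + 12 B optimizer states)

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def training_state_bytes(n_params: float, bytes_per_param: float = 18.0) -> float:
    return n_params * bytes_per_param

if __name__ == "__main__":
    P, D = 7e9, 20 * 7e9                     # assumed 7B model, ~20 tokens/param
    print(f"training FLOPs      : {training_flops(P, D):.2e}")
    print(f"weights+grads+optim : {training_state_bytes(P) / 2**30:.0f} GiB")
```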

Survey

Efficient LLM Serving

An LLM inference request goes through two stages: prefill, which processes the prompt and produces the first output token, and decode, which generates the remaining tokens one at a time. Serving cost is usually estimated from the GPU time and memory each stage consumes, while quality of service is tracked with metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT), end-to-end latency, and throughput.
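
A minimal sketch (my own illustration, not any particular serving system) of how these metrics can be computed from per-request timestamps; the Request fields are assumptions for the example.

```python
# Toy computation of common serving metrics from per-request timing records.
from dataclasses import dataclass

@dataclass
class Request:
    arrival: float          # seconds since experiment start
    first_token: float      # time the first output token was returned
    finished: float         # time the last output token was returned
    output_tokens: int

def ttft(r: Request) -> float:
    """Time to first token: dominated by the prefill stage."""
    return r.first_token - r.arrival

def tpot(r: Request) -> float:
    """Average time per output token during decode."""
    return (r.finished - r.first_token) / max(r.output_tokens - 1, 1)

def throughput(reqs: list[Request]) -> float:
    """Aggregate generated tokens per second over the whole trace."""
    span = max(r.finished for r in reqs) - min(r.arrival for r in reqs)
    return sum(r.output_tokens for r in reqs) / span

if __name__ == "__main__":
    reqs = [Request(0.0, 0.25, 2.05, 64), Request(0.1, 0.40, 3.10, 96)]
    print(f"TTFT: {ttft(reqs[0]):.2f}s  TPOT: {tpot(reqs[0]):.3f}s  "
          f"throughput: {throughput(reqs):.1f} tok/s")
```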

Serving LLMs over Heterogeneous Environments

As AWS reported, 90% of machine learning demand in the cloud is for inference.

Key challenges: KV-cache management for long contexts and sparse computation in MoE architectures.
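
As a concrete illustration of the first challenge, here is a toy sketch of block-based KV-cache management, in the spirit of paged allocation: the cache is carved into fixed-size blocks so that long, variable-length sequences do not need one large contiguous buffer. The class and its sizes are hypothetical, not code from any particular system.

```python
# Toy block-based KV-cache allocator (illustrative sketch of KV-cache management
# for long contexts; names and sizes are assumptions).

class BlockAllocator:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))          # indices of free cache blocks
        self.tables: dict[int, list[int]] = {}       # seq_id -> list of block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the block holding position `pos`, allocating a new block on demand."""
        table = self.tables.setdefault(seq_id, [])
        if pos // self.block_size >= len(table):
            if not self.free:
                raise MemoryError("KV cache exhausted; evict or preempt a sequence")
            table.append(self.free.pop())
        return table[pos // self.block_size]

    def release(self, seq_id: int) -> None:
        """Free all blocks of a finished sequence."""
        self.free.extend(self.tables.pop(seq_id, []))
```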

Datasets

  1. AzurePublicDataset: Microsoft Azure Traces, including LLM inference traces (see the loading sketch below).
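
A hedged sketch of how such a trace might be loaded for replay in serving experiments. The file name and the TIMESTAMP / ContextTokens / GeneratedTokens columns follow the Azure LLM inference traces published in AzurePublicDataset, but check the repository for the exact schema before relying on them.

```python
# Load an Azure-style LLM inference trace and derive relative arrival times.
import pandas as pd

def load_trace(path: str) -> pd.DataFrame:
    """Read a trace CSV and add arrival times in seconds since the first request."""
    df = pd.read_csv(path, parse_dates=["TIMESTAMP"])
    df["arrival_s"] = (df["TIMESTAMP"] - df["TIMESTAMP"].min()).dt.total_seconds()
    return df[["arrival_s", "ContextTokens", "GeneratedTokens"]]

if __name__ == "__main__":
    trace = load_trace("AzureLLMInferenceTrace_conv.csv")   # assumed file name
    print(trace.head())
    print("mean prompt length :", trace["ContextTokens"].mean())
    print("mean output length :", trace["GeneratedTokens"].mean())
```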

Efficient LLM Training

The standard LLM training pipeline consists of three stages: pre-training, fine-tuning, and alignment.

  • Pre-training aims to learn general knowledge from a large-scale corpus (~20P tokens, where P is the number of parameters; see the sizing sketch after this list).
  • Fine-tuning aims to improve task-related capability, such as machine translation, on instruction datasets (~1%-5% of the pre-training data).
  • Alignment tunes the trained model with human preference so that it produces more responsible and consistent answers.
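
A quick sizing sketch (my own arithmetic, assuming a 7B-parameter model) that turns the rules of thumb above into numbers: a ~20-tokens-per-parameter pre-training budget and a 1%-5% instruction-tuning slice of it.

```python
# Worked sizing example for the training-data rules of thumb above
# (illustrative arithmetic; the 7B model size is an assumption).

def pretrain_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return tokens_per_param * n_params

def finetune_tokens(n_pretrain_tokens: float, fraction: float = 0.01) -> float:
    return fraction * n_pretrain_tokens

if __name__ == "__main__":
    P = 7e9                                   # assumed 7B-parameter model
    D = pretrain_tokens(P)
    print(f"pre-training budget : {D:.1e} tokens (~20 tokens/parameter)")
    print(f"fine-tuning (1%-5%) : {finetune_tokens(D, 0.01):.1e}"
          f" - {finetune_tokens(D, 0.05):.1e} tokens")
```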

Conclusion


Reference

  1. [GitHub Repo] HetServe-LLMs: An Overview of Efficiently Serving Large Language Models across Edge Devices, New York University and Shandong University.
  2. [GitHub Repo] Awesome LLM Systems Papers, University of Michigan.
  3. [GitHub Repo] Generative AI Roadmap, University of Pittsburgh.
  4. [ISCA 2024] Splitwise: Efficient generative LLM inference using phase splitting, University of Washington.