Inference with large language models (LLMs) is computationally expensive; in particular, the cost of self-attention grows quadratically with sequence length. The KV cache was introduced to speed up inference: by storing the key and value tensors of past tokens, it avoids recomputing them at every decoding step, reducing the per-step attention cost from quadratic to linear in the sequence length.
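To make the mechanism concrete, here is a minimal, self-contained sketch of a single attention head decoding with a KV cache. Everything in it (the toy dimensions, the random weights, and the decode_step helper) is made up for illustration and is not tied to any particular library or model.

```python
# Illustrative sketch of KV caching during autoregressive decoding (single head).
import numpy as np

d_model, d_head = 16, 16                         # toy sizes for a single attention head
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))

k_cache, v_cache = [], []                        # the KV cache: one K and one V row per past token

def decode_step(x_new):
    """Attend the newest token over itself and all cached tokens."""
    q = x_new @ W_q                              # query for the new token only
    k_cache.append(x_new @ W_k)                  # compute K/V for the new token once; without the
    v_cache.append(x_new @ W_v)                  # cache they would be recomputed at every step
    K = np.stack(k_cache)                        # (t, d_head)
    V = np.stack(v_cache)
    scores = (K @ q) / np.sqrt(d_head)           # scores of the new token against all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                           # attention output for the new token

for _ in range(5):                               # pretend we decode 5 tokens
    out = decode_step(rng.standard_normal(d_model))
print(out.shape)                                 # (16,)
```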
The KV Cache Trade-off: Trading Memory for Compute
The KV cache is a trade-off: it spends memory to buy faster computation. This post looks at how large the KV cache actually gets, the challenges that size creates, and the strategies available to cope with it.
How Big Is the KV Cache, Really?
For each token of each sequence in the batch, we need to store two tensors (one key tensor and one value tensor) of dimension d_head for each attention head of each attention layer. The space taken by each tensor parameter depends on its precision: 4 bytes/parameter in full precision (FP32), 2 bytes/parameter in half precision (BF16, FP16), 1 byte/parameter for 8-bit data types (INT8, FP8), and so on. Let b be the batch size, t the total sequence length (prompt + completion), n_layers the number of decoder blocks / attention layers, n_heads the number of attention heads per attention layer, d_head the hidden dimension of each attention head, and p_a the precision in bytes per parameter. The per-token memory consumption of the KV cache (in bytes) for a multi-head attention (MHA) model is then:
$$
\text{KV\_cache\_per\_token} = 2 \cdot n_{layers} \cdot n_{heads} \cdot d_{head} \cdot p_a
$$
Note: in an MHA model, n_heads · d_head = d_model, but we will not use this identity to simplify the formula. The total size of the KV cache (in bytes) is therefore:
$$
\text{KV\_cache\_total} = b \cdot t \cdot \text{KV\_cache\_per\_token}
$$
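Translated directly into code, the two formulas look like this. This is just a sketch of the arithmetic above, not any library's API; the function names and the example shapes (Llama-2-7B-like: 32 layers, 32 heads, d_head = 128, half precision) are chosen for illustration.

```python
# Direct transcription of the KV cache size formulas.
def kv_cache_bytes_per_token(n_layers: int, n_heads: int, d_head: int, p_a: int) -> int:
    # The factor 2 accounts for storing both a key and a value tensor.
    return 2 * n_layers * n_heads * d_head * p_a

def kv_cache_bytes_total(b: int, t: int, per_token_bytes: int) -> int:
    return b * t * per_token_bytes

# Llama-2-7B-like shapes in half precision (2 bytes/parameter).
per_token = kv_cache_bytes_per_token(n_layers=32, n_heads=32, d_head=128, p_a=2)
print(per_token / 2**20)                                                      # ~0.5 MiB per token
print(kv_cache_bytes_total(b=56, t=512, per_token_bytes=per_token) / 2**30)   # ~14 GiB in total
```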
The KV Cache Challenge: Memory Blow-up
As the formula shows, the KV cache grows linearly with both the batch size and the total sequence length. Since the total sequence length cannot be known in advance, the memory the KV cache will eventually need is also unknown ahead of time, which makes memory management genuinely hard.
Consider some popular MHA models (Table 1), including Meta's Llama-2 and OPT, MosaicML's MPT, and BigScience's BLOOM:
Table 1: Specifications of popular multi-head attention (MHA) models
Assuming the parameters are stored in half precision (FP16, BF16), let's pick a smaller model (Llama-2-7B) and a larger one (BLOOM-176B). For Llama-2-7B (BLOOM-176B), the KV cache consumes roughly 0.5 MB per token (4 MB per token).
Take Llama-2-7B: in half precision, loading the model weights takes about 14 GB of memory, the same as caching the keys and values for 28k tokens. And 28k tokens could correspond to, say, a batch of 56 sequences of length 512, which is nothing extreme.
These numbers show that the memory consumed by the KV cache can get very large, and for long sequences or large batches it can even exceed the memory required to load the model weights themselves.
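A quick back-of-the-envelope check of that claim, taking the ~14 GB weight figure and the ~0.5 MB/token cache cost for Llama-2-7B in half precision at face value:

```python
# Break-even point where the KV cache matches the weight memory (Llama-2-7B, half precision).
weight_bytes = 14 * 2**30                      # ~14 GiB of half-precision weights
kv_per_token = 2 * 32 * 32 * 128 * 2           # ~0.5 MiB/token, from the formula above

breakeven_tokens = weight_bytes // kv_per_token
print(breakeven_tokens)                        # 28672 tokens: the cache now equals the weights
print(breakeven_tokens // 512)                 # i.e. a batch of 56 sequences of length 512
```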
Tackling the KV Cache Challenge: Memory Optimization Strategies
To relieve the memory pressure created by the KV cache, several families of optimizations are available:
1. Reduce the memory footprint of the model weights
2. Reduce the memory footprint of the KV cache itself (see the sketch after this list)
3. Leverage memory across multiple devices
4. Optimize memory management
5. Expand storage capacity
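One way strategy 2 plays out in practice is multi-query attention (MQA) or grouped-query attention (GQA) (see [11] and [12] in the references): only n_kv_heads key/value heads are cached instead of n_heads, shrinking the cache by a factor of n_heads / n_kv_heads. The sketch below simply reuses the per-token formula with that substitution; the shapes are illustrative assumptions, not taken from any specific model card.

```python
# KV cache per token when only n_kv_heads key/value heads are cached (GQA/MQA).
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int, p_a: int) -> int:
    return 2 * n_layers * n_kv_heads * d_head * p_a

mha = kv_cache_bytes_per_token(32, n_kv_heads=32, d_head=128, p_a=2)   # full MHA
gqa = kv_cache_bytes_per_token(32, n_kv_heads=8,  d_head=128, p_a=2)   # grouped-query
mqa = kv_cache_bytes_per_token(32, n_kv_heads=1,  d_head=128, p_a=2)   # multi-query
print(mha / gqa, mha / mqa)   # 4x and 32x smaller KV cache, respectively
```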
KV Cache Optimization: Future Directions
As LLMs continue to evolve, KV cache optimization will remain an important research direction, and we can expect more optimization strategies targeting the KV cache to emerge.
Summary
This post examined the memory challenge posed by the KV cache and the strategies for dealing with it. By optimizing the model weights, the KV cache itself, memory management, and storage capacity, the memory pressure created by the KV cache can be kept in check and LLM inference made more efficient.
References
[1]: Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023)
[2]: OPT: Open Pre-trained Transformer Language Models (Zhang et al., 2022)
[3]: Release blog posts for: MPT-7B (May 2023) and MPT-30B (June 2023)
[4]: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (BigScience, 2023)
[5]: Scaling Laws for Neural Language Models (Kaplan et al., 2020)
[6]: Mistral 7B (Jiang et al., 2023)
[7]: Efficient Streaming Language Models with Attention Sinks (Xiao et al., 2023) + GitHub repository
[8]: H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Zhang et al., 2023) + GitHub repository
[9]: Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time (Liu et al., 2023)
[10]: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs (Ge et al., 2023)
[11]: Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019)
[12]: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)
[13]: PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)
[14]: The Falcon Series of Open Language Models (Almazrouei et al., 2023)
[15]: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023) + GitHub repository
[16]: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022) + GitHub repository
[17]: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022) + GitHub repository
[18]: SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Xiao et al., 2022) + GitHub repository
[19]: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (Sheng et al., 2023) + GitHub repository
[20]: Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023) + GitHub repository
[21]: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (Kwon et al., 2023)
[22]: Efficiently Programming Large Language Models using SGLang (Zheng et al., 2023) + Blog post
[23]: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (Huang et al., 2018)
[24]: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (Narayanan et al., 2021)
[25]: Efficiently Scaling Transformer Inference (Pope et al., 2022)
[26]: Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (Lin et al., 2024)