Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
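As parameter counts grow rapidly, deploying large generative models becomes increasingly challenging because they typically require large GPU memory consumption and massive computation. Unstructured model pruning is a common approach to reduce both the GPU memory footprint and the overall computation while retaining good model accuracy. However, existing solutions do not provide efficient support for unstructured sparsity on modern GPUs, especially on the highly structured Tensor Core hardware. We therefore propose Flash-LLM, which enables low-cost and highly efficient large generative model inference with sophisticated support for unstructured sparsity on high-performance but highly restrictive Tensor Cores. Our key observation is that the main bottleneck of generative model inference is several skinny matrix multiplications, for which Tensor Cores are significantly under-utilized because of low computational intensity. Based on this observation, we propose a general Load-as-Sparse and Compute-as-Dense methodology for unstructured sparse matrix multiplication (SpMM). The basic insight is to address the significant memory bandwidth bottleneck while tolerating redundant computations that are not critical for end-to-end performance on Tensor Cores. Based on this, we design an effective software framework for Tensor Core based unstructured SpMM, leveraging on-chip resources for efficient sparse data extraction and computation/memory-access overlapping. At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA by an average of 2.9× and 1.5×, respectively. At the end-to-end framework level on OPT-30B/66B/175B models, measured in tokens per GPU-second, Flash-LLM achieves up to 3.8× and 3.6× improvement over DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.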
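As model parameter counts grow rapidly, deploying large generative models becomes increasingly challenging because they typically require large amounts of GPU memory and compute. Unstructured model pruning is a common way to reduce the GPU memory footprint and the total amount of computation while still retaining good model accuracy. However, existing solutions cannot provide efficient support for unstructured sparsity on modern GPUs, especially on the highly structured Tensor Core hardware.
We therefore propose Flash-LLM, which enables low-cost, highly efficient inference for large generative models by providing sophisticated support for unstructured sparsity on high-performance but highly restrictive Tensor Cores.
We observe that the main bottleneck of generative model inference lies in several skinny matrix multiplications, for which Tensor Cores are severely under-utilized because of low computational intensity. To address this, we propose a general Load-as-Sparse and Compute-as-Dense approach for unstructured sparse matrix multiplication (SpMM). The basic idea is to relieve the significant memory bandwidth bottleneck while tolerating redundant computation on the Tensor Cores.
Based on this, we design an effective software framework for Tensor Core based unstructured SpMM that uses on-chip resources for efficient sparse data extraction and for overlapping computation with memory access.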
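To illustrate why overlapping computation with memory access matters, the following is a minimal timing sketch of a two-stage software pipeline (double buffering). It is an idealised model, not the Flash-LLM kernel: the function names and the per-tile `load` and `compute` costs are hypothetical, and it assumes every tile costs the same.

```python
def serial_time(load, compute, tiles):
    """Total time when each tile is loaded and then computed, with no overlap."""
    return tiles * (load + compute)


def overlapped_time(load, compute, tiles):
    """Total time for an idealised two-stage pipeline (double buffering).

    The load of tile i+1 runs while tile i is being computed, so each
    steady-state step costs max(load, compute). Only the first load and
    the last compute are left exposed.
    """
    if tiles == 0:
        return 0
    return load + (tiles - 1) * max(load, compute) + compute


# Hypothetical per-tile costs in arbitrary time units.
print(serial_time(3, 2, 10))      # 50
print(overlapped_time(3, 2, 10))  # 32
```

In this model, when the load time dominates, the overlapped total approaches the pure load time, which is why reducing the bytes loaded per tile translates almost directly into end-to-end speedup.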
Advantages of Flash-LLM
How Flash-LLM Works
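The core idea of Flash-LLM is Load-as-Sparse and Compute-as-Dense. We note that the key matrix multiplications in generative model inference are usually very "skinny": one dimension, typically tied to the batch size, is much smaller than the others. As a result, the performance of these skinny matrix multiplications is bounded by global memory access (that is, memory bandwidth) rather than by the compute capability of the Tensor Cores.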
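The memory-bound claim can be checked with a short roofline-style calculation. The sketch below is illustrative only: the matrix shapes are hypothetical, and the peak figures (roughly 312 TFLOP/s of dense FP16 Tensor Core throughput and roughly 2 TB/s of memory bandwidth for an A100 80GB) are approximate public numbers used here as assumptions, not values taken from the article.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte of global-memory traffic for C[m,n] = A[m,k] @ B[k,n].

    Assumes each matrix is read or written exactly once and that all
    elements are FP16 (2 bytes).
    """
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


# Assumed approximate hardware peaks (not from the article).
PEAK_FLOPS = 312e12   # FLOP/s
PEAK_BANDWIDTH = 2e12  # bytes/s
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOP/byte needed to saturate compute


def is_memory_bound(m, k, n):
    """True when the multiplication sits below the roofline ridge point."""
    return matmul_intensity(m, k, n) < RIDGE_POINT


# Hypothetical shapes: a 4096x4096 weight matrix with batch size 8 vs. 4096.
print(round(matmul_intensity(4096, 4096, 8), 2))  # 7.97
print(is_memory_bound(4096, 4096, 8))             # True
print(is_memory_bound(4096, 4096, 4096))          # False
```

With a batch dimension of 8, the intensity is about 8 FLOP/byte, far below the assumed ridge point of 156 FLOP/byte, so the kernel spends most of its time waiting on weight loads.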
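We therefore propose a new approach that makes better use of the limited memory bandwidth by loading the weights in sparse form, while tolerating redundant computation on the Tensor Cores.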
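A rough estimate shows how much weight traffic sparse loading can save. This sketch assumes a simple encoding in which every nonzero is stored as a 2-byte FP16 value plus a 2-byte position; that encoding and the function names are assumptions for illustration, not a description of the actual Flash-LLM storage format.

```python
def dense_bytes(m, k, value_bytes=2):
    """Bytes needed to load an m x k FP16 weight matrix in dense form."""
    return m * k * value_bytes


def sparse_bytes(m, k, sparsity, value_bytes=2, index_bytes=2):
    """Bytes needed to load only the nonzeros, each with its position.

    `sparsity` is the fraction of zero entries, between 0.0 and 1.0.
    """
    nnz = round(m * k * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes)


def traffic_reduction(m, k, sparsity):
    """How many times fewer bytes the sparse form loads than the dense form."""
    return dense_bytes(m, k) / sparse_bytes(m, k, sparsity)


# Hypothetical 4096 x 4096 weight matrix.
print(round(traffic_reduction(4096, 4096, 0.8), 2))  # 2.5
print(round(traffic_reduction(4096, 4096, 0.5), 2))  # 1.0
```

Under this assumed encoding, the break-even point is 50% sparsity. Below that, the index overhead outweighs the savings; at 80% sparsity, the weight traffic drops by about 2.5×.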
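Design of Flash-LLM
Flash-LLM uses both SIMT cores and Tensor Cores to execute unstructured SpMM efficiently. The SIMT cores handle the sparse-to-dense transformation (Load-as-Sparse), while the Tensor Cores handle the compute-intensive tensor computation (Compute-as-Dense).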
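The two stages can be sketched functionally in plain Python. This is a CPU-only illustration of the idea, not the GPU kernel: the `(row, col, value)` encoding and all function names are hypothetical, the scatter step stands in for what the SIMT cores do into on-chip memory, and the dense product stands in for the Tensor Core stage.

```python
def compress(dense):
    """Encode a dense tile as (row, col, value) triples for its nonzeros."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]


def load_as_sparse(nonzeros, rows, cols):
    """Stage 1: scatter the nonzeros into a zero-initialised dense tile."""
    tile = [[0.0] * cols for _ in range(rows)]
    for r, c, v in nonzeros:
        tile[r][c] = v
    return tile


def compute_as_dense(a_tile, b):
    """Stage 2: ordinary dense matrix multiply, zeros included."""
    inner, n = len(b), len(b[0])
    return [[sum(a_row[k] * b[k][j] for k in range(inner)) for j in range(n)]
            for a_row in a_tile]


def spmm(nonzeros, rows, cols, b):
    """Load-as-Sparse followed by Compute-as-Dense."""
    return compute_as_dense(load_as_sparse(nonzeros, rows, cols), b)


a = [[0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0, 0.0]]
b = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [2.0, 2.0]]

print(spmm(compress(a), 3, 4, b))  # [[0.0, 2.0], [7.0, 6.0], [0.0, 0.0]]
```

The dense stage multiplies by zeros that a purely sparse kernel would skip. That is the redundant computation the approach tolerates, on the premise that memory traffic, not arithmetic, is the limiting factor.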
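Evaluation of Flash-LLM
We evaluate Flash-LLM at two levels: kernel-level benchmarking and model-level evaluation. The evaluation is conducted on the NVIDIA A100-SMX8-80GB platform.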
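Applications of Flash-LLM
Flash-LLM can be integrated easily into other deep learning frameworks through library calls to the Flash-LLM API. It can also be used for a variety of tasks.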
Future Directions for Flash-LLM