Problem:
Traditional Position Encoding Limitations: Existing position encoding methods, such as absolute and relative PE, use token counts as the unit of measurement. This is insufficient for tasks that require attending to higher-level abstractions like words or sentences, because the number of tokens in such units varies widely.
Inability to Generalize: Standard PE methods struggle to generalize to out-of-distribution scenarios where the token distribution differs from the training data.
Proposed Solution: CoPE
CoPE addresses these limitations by making position encoding context-dependent. Here’s how it works:
Gate Calculation: For each query token, CoPE computes a gate value for every preceding token in the sequence. The gate, obtained by applying a sigmoid to the dot product of the query and key vectors, decides whether that token is counted when measuring relative position.
A gate value close to 1 indicates the token should be counted.
A gate value close to 0 indicates the token should be ignored.
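The gate computation can be sketched in a few lines of NumPy (a toy illustration; the dimensions, random seed, and variable names are assumptions, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: one query token attending over 5 preceding tokens.
d = 4                            # head dimension (illustrative)
rng = np.random.default_rng(0)
q = rng.normal(size=d)           # query vector of the current token
K = rng.normal(size=(5, d))      # key vectors of the preceding tokens

# Gate g_j = sigmoid(q . k_j): near 1 means "count token j",
# near 0 means "ignore it" when measuring relative position.
gates = sigmoid(K @ q)
```

Because the gates come from a sigmoid they are soft rather than binary, which is what later makes fractional positions possible.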
Position Calculation: CoPE derives the position of each target token by summing the gate values of all tokens between it and the current query (inclusive). Because the gates are soft, these sums can be fractional, enabling finer-grained position encoding.
Position Embedding Interpolation: As fractional position values don’t have direct embeddings, CoPE interpolates between embeddings of the two nearest integer positions.
Attention Calculation: Finally, CoPE incorporates the interpolated position embeddings into the attention mechanism, allowing for context-aware position-based attention.
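Putting the four steps together, a minimal NumPy sketch for a single query attending over its prefix might look as follows. The toy sizes, random initialization, and variable names are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
T, d, p_max = 6, 4, 8                # sequence length, head dim, max integer position

K = rng.normal(size=(T, d))          # keys of tokens 0..T-1
E = rng.normal(size=(p_max + 1, d))  # embeddings for integer positions 0..p_max
q = rng.normal(size=d)               # query of the last token (index T-1)

# 1) Gates: which previous tokens should be counted.
g = sigmoid(K @ q)                   # shape (T,)

# 2) Fractional positions: p[j] = g[j] + g[j+1] + ... + g[T-1].
p = np.cumsum(g[::-1])[::-1]
p = np.clip(p, 0, p_max)

# 3) Interpolate embeddings of the two nearest integer positions.
lo = np.floor(p).astype(int)
hi = np.minimum(lo + 1, p_max)
w = p - lo                           # fractional part
E_p = (1 - w)[:, None] * E[lo] + w[:, None] * E[hi]

# 4) Attention: add the position term q.e[p_j] to the content term q.k_j.
logits = K @ q + E_p @ q
attn = softmax(logits)
```

Note that the position term is added to the content term inside the attention logits, so position-based and content-based attention are combined before the softmax.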
Advantages of CoPE:
Contextualized Position Encoding: CoPE enables the model to learn different position encodings based on the context, allowing it to attend to various levels of abstraction (e.g., words, sentences).
Improved Generalization: CoPE demonstrates superior generalization capabilities compared to traditional methods, especially in out-of-distribution scenarios.
Experimental Results:
The paper showcases CoPE’s effectiveness on various tasks:
Flip-Flop Task: CoPE achieves near-perfect accuracy in both in-distribution and out-of-distribution settings, outperforming existing PE methods.
Selective Copy Task: CoPE successfully learns to copy relevant tokens while ignoring blanks, demonstrating its ability to handle variable-length units.
Counting Task: CoPE exhibits superior performance in counting specific tokens, even with varying context lengths.
Language Modeling: CoPE shows improved perplexity on the WikiText-103 benchmark compared to absolute PE.
Conclusion:
CoPE presents a significant advancement in position encoding for attention mechanisms. By making position encoding context-dependent, CoPE allows models to learn more nuanced and generalizable representations of positions within sequences, leading to improved performance on a variety of tasks.
Large language models (LLMs) working with sequential data such as text, audio, and code often need to understand the order of that data. When reading a passage, for instance, we need to know each word's position to interpret its meaning correctly. The attention mechanism by itself cannot capture order in a sequence, however, so position encoding (PE) is introduced to fill this gap.
Traditional PE methods encode each token's position directly as a vector and add it to the token's representation. This is simple and effective, but it cannot adapt the position information to context. For example, to reason about the i-th word of a sentence, a traditional PE can only encode that word's position within the sentence; it cannot account for where the word sits in the larger text.
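As one concrete instance of such a fixed, context-independent encoding, here is the sinusoidal scheme from the original Transformer (a sketch; learned absolute embeddings are the other common variant):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Fixed absolute position encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]        # token positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]
    freqs = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(freqs)              # even dimensions: sine
    pe[:, 1::2] = np.cos(freqs)              # odd dimensions: cosine
    return pe

# The encoding depends only on the token's index, never on its context;
# it would simply be added to the token embeddings before the first layer.
pe = sinusoidal_pe(seq_len=16, d_model=8)
```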
To address this, the paper introduces a new position encoding method: Contextual Position Encoding (CoPE). The core idea of CoPE is to tie position to context, adjusting the position encoding dynamically based on the surrounding content.
Why do we need contextual position encoding?
Imagine reading a long novel and wanting to know how many times a particular character appears. You might read word by word and tally every appearance. But if you want the count per chapter, you first have to locate where each chapter begins and ends before you can tally.
Traditional PE is like the word-by-word reading: it can only encode each token by its index in the sequence. CoPE is like first finding the chapter boundaries: it adjusts the position encoding dynamically based on context.
How CoPE Works
CoPE's operation can be summarized in the four steps laid out in the breakdown above: gate calculation, position calculation, position embedding interpolation, and attention calculation.
Advantages of CoPE
CoPE's main advantages, also summarized above, are contextualized position encoding and improved generalization to out-of-distribution inputs.
Experimental Results
The paper evaluates CoPE on several tasks, including the flip-flop, selective copy, counting, and language modeling tasks described above.
Summary
CoPE is a new position encoding method that adjusts position encodings dynamically based on context, so that they reflect a token's position in a sequence more faithfully. Its clear gains across multiple tasks point to strong practical value.
References
https://arxiv.org/pdf/2405.18719