Transformer Upgrade Path: Using Leaky ReRoPE in Reverse to Solve the Inference Cost Problem

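In last week's post, "Transformer Upgrade Path: 12. ReRoPE for Infinite Extrapolation?", we introduced ReRoPE and Leaky ReRoPE. Experiments showed that they extend an LLM's context length with almost no loss in training performance, and that they achieve the ideal "longer context, lower loss" property. ReRoPE in particular appears to handle context of unlimited length. However, these methods also increase the cost of inference. This post explores a new solution: using Leaky ReRoPE in reverse.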

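A Review of RoPE and Leaky ReRoPE

RoPE and Position Encoding

RoPE (Rotary Position Embedding) is formally an absolute position encoding, but in effect it implements relative position encoding. The corresponding relative position matrix is: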

\[
\left(\begin{array}{ccccccc}
0 & 1 & 2 & 3 & \cdots & L-2 & L-1 \\
-1 & 0 & 1 & 2 & \cdots & L-3 & L-2 \\
-2 & -1 & 0 & 1 & \cdots & L-4 & L-3 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
2-L & 3-L & 4-L & 5-L & \cdots & 0 & 1 \\
1-L & 2-L & 3-L & 4-L & \cdots & -1 & 0
\end{array}\right)
\]
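As a small illustration, the matrix above can be generated directly from its definition: the entry in row \( i \), column \( j \) is \( j - i \). The sketch below uses plain Python, and the function name `rope_relative_positions` is made up for this example.

```python
def rope_relative_positions(length):
    """Return a length x length matrix whose (i, j) entry is j - i."""
    return [[j - i for j in range(length)] for i in range(length)]


# Print the matrix for a toy sequence of length 4.
for row in rope_relative_positions(4):
    print(row)
```

The largest absolute value in the matrix is \( L-1 \), so it grows with the sequence length. That growth is what pushes positions out of the trained range at long context.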

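Leaky ReRoPE and Windowing

To preserve locality while keeping positions from going out of range at long context, Leaky ReRoPE changes the relative position matrix used at inference time to: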

\[
\left(\begin{array}{ccccccccc}
0 & 1 & 2 & \cdots & w-1 & w & w+\frac{1}{k} & \cdots & w+\frac{L-1-w}{k} \\
-1 & 0 & 1 & \cdots & w-2 & w-1 & w & \cdots & w+\frac{L-2-w}{k} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
-w+1 & -w+2 & -w+3 & \cdots & 0 & 1 & 2 & \cdots & w+\frac{L-2w}{k} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{array}\right)
\]

Here \( w \) is the window width, and \( k \) adjusts the maximum length that can be handled.
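One way to read the matrix: a relative distance inside the window is kept as is, and beyond the window each extra step advances the position by only \( 1/k \). The following is a minimal sketch, assuming the rule \( w + (d - w)/k \) for a distance \( d \ge w \). Mirroring that rule for negative distances is an assumption of this example, since the matrix does not spell out that side, and `leaky_position` is a hypothetical helper name.

```python
def leaky_position(d, w, k):
    """Map a relative distance d to its Leaky ReRoPE position.

    Inside the window (|d| < w) the distance is unchanged. Outside it,
    each extra step counts as 1/k. The negative side is mirrored, which
    is an assumption of this sketch.
    """
    sign = -1 if d < 0 else 1
    dist = abs(d)
    if dist < w:
        return d
    return sign * (w + (dist - w) / k)


w, k = 4, 2
print([leaky_position(d, w, k) for d in range(8)])
# [0, 1, 2, 3, 4.0, 4.5, 5.0, 5.5]
```

With \( k > 1 \) the positions beyond the window grow more slowly than the true distance, which is how a long context is squeezed into the trained range.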

ReRoPE's Unlimited Extension

ReRoPE takes the limit \( k \to \infty \) directly:

\[
\left(\begin{array}{ccccccccc}
0 & 1 & 2 & \cdots & w-1 & w & w & \cdots & w \\
-1 & 0 & 1 & \cdots & w-2 & w-1 & w & \cdots & w \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
-w+1 & -w+2 & -w+3 & \cdots & 0 & 1 & 2 & \cdots & w \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{array}\right)
\]
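Written per entry, for a non-negative relative distance \( d \) (a symbol introduced here only for illustration), the limit can be spelled out as:

\[
\lim_{k \to \infty}\left(w + \frac{d - w}{k}\right) = w \quad (d \ge w),
\qquad \text{so the position becomes } \min(d, w).
\]

Every distance beyond the window collapses to \( w \), so no position larger than \( w \) ever appears, whatever the sequence length. This is the sense in which ReRoPE appears to handle unlimited context.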

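Using Leaky ReRoPE in Reverse: Slower Training, Faster Inference

Why Reverse It?

The original ReRoPE and Leaky ReRoPE add computational cost at inference time. What if we flip this around, using Leaky ReRoPE during training and regular RoPE during inference? Would that solve the problem?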

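Experiments and Results

We ran the experiments with the combination "GAU + Deep Norm + Tiger + language model". Training used Leaky ReRoPE with \( k = 1/16, w = 128 \), and inference used regular RoPE. The test results are as follows: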

Test length	512 (training)	4096 (repeated)	4096 (non-repeated)
Baseline	49.41%	24.17%	23.16%
Baseline-logn	49.40%	24.60%	24.02%
NTK-RoPE-fixed	49.41%	51.86%	39.61%
NTK-RoPE-logn†-fixed	49.41%	55.94%	41.11%
NTK-RoPE-logn-fixed	49.40%	62.85%	44.14%
NTK-RoPE-mixed	49.41%	53.09%	40.12%
NTK-RoPE-logn†-mixed	49.41%	59.11%	42.38%
NTK-RoPE-logn-mixed	49.40%	68.91%	45.41%
ReRoPE-w256	49.41%	77.90%	48.48%
ReRoPE-w256-logn†	49.41%	82.40%	48.85%
ReRoPE-w256-logn	49.40%	85.12%	49.07%
InvLeaky ReRoPE-w128-logn	49.38%	82.25%	48.32%
InvLeaky ReRoPE-w128-b8-logn	49.62%	81.15%	48.85%
HFWA	48.70%	80.84%	48.15%

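Here, b8 means that the RoPE frequency base was changed from 10000 to 80000. As the table shows, InvLeaky ReRoPE ("Leaky ReRoPE → RoPE") does not match "RoPE → ReRoPE/Leaky ReRoPE", but it still beats HFWA. Because inference uses regular RoPE, existing acceleration techniques apply directly, so the approach remains quite competitive.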
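A quick sanity check suggests why these training settings can cover a 4096-token test. With \( k = 1/16 \), each step beyond the window advances the position by 16, so the positions seen during training reach far past the training length of 512. The sketch below assumes the largest training position is \( w + (L-1-w)/k \), read off from the first row of the Leaky ReRoPE matrix. `max_train_position` is a hypothetical helper name.

```python
def max_train_position(train_len, w, k):
    """Largest relative position Leaky ReRoPE produces at training time.

    Assumes the first-row, last-column entry w + (L - 1 - w) / k.
    """
    return w + (train_len - 1 - w) / k


largest = max_train_position(train_len=512, w=128, k=1 / 16)
print(largest)              # 6256.0
print(largest >= 4096 - 1)  # True
```

Regular RoPE at length 4096 needs relative positions only up to 4095, which lie below the 6256 reached during training under these assumptions.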

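Tuning and Training Speed

We did some simple tuning of \( k \), \( w \), and \( b \), and found that the best settings are essentially the two combinations above. Specifically:

\( k \) is set to the reciprocal of twice the expansion factor
\( w \) is set to one quarter of the training length
\( b \) is optionally multiplied by the expansion factor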
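The three rules above can be written as a small helper. This is only a sketch: the function name `inv_leaky_rerope_settings` and its arguments are invented for illustration, and the "expansion factor" is taken to mean target length divided by training length.

```python
from fractions import Fraction


def inv_leaky_rerope_settings(train_len, target_len, base=10000, scale_base=False):
    """Apply the three rules of thumb to get (k, w, b)."""
    factor = Fraction(target_len, train_len)  # expansion factor (assumed meaning)
    k = 1 / (2 * factor)                      # reciprocal of twice the factor
    w = train_len // 4                        # one quarter of the training length
    b = base * factor if scale_base else Fraction(base)  # optional base scaling
    return k, w, b


k, w, b = inv_leaky_rerope_settings(512, 4096, scale_base=True)
print(k, w, b)  # 1/16 128 80000
```

For a training length of 512 and a test length of 4096 this reproduces the settings used in the experiments: \( k = 1/16 \), \( w = 128 \), and, for the b8 variant, a base of 80000.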

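In the experiments above, the model has 100 million parameters and the training length is 512. The training time per 1000 steps rose from 330 seconds to 350 seconds, an increase of about 6%. GAU is part of the reason the overhead is this small: it uses single-head attention, which is inherently faster than multi-head attention. With multi-head attention or a longer training length the increase may be larger, but we estimate it would not exceed 50%, which is still acceptable.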

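Summary

This post proposed using Leaky ReRoPE "in reverse": training with Leaky ReRoPE at a larger step size lets inference fall back to regular RoPE, so inference speed is unchanged. The experiments show that this approach remains quite competitive. Future work could tune the parameter settings further to improve model performance.

We hope this post helps you better understand the reversed use of Leaky ReRoPE and its advantages. If you have any questions or suggestions, feel free to leave a comment.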
