Authors: Xin Xu ; Tong Xiao ; Zitong Chao ; Zhenya Huang ; Can Yang ; Yang Wang
Summary: Math Word Problems (MWPs) are crucial for evaluating the capability of Large Language Models (LLMs), with current research primarily focusing on questions with concise contexts. However, as real-world math problems often involve complex circumstances, LLMs’ ability to solve long MWPs is vital for their applications in these scenarios, yet remains under-explored. This study pioneers the exploration of Context Length Generalizability (CoLeG), the ability of LLMs to solve long MWPs. We introduce Extended Grade-School Math (E-GSM), a collection of MWPs with lengthy narratives. Two novel metrics are proposed to assess the efficacy and resilience of LLMs in solving these problems. Our examination of existing zero-shot prompting techniques and both proprietary and open-source LLMs reveals a general deficiency in CoLeG. To alleviate these challenges, we propose distinct approaches for different categories of LLMs. For proprietary LLMs, a new instructional prompt is proposed to mitigate the influence of long context. For open-source LLMs, a new data augmentation task is developed to improve CoLeG. Our comprehensive results demonstrate the effectiveness of our proposed methods, showing not only improved performance on E-GSM but also generalizability across several other MWP benchmarks. Our findings pave the way for future research in employing LLMs for complex, real-world applications, offering practical solutions to current limitations and opening avenues for further exploration of model generalizability and training methodologies.
近年来,随着人工智能技术的迅猛发展,大型语言模型(LLMs)在解决数学问题方面展现出了巨大的潜力。然而,当前的研究大多集中在那些背景简短的问题上。现实生活中的数学问题往往涉及复杂的叙述和背景,这对大型语言模型提出了更高的要求。本文将探讨LLMs在解决长篇数学文本问题(MWPs)方面的能力,并介绍一种名为E-GSM的新数据集及相关研究成果。
背景介绍
数学文本问题(MWPs)是以自然语言形式呈现的数学问题,需要精细的推理能力来解决。传统的数学问题数据集,如GSM8K. 通常包含简短的叙述,只有几句话。然而,这种设置与现实世界中的情况存在差异。现实中的数学问题往往有更长的背景,这可能会对数学推理过程产生影响。研究表明,长篇背景可能会阻碍而不是促进数学推理过程。✅
研究目的
本研究的主要目的是探讨LLMs在解决长篇数学文本问题(CoLeG,即Context Length Generalizability)的能力。为此,我们构建了一个名为Extended Grade-School Math(E-GSM)的数据集,这个数据集包含了从GSM8K扩展而来的长篇数学问题。我们还提出了两种新的指标来评估LLMs在解决这些问题时的效率和韧性。
研究方法
数据集构建
E-GSM数据集的构建过程主要包括以下几个步骤:
我们通过多轮扩展逐步增加问题的长度,最终获得了一个包含多个扩展轮次问题的综合数据集。
评估方法
我们使用七个专有LLMs和20个开源LLMs,以及三种最先进的零样本提示技术对E-GSM进行了评估。结果表明,LLMs在长篇数学文本问题上的表现较弱,尤其是在处理更长的背景时。
解决方案
为了解决这个问题,我们针对专有LLMs和开源LLMs分别提出了不同的策略:
研究结果
我们的实验结果表明,这些策略在E-GSM及其他多个MWP基准测试上都表现出了显著的效果和较强的泛化能力。具体来说,CoRe和扩展微调任务不仅提高了LLMs在E-GSM上的准确性,还展示了其在其他数学问题基准测试中的广泛适用性。
结论
LLMs在解决长篇数学文本问题上的能力对于其在现实世界应用中的重要性不言而喻。我们的研究表明,长篇数学问题会显著降低LLMs的数学推理能力。通过提出有针对性的解决方案,我们不仅改善了LLMs在长篇数学问题上的表现,也为未来研究提供了宝贵的方向和方法。
本研究为LLMs在复杂、真实世界应用中的使用铺平了道路,提供了实用的解决方案,并为模型泛化能力和训练方法的进一步探索开辟了新的途径。
本文参考了《Can LLMs Solve Longer Math Word Problems Better?》一文中的研究成果,旨在为读者提供关于大型语言模型在解决长篇数学文本问题方面最新进展的深入洞见。希望通过这篇文章,您能对这一前沿领域有更清晰的认识。
Can LLMs Solve longer Math Word Problems Better?
https://papers.cool/arxiv/2405.14804
Authors: Xin Xu ; Tong Xiao ; Zitong Chao ; Zhenya Huang ; Can Yang ; Yang Wang
Summary: Math Word Problems (MWPs) are crucial for evaluating the capability of Large Language Models (LLMs), with current research primarily focusing on questions with concise contexts. However, as real-world math problems often involve complex circumstances, LLMs’ ability to solve long MWPs is vital for their applications in these scenarios, yet remains under-explored. This study pioneers the exploration of Context Length Generalizability (CoLeG), the ability of LLMs to solve long MWPs. We introduce Extended Grade-School Math (E-GSM), a collection of MWPs with lengthy narratives. Two novel metrics are proposed to assess the efficacy and resilience of LLMs in solving these problems. Our examination of existing zero-shot prompting techniques and both proprietary and open-source LLMs reveals a general deficiency in CoLeG. To alleviate these challenges, we propose distinct approaches for different categories of LLMs. For proprietary LLMs, a new instructional prompt is proposed to mitigate the influence of long context. For open-source LLMs, a new data augmentation task is developed to improve CoLeG. Our comprehensive results demonstrate the effectiveness of our proposed methods, showing not only improved performance on E-GSM but also generalizability across several other MWP benchmarks. Our findings pave the way for future research in employing LLMs for complex, real-world applications, offering practical solutions to current limitations and opening avenues for further exploration of model generalizability and training methodologies.