MMLU: Are We Really Done With It?

The emergence of large language models (LLMs) marks major progress in natural language processing, allowing us to interact with computers through natural language. Evaluating these models, however, requires reliable benchmarks, and the existing ones suffer from a number of problems.

MMLU: A Popular but Flawed Benchmark

The MMLU (Massive Multitask Language Understanding) benchmark has attracted wide attention because it covers knowledge from many domains, including mathematics, history, computer science, logic, and law. Despite its popularity, however, MMLU contains a large number of errors, and these errors can mislead model evaluation and comparison.
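To make the format concrete, here is a minimal sketch of inspecting a single MMLU item with the Hugging Face `datasets` library. The dataset ID `cais/mmlu` and the field names (`question`, `choices`, `answer`) reflect the commonly used Hugging Face hosting and are assumptions, not something stated in this article.

```python
# Minimal sketch: load one MMLU subset and print a question in multiple-choice form.
# Assumes the commonly used Hugging Face hosting ("cais/mmlu") and its field names.
from datasets import load_dataset

virology = load_dataset("cais/mmlu", "virology", split="test")
item = virology[0]

print(item["question"])
for i, choice in enumerate(item["choices"]):
    print(f"  ({chr(65 + i)}) {choice}")
print("labelled answer:", chr(65 + item["answer"]))  # `answer` is an integer index
```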

Errors in MMLU: A Problem That Needs Fixing

The researchers found a wide range of errors in MMLU, from simple parsing and scraping mistakes to deeper problems with context, interpretation, and dataset quality. In the Virology subset, for example, 57% of the questions contain errors, and some of the labelled answers even suggest deploying the US military to West Africa to stop an Ebola outbreak.
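As a rough illustration of what such an error scheme might look like in code, the sketch below encodes one possible taxonomy as a Python enum. The category names and the annotation record are illustrative and may not match the paper's exact labels or schema.

```python
from enum import Enum

# Illustrative error taxonomy for auditing benchmark items; the paper's exact
# category names may differ, but the split is the same: surface corruption on one
# side, context/interpretation/ground-truth problems on the other.
class ErrorType(Enum):
    OK = "no error"
    BAD_PARSING = "question or options corrupted during scraping/parsing"
    MISSING_CONTEXT = "unanswerable without context that was dropped"
    AMBIGUOUS = "more than one option is defensible"
    WRONG_GROUND_TRUTH = "the labelled answer key is wrong"
    NO_CORRECT_ANSWER = "none of the listed options is correct"

# A single re-annotated record might then look like this (all fields hypothetical):
annotation = {
    "subset": "virology",
    "question_id": 42,
    "error_type": ErrorType.WRONG_GROUND_TRUTH,
    "corrected_answer": "C",  # present only when a correct option actually exists
}
```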

MMLU-Redux: A More Reliable Benchmark

To address the errors in MMLU, the researchers manually analyzed the dataset and created MMLU-Redux, a set of 3,000 manually re-annotated questions covering 30 MMLU subsets. They found that results on MMLU-Redux differ markedly from the original MMLU evaluations, showing that the errors in MMLU have a substantial effect on reported model performance.
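A natural way to use such re-annotations is to re-score existing model predictions against the corrected labels, skipping items whose errors cannot be repaired. The sketch below does exactly that; the field names (`error_type`, `corrected_answer`, `question_id`) are hypothetical stand-ins for whatever schema the released data actually uses.

```python
# Re-score predictions against re-annotated labels (field names are hypothetical).
def redux_accuracy(predictions: dict[int, str], redux_items: list[dict]) -> float:
    correct, total = 0, 0
    for item in redux_items:
        # Items whose errors cannot be repaired are excluded from scoring.
        if item["error_type"] in {"bad_parsing", "no_correct_answer"}:
            continue
        gold = item.get("corrected_answer") or item["answer"]
        total += 1
        correct += int(predictions[item["question_id"]] == gold)
    return correct / total if total else 0.0
```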

Re-evaluating LLMs with MMLU-Redux

The creation of MMLU-Redux gives us a tool for re-assessing LLM performance. The researchers found that several LLMs score noticeably differently on MMLU-Redux than under the original MMLU evaluation, showing that the errors in MMLU can change model rankings.
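One way to quantify how much the leaderboard moves is to compute a rank correlation between per-model accuracies under the two label sets, as in the sketch below; the accuracy dictionaries are placeholders to be filled from real evaluation runs.

```python
# How much does the ranking change? Compare per-model accuracies under the original
# and re-annotated labels with a rank correlation (1.0 = identical ordering).
from scipy.stats import kendalltau

def leaderboard_shift(acc_original: dict[str, float], acc_redux: dict[str, float]) -> float:
    models = sorted(acc_original)
    tau, _ = kendalltau(
        [acc_original[m] for m in models],
        [acc_redux[m] for m in models],
    )
    return tau
```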

Automatically Repairing MMLU: A Challenge

The researchers also tried using LLMs to repair MMLU's errors automatically, with several methods including zero-shot prompting, few-shot prompting, chain-of-thought prompting, and retrieval-augmented generation. Even the strongest models, however, remain limited at automatic error detection.
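For a sense of what zero-shot error detection looks like in practice, here is a hedged sketch: the model is simply asked whether the labelled answer is correct. The `ask_llm` callable is a placeholder for whichever chat/completions client you actually use, and the prompt wording is illustrative rather than the paper's.

```python
# Zero-shot error detection sketch: ask an LLM whether the labelled answer is correct.
# `ask_llm` is a placeholder callable (prompt -> reply string) for your own client.
def build_zero_shot_prompt(question: str, choices: list[str], labelled: str) -> str:
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        "You are auditing a multiple-choice benchmark question.\n"
        f"Question: {question}\n{options}\n"
        f"Labelled answer: ({labelled})\n"
        "Is the labelled answer correct? Reply 'ok' or 'error', then a one-line reason."
    )

def detect_error(item: dict, ask_llm) -> bool:
    prompt = build_zero_shot_prompt(item["question"], item["choices"], item["answer"])
    return ask_llm(prompt).strip().lower().startswith("error")
```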

Conclusion: MMLU Needs to Improve

MMLU is an important benchmark, but it has real problems. MMLU-Redux offers a more reliable alternative, and the researchers call on the community to work together to improve MMLU so that it can serve as a trustworthy tool for evaluating the next generation of LLMs.
