With the rapid development of large language models (LLMs), LLM-based conversational systems such as chatbots have made remarkable progress in recent years. However, these systems also raise new challenges: they may have negative effects on users and society. It is therefore crucial to establish an effective evaluation framework that can detect such potential negative effects in a timely manner and quantify the positive ones.
Six Requirements for an Evaluation Framework
An ideal evaluation framework should satisfy at least the following six requirements:
The SWAN Framework: Nugget-Based Evaluation
To meet the above requirements, this paper proposes an evaluation framework called SWAN (Schematised Weighted Average Nugget scores), whose main features are as follows:
Nugget Weights
Nugget weights are analogous to the rank-based decay used in information retrieval metrics such as nDCG, but a nugget's weight need not decrease monotonically with its position. For example, a linear decay function based on the S-measure assumes that a nugget's realised value declines as the conversation proceeds (i.e., a shorter conversation that satisfies the information need sooner earns a higher reward), whereas an alternative approach assigns positive weights only to nuggets from the final turn of the conversation, modelling a recency effect. Factors such as anchoring can also be accommodated, whereby the "nuggets seen so far" influence the weight of the current nugget.
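The two weighting schemes mentioned above can be sketched as simple position-weight functions. This is a minimal illustration: the function names, signatures, and the particular linear-decay form are assumptions made here, not definitions from the paper.

```python
# Illustrative nugget position-weight functions (hypothetical names and forms).

def linear_decay_weight(position: int, max_positions: int) -> float:
    """S-measure-style linear decay: a nugget appearing earlier in the
    conversation receives a higher weight, so a shorter conversation that
    satisfies the information need sooner scores higher."""
    return max(0.0, 1.0 - position / max_positions)

def last_turn_only_weight(position: int, last_turn_start: int) -> float:
    """Recency-style weighting: only nuggets from the final turn of the
    conversation receive positive weight."""
    return 1.0 if position >= last_turn_start else 0.0
```

Note that last_turn_only_weight is an example of a weighting that is not monotonically decreasing in position, which is the point the text makes.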
SWAN Score
The SWAN score is defined as:

SWAN = Σ_{c∈C} CW_c · WAN_c(U_c)

where C is the set of evaluation criteria (i.e., the schema), CW_c is the weight of criterion c, U_c is the set of nuggets extracted from the conversation sample for criterion c, and WAN_c(U_c) is the weighted average nugget score for criterion c.
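The definition above can be sketched in a few lines of code. This is a minimal sketch that assumes WAN_c is computed as a weight-normalised average of per-nugget scores; the criterion names and the nugget (weight, score) pairs below are hypothetical data for illustration.

```python
# A minimal sketch of SWAN = sum over criteria c of CW_c * WAN_c(U_c).

def weighted_average_nugget_score(nuggets):
    """WAN_c: weight-normalised average of per-nugget scores for one
    criterion (an assumption of this sketch). Each nugget is a
    (nugget_weight, nugget_score) pair."""
    total_weight = sum(w for w, _ in nuggets)
    if total_weight == 0:
        return 0.0
    return sum(w * s for w, s in nuggets) / total_weight

def swan_score(criterion_weights, nuggets_by_criterion):
    """SWAN: criterion-weighted sum of the per-criterion WAN scores."""
    return sum(
        cw * weighted_average_nugget_score(nuggets_by_criterion.get(c, []))
        for c, cw in criterion_weights.items()
    )

# Hypothetical schema with two equally weighted criteria:
criterion_weights = {"factuality": 0.5, "safety": 0.5}
nuggets = {
    "factuality": [(1.0, 1.0), (1.0, 0.0)],  # (nugget weight, nugget score)
    "safety": [(1.0, 1.0)],
}
print(swan_score(criterion_weights, nuggets))  # → 0.75
```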
Twenty Evaluation Criteria
This paper proposes twenty evaluation criteria that can be plugged into the SWAN framework. They cover many aspects of conversational system evaluation, for example:
Summary
This paper introduced SWAN, a framework for evaluating conversational systems that applies to both task-oriented and non-task-oriented conversations, together with twenty evaluation criteria that can serve as plug-ins to the framework. As future work, we will design conversation sampling methods suited to the various criteria, construct seed user turns for comparing multiple systems, and validate specific instantiations of SWAN, with the goal of preventing conversational systems from harming users and society.