Large language models (LLMs) excel at generating plausible answers to a diverse range of queries, representing a major leap for machine learning models. In customer support applications, however, they face challenges such as a tendency to hallucinate and the risk of data leakage. This article looks at how LLMs can be used to augment customer support by re-framing the language modeling task as a discriminative classification task.
Authors: Dean Wyatte ; Fatemeh Tahmasbi ; Ming Li ; Thomas Markovich
Summary: Modern large language models (LLMs) represent a paradigm shift in what can plausibly be expected of machine learning models. The fact that LLMs can effectively generate sensible answers to a diverse range of queries suggests that they would be useful in customer support applications. While powerful, LLMs have been observed to be prone to hallucination, which unfortunately makes their near-term use in customer support applications challenging. To address this issue we present a system that allows us to use an LLM to augment our customer support advocates by re-framing the language modeling task as a discriminative classification task. In this framing, we seek to present the top-K best template responses for a customer support advocate to use when responding to a customer. We present the results of both offline and online experiments, where we observed offline gains and statistically significant online lifts for our experimental system. Along the way, we present observed scaling curves for validation loss and top-K accuracy resulting from model parameter ablation studies. We close by discussing the space of trade-offs with respect to model size, latency, and accuracy, and suggest future applications to explore.
Problem Background and Research Goals
Although LLMs excel at generating plausible answers to a diverse range of queries, their near-term use in customer support is challenging: hallucinated answers and the risk of data leakage limit their direct application. To address these problems, the paper presents a system that re-frames the language modeling task as a discriminative classification task, helping customer support advocates choose the best template responses.
Methodology: A Two-Stage Training Pipeline
To use LLMs effectively for customer support, the paper proposes a two-stage training pipeline:
Domain Adaptive Pre-training
First, a pre-trained LLM is further pre-trained on data from the target domain. The paper continues pre-training on Cash App customer support transcripts, which helps the model learn domain-specific language and context.
Discriminative Fine-tuning
On top of the domain-adapted model, a new linear layer is added and fine-tuned end-to-end on a smaller dataset labeled with the template responses chosen by support advocates, producing the final classifier.
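The second stage can be sketched in isolation: a linear softmax layer mapping LLM hidden states to template classes, trained with cross-entropy. This is a minimal NumPy sketch, not the paper's implementation; in the paper the head is fine-tuned end-to-end with the LLM, whereas here random synthetic features stand in for the frozen domain-adapted representations, and all dimensions are illustrative.

```python
import numpy as np

# Minimal sketch of discriminative fine-tuning: a linear layer over
# (stand-in) LLM features, trained with softmax cross-entropy.
# Dimensions and data are synthetic, not from the paper.
rng = np.random.default_rng(0)
n_samples, hidden_dim, n_templates = 512, 64, 10

# Stand-in for pooled LLM hidden states; labels are made linearly
# predictable so the toy head has something to learn.
features = rng.normal(size=(n_samples, hidden_dim))
true_w = rng.normal(size=(hidden_dim, n_templates))
labels = (features @ true_w).argmax(axis=1)

W = np.zeros((hidden_dim, n_templates))
b = np.zeros(n_templates)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.5
for _ in range(300):
    grad = softmax(features @ W + b)            # predicted probabilities
    grad[np.arange(n_samples), labels] -= 1.0   # dL/dlogits for cross-entropy
    grad /= n_samples
    W -= lr * (features.T @ grad)
    b -= lr * grad.sum(axis=0)

train_acc = ((features @ W + b).argmax(axis=1) == labels).mean()
```

At serving time the same logits would simply be sorted to surface the top-K templates to the advocate.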
Dataset Preparation and Model Selection
Dataset Preparation
The dataset is built from Cash App customer support transcripts, processed to remove personally identifiable information (PII) to ensure data safety and privacy.
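A PII-removal pass might look like the sketch below. The paper does not specify its scrubbing pipeline, so the patterns and replacement tokens here are assumptions covering a few common identifiers; production systems typically combine many more patterns with learned detectors.

```python
import re

# Illustrative PII scrubbing; patterns and tokens are assumptions,
# not the paper's actual pipeline.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),          # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),            # card-like digit runs
    (re.compile(r"\b\d{3}[- .]?\d{3}[- .]?\d{4}\b"), "<PHONE>"),  # US-style phone numbers
]

def scrub_pii(text: str) -> str:
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

example = "Contact me at jane.doe@example.com or 555-123-4567."
clean = scrub_pii(example)  # "Contact me at <EMAIL> or <PHONE>."
```

Replacing spans with typed tokens (rather than deleting them) preserves sentence structure for the downstream language model.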
Model Selection
The paper uses the Pythia family of LLMs, based on the GPT-NeoX architecture, which were pre-trained on a large corpus of general web data.
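For context on the scaling ablations, the Pythia suite spans roughly 70M to 12B parameters. The sketch below lists those sizes and uses the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass; which sizes the paper actually deployed, and its exact FLOP accounting, are not claimed here.

```python
# Parameter counts for the public Pythia suite (GPT-NeoX architecture).
# The ~2N FLOPs-per-token forward estimate is a standard approximation,
# not the paper's measured cost.
PYTHIA_PARAMS = {
    "pythia-70m": 70e6,
    "pythia-160m": 160e6,
    "pythia-410m": 410e6,
    "pythia-1b": 1.0e9,
    "pythia-1.4b": 1.4e9,
    "pythia-2.8b": 2.8e9,
    "pythia-6.9b": 6.9e9,
    "pythia-12b": 12e9,
}

def forward_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params
```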
Experimental Design and Results
Offline Training and Evaluation
Model performance and efficiency are evaluated across scales using metrics such as FLOPs, language modeling loss, and classification loss, analyzing the relationship between model size, training data volume, and model performance.
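One of the paper's headline metrics, top-K accuracy, is simple to state: a prediction counts as correct if the template the advocate actually chose appears among the K highest-scoring templates. A minimal sketch, with toy scores and labels:

```python
# Top-K accuracy for the template classifier: correct if the chosen
# template id is among the K highest-scoring templates.
def top_k_accuracy(scores, labels, k):
    """scores: per-example lists of template scores; labels: chosen template ids."""
    hits = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Toy scores over 3 templates for 3 interactions.
scores = [
    [0.1, 0.7, 0.2],  # top-1 prediction: template 1
    [0.5, 0.2, 0.3],  # top-1 prediction: template 0
    [0.2, 0.3, 0.5],  # top-1 prediction: template 2
]
labels = [1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)  # 2/3: misses the second example
top2 = top_k_accuracy(scores, labels, k=2)  # 3/3: template 2 is second-best there
```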
Online Case Study
The model is deployed in a production customer support system to evaluate its real-world effectiveness. Predictions are withheld from a randomly selected 2% of support interactions to measure the system's influence on which templates advocates choose.
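One common way to implement such a holdout is deterministic hashing of an interaction identifier into percentage buckets, so an interaction's assignment is stable across requests. The paper does not specify its assignment mechanism; this is a sketch of that standard approach, with a hypothetical `interaction_id` key.

```python
import hashlib

# Sketch of a deterministic 2% holdout: hash each interaction id into
# 100 buckets and withhold predictions for buckets below the threshold.
# The assignment mechanism is an assumption, not the paper's.
HOLDOUT_PCT = 2

def in_holdout(interaction_id: str, pct: int = HOLDOUT_PCT) -> bool:
    digest = hashlib.sha256(interaction_id.encode()).hexdigest()
    return int(digest, 16) % 100 < pct

ids = [f"interaction-{i}" for i in range(10_000)]
holdout_share = sum(in_holdout(i) for i in ids) / len(ids)  # close to 0.02
```

Hash-based assignment avoids storing per-interaction flags and guarantees the same interaction never flips between arms mid-conversation.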
A/B Testing and Response-Time Savings Analysis
Different versions of the model are A/B tested to assess the impact of model updates on support-efficiency metrics. The effect of model predictions on the time advocates need to select the correct template is evaluated and compared against interactions where no template is used.
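Whether an online lift is statistically significant can be checked with a standard two-proportion z-test, sketched below. The counts, the "template acceptance rate" metric name, and the choice of test are illustrative assumptions; the paper reports significant lifts but this is not its specific analysis.

```python
import math

# Two-proportion z-test on a made-up acceptance-rate lift
# (control vs. treatment); not the paper's actual metrics or test.
def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical: 30% vs. 33% template acceptance on 10k sessions per arm.
z, p = two_proportion_z(3000, 10000, 3300, 10000)
```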
Practical Deployment Considerations
The paper discusses practical considerations for putting these models into production, including model update strategy, latency requirements, and impact on business metrics. It examines the trade-offs between model size, latency, and accuracy, and offers guidance on choosing model parameters for different requirements.
Future Research Directions
The paper closes by suggesting possible future research directions and points worth further exploration.
Conclusion
By re-framing the language modeling task as a discriminative classification task, the paper presents a novel way to use LLMs to augment customer support. Through domain-adaptive pre-training and discriminative fine-tuning, model performance on the target task improves significantly while avoiding the risks of hallucination and data leakage. The proposed future directions leave ample room to further improve the use of LLMs in customer support and other domains.
Scaling Laws for Discriminative Classification in Large Language Models
https://papers.cool/arxiv/2405.15765