While the Neural Linear Bandit (NLB) is simple, directly incorporating it into industrial training and serving pipelines presents challenges in three main respects:
Experiments
We conducted a series of live A/B experiments on a large-scale short-form video recommendation platform to evaluate the performance of the ranking system built on the Neural Linear Bandit. We also examine the properties and reliability of the uncertainty measurement.
We first ran a user-diverted A/B test, with 0.3% of traffic in each of the control and treatment arms, for six weeks to observe user metrics. The control arm is the existing production ranking model; the treatment arm is the exploration-based ranking system using the Neural Linear Bandit. For NLB, as described in Section 5.2, we update the covariance matrix in a streaming fashion, while the precision matrix Σ† is recomputed offline at each training run to stay consistent with the training pipeline. To ensure the stability of the matrix inversion, we set the regularization hyperparameter to ϵ = 1e−6 (Equation 4). To choose the noise parameter σ², we computed the uncertainty from an ensemble of 5 separately trained models (an expensive ground-truth measurement) and selected the constant hyperparameter σ² = 10 so that the uncertainties obtained from the ensemble and from the Neural Linear Bandit are roughly on the same order of magnitude.
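To make the bookkeeping concrete, here is a minimal sketch of the scheme described above, assuming the feature vector φ is the last-layer embedding of a (𝐮, 𝐚) pair; the class name, method names, and the exact form of the uncertainty term are illustrative assumptions, not the production implementation:

```python
import numpy as np

EPSILON = 1e-6   # regularization for stable inversion (Equation 4)
SIGMA_SQ = 10.0  # noise parameter, calibrated against the 5-model ensemble

class NeuralLinearBandit:
    """Illustrative bookkeeping for the NLB uncertainty term."""

    def __init__(self, dim: int):
        self.dim = dim
        self.cov = np.zeros((dim, dim))  # covariance statistics, updated in streaming
        self.precision = np.eye(dim)     # Σ†, recomputed offline at each training run

    def streaming_update(self, phi: np.ndarray) -> None:
        # Accumulate the outer product of each observed feature vector online.
        self.cov += np.outer(phi, phi)

    def offline_refresh(self) -> None:
        # Invert with an ϵ ridge term so the inversion stays well-conditioned.
        self.precision = np.linalg.inv(self.cov + EPSILON * np.eye(self.dim))

    def uncertainty(self, phi: np.ndarray) -> float:
        # Quadratic uncertainty term for one (user, item) pair.
        return float(np.sqrt(SIGMA_SQ * phi @ self.precision @ phi))
```

Under this reading, calibrating σ² amounts to comparing uncertainty(φ) against the prediction variance of the ensemble on the same pairs and scaling until the two agree in order of magnitude.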
Performance of the Neural Linear Bandit: Content Freshness and User Satisfaction
Intuitively, an uncertainty-based exploration system such as NLB exposes more fresh and tail content, which shifts the overall content corpus distribution and acquires valuable learning signals from these regions, which in turn translate into gains in user engagement.
Table 2 reports the increase in positive interactions on fresh content published within different time windows. The time buckets in the header row (e.g., 1 hour) group content by freshness level. The significant increase in positive interactions across all freshness levels demonstrates that exploration helps the system surface fresh content effectively and acquire valuable learning signals. Interestingly, we also observe a steady increase in the number of satisfied daily active users over time, as shown in Figure 8. We hypothesize that this lift comes from two sources. First, the system helps users discover novel interests: we also see a +1.25% increase in the number of unique topics on which users provide positive interactions. Second, users prefer seeing fresh content on this particular surface, which is dedicated to short-form video content.
Properties and Reliability of the Uncertainty Estimates
A key component of the Neural Linear Bandit is the quadratic uncertainty term, which captures the strength of exploration for different (𝐮, 𝐚) pairs. While it is quantifiable in theory, visualizing how the uncertainty varies across users and content types remains an interesting question. To examine this, we select three representative features. Two capture content properties: 1) days since the content was published (i.e., content age); 2) lifetime number of positive interactions (i.e., content popularity). One captures a user property: 3) total number of interactions the user has provided on the platform (i.e., user activity level).
We use the Spearman rank correlation coefficient, which assesses the monotonic relationship between two variables, to measure the relationship between these features and the uncertainty term computed by the Neural Linear Bandit. Table 3 reports the resulting correlations for the three selected features. Interestingly, we observe that the current system is more uncertain about fresh and less popular content, while remaining roughly neutral across users of different activity levels. We also computed the Spearman rank correlations between the features and the uncertainty obtained from the ensemble model, which yielded -0.3 for the content features and 0 for the user feature. These results are close to those computed from the Neural Linear Bandit, indicating the reliability of its uncertainty estimates.
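For reference, this statistic is available as scipy.stats.spearmanr; the sketch below uses random stand-ins for the logged feature and uncertainty values, so only the mechanics are meaningful:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
content_age_days = rng.integers(0, 365, size=10_000)  # stand-in content-age feature
uncertainty = rng.random(10_000)                       # stand-in NLB uncertainty term

# spearmanr measures the monotonic association between the two variables.
rho, p_value = spearmanr(content_age_days, uncertainty)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```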
Effect of the Neural Linear Bandit on Corpus Metrics
To examine the exploration capability of the Neural Linear Bandit, i.e., how it benefits the size of the content corpus, we ran a 5% user-corpus-co-diverted experiment, diverting 5% of the corpus and 5% of the users into each of the control and treatment arms. For the exploration-based ranking system built on the Neural Linear Bandit, we see a +5.33% increase in Discoverable Corpus@100,7 and a +5.66% increase in Discoverable Corpus@1000,7. Compared with the exploitation-based system, the Neural Linear Bandit distributes exposure across content more equitably; in particular, the improvement in the post-exploration metrics indicates higher discoverability of tail content.
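As a rough sketch of how such a metric can be computed, the function below assumes Discoverable Corpus@X,Y counts the items that accumulate at least X positive interactions within a trailing Y-day window; this reading of the metric, and the event-tuple format, are assumptions for illustration:

```python
from collections import Counter
from datetime import date, timedelta

def discoverable_corpus(events, threshold: int, window_days: int, end: date) -> int:
    """Count items with >= `threshold` positive interactions in the
    trailing `window_days`-day window ending at `end`.

    `events` is an iterable of (item_id, event_date, is_positive) tuples.
    """
    start = end - timedelta(days=window_days)
    counts = Counter(
        item_id
        for item_id, event_date, is_positive in events
        if is_positive and start < event_date <= end
    )
    return sum(1 for c in counts.values() if c >= threshold)

# e.g. Discoverable Corpus@100,7 as of today (hypothetical `logged_events`):
# discoverable_corpus(logged_events, threshold=100, window_days=7, end=date.today())
```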
Discussion and Future Work
In this paper, we presented a systematic study of the long-term value of exploration to users through an important intermediary in the system: the content corpus. We addressed the measurement challenge by designing new metrics and experimentation frameworks to capture the benefit of exploration on the content corpus, and established the connection between discoverable corpus growth and gains in long-term user satisfaction. We validated these through extensive real-world live experiments on a large commercial recommendation platform and reported our findings.
We further studied the Neural Linear Bandit algorithm for building an uncertainty-based exploration system in production. It is worth noting that the current setup is tailored to single-task prediction and exploration. In contrast, most modern recommender systems are designed to capture multiple rich sources of feedback and typically employ multi-task learning in practice. How to explore effectively in these more complex multi-task settings is an interesting direction for future work.