In recent years, a growing number of web applications have adopted machine learning models to deliver personalized services that match user preferences. Conversion rate (CVR) estimation is a fundamental module in online recommendation and advertising systems: given that a user has clicked an ad, it predicts the probability of a subsequent conversion event (e.g., a purchase in e-commerce advertising). CVR estimation plays a crucial role in candidate ranking and ad bidding strategies.
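Concretely, CVR estimation can be framed as binary classification over clicked impressions. The following is an illustrative sketch only, using synthetic data and plain logistic regression rather than any production model:

```python
import numpy as np

# Illustrative sketch: CVR estimation as binary classification over
# clicked impressions. Data and model here are synthetic, not FedAds'.
rng = np.random.default_rng(0)

n, d = 1000, 8
X = rng.normal(size=(n, d))                   # features of clicked impressions
true_w = rng.normal(size=d)
p_true = 1.0 / (1.0 + np.exp(-X @ true_w))    # ground-truth conversion probability
y = (rng.random(n) < p_true).astype(float)    # conversion labels (1 = converted)

# Plain logistic regression trained with gradient descent on the log loss.
w = np.zeros(d)
for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (pred - y) / n)         # gradient step on mean log loss

pred = 1.0 / (1.0 + np.exp(-X @ w))
log_loss = -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))
```

The fitted model's log loss should come in well below the roughly 0.693 achieved by always predicting 0.5 on balanced labels.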
The Data Privacy Challenge
In online advertising, after browsing and clicking an ad on a publisher's page, the user is redirected to the ad's landing page, where their subsequent behaviors, including conversion decisions, are collected. The publisher thus owns the user's browsing interests and click feedback, while the demand-side advertising platform collects post-click behaviors such as dwell time and conversion decisions. To estimate CVR accurately while better protecting data privacy, vertical federated learning (vFL) [35, 40] emerges as a natural solution: it trains a model that combines the strengths of both parties without exchanging raw data.
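The two-party setup above can be sketched in the split-learning style [35]: the publisher (party A) transmits only an intermediate embedding, never raw features, and receives back only the gradient with respect to that embedding from the label-holding platform (party B). All dimensions and names below are illustrative assumptions:

```python
import numpy as np

# Hypothetical two-party vFL forward/backward pass (split-learning style).
# Party A (publisher) holds features; party B (ad platform) holds labels.
rng = np.random.default_rng(1)

n, d_a, d_emb = 64, 16, 4
X_a = rng.normal(size=(n, d_a))               # party A's raw features (stay local)
y = rng.integers(0, 2, size=n).astype(float)  # party B's conversion labels

W_a = rng.normal(size=(d_a, d_emb)) * 0.1     # party A's bottom model
w_b = rng.normal(size=d_emb) * 0.1            # party B's top model

# --- forward: A -> B (embedding only, not X_a) ---
emb = X_a @ W_a
pred = 1.0 / (1.0 + np.exp(-emb @ w_b))

# --- backward: B -> A (gradient w.r.t. the embedding only) ---
d_logit = (pred - y) / n
grad_w_b = emb.T @ d_logit                    # B updates its top model locally
grad_emb = np.outer(d_logit, w_b)             # the only thing sent back to A
grad_W_a = X_a.T @ grad_emb                   # A updates its bottom model locally

W_a -= 0.1 * grad_W_a
w_b -= 0.1 * grad_w_b
```

Note that the exchanged quantities (`emb` and `grad_emb`) are exactly the channels that label-leakage attacks target [8, 20], which motivates the privacy evaluations and defenses discussed below.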
However, standardized datasets and systematic evaluation methodologies are still lacking. In the absence of standard datasets, existing studies typically simulate vFL settings by hand-crafting feature splits over public datasets, which makes fair comparison difficult.
FedAds: A Benchmark for CVR Estimation under Vertical Federated Learning
To address this issue, we introduce FedAds, the first benchmark for CVR estimation with privacy-preserving vFL, designed to facilitate standardized and systematic evaluation of vFL algorithms. FedAds comprises a large-scale real-world dataset collected from Alibaba's advertising platform, along with systematic evaluations of the effectiveness and privacy of various neural-network-based vFL algorithms.
Key Contributions of FedAds
FedAds aims to benefit future research on both vFL algorithms and CVR estimation.
Main Components of FedAds
Improving vFL Effectiveness and Privacy
Experimental Evaluation
We systematically evaluate various vFL models in terms of both effectiveness and privacy.
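Both axes can be summarized with AUC: effectiveness as the AUC of CVR predictions against the true conversion labels, and privacy as the "leak AUC" of an attacker's label-inference scores [8, 20], where values near 0.5 indicate the attacker learns little. A minimal rank-based AUC routine, with hand-made scores for illustration:

```python
import numpy as np

def auc(labels, scores):
    """Rank-based AUC: probability a random positive outranks a random negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # 1-based ranks, ascending
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Mann-Whitney U statistic normalized to [0, 1]
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = np.array([1, 0, 1, 1, 0, 0])
model_scores = np.array([0.9, 0.2, 0.8, 0.6, 0.3, 0.1])   # model ranks perfectly
attack_scores = np.array([0.5, 0.4, 0.3, 0.6, 0.7, 0.2])  # attacker barely beats chance
```

Here `auc(labels, model_scores)` is 1.0 (every positive outranks every negative), while the attacker's leak AUC is only 5/9, close to the 0.5 of random guessing.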
Conclusion and Future Work
We presented FedAds, the first benchmark for privacy-preserving CVR estimation, designed to facilitate systematic evaluation of vFL algorithms. FedAds includes a large-scale real-world dataset from Alibaba's advertising platform, together with systematic evaluations of the effectiveness and privacy of various neural-network-based vFL algorithms. Moreover, to incorporate unaligned data and improve vFL effectiveness, we explored generating feature representations of unaligned samples with generative models. For stronger privacy protection, we also developed perturbation methods based on mixup and projection operations. Experiments show that these methods achieve reasonable performance.
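The mixup- and projection-based perturbations mentioned above can be sketched roughly as follows. The exact formulation in FedAds may differ; all parameters here (the Beta(0.4, 0.4) mixing coefficient, the subspace rank k = 4) are illustrative assumptions:

```python
import numpy as np

# Rough sketch of perturbing forward embeddings before transmission:
# mixup blends each embedding with a randomly paired one [42], and a
# random low-rank projection further obscures individual coordinates.
rng = np.random.default_rng(42)

n, d = 32, 8
emb = rng.normal(size=(n, d))          # embeddings to be sent to the label party

# mixup: convex combination with a shuffled copy of the batch
lam = rng.beta(0.4, 0.4)
perm = rng.permutation(n)
mixed = lam * emb + (1 - lam) * emb[perm]

# projection: keep only the component inside a random rank-k subspace (k < d)
k = 4
P = rng.normal(size=(d, k))
Q, _ = np.linalg.qr(P)                 # orthonormal basis of the subspace
projected = mixed @ Q @ Q.T            # rank of this matrix is at most k
```

Because the receiving party sees only `projected`, each transmitted row is a blend of two samples confined to a k-dimensional subspace, which is the kind of obfuscation the mixup and projection operations aim for.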
We will explore further directions along these lines in future work.
References
[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 308–318.
[2] Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, Pedro PB de Gusmão, and Nicholas D Lane. 2020. Flower: A friendly federated learning research framework. arXiv preprint arXiv:2007.14390 (2020).
[3] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. 2018. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097 (2018).
[4] Timothy Castiglia, Shiqiang Wang, and Stacy Patterson. [n.d.]. Self-Supervised Vertical Federated Learning. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022).
[5] Daoyuan Chen, Dawei Gao, Weirui Kuang, Yaliang Li, and Bolin Ding. 2022. pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning. arXiv preprint arXiv:2206.03655 (2022).
[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.
[7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM conference on recommender systems. 191–198.
[8] Chong Fu, Xuhong Zhang, Shouling Ji, Jinyin Chen, Jingzheng Wu, Shanqing Guo, Jun Zhou, Alex X Liu, and Ting Wang. 2022. Label inference attacks against vertical federated learning. In 31st USENIX Security Symposium (USENIX Security 22), Boston, MA.
[9] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. 2023. DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. In International Conference on Learning Representations.
[10] Siyuan Guo, Lixin Zou, Yiding Liu, Wenwen Ye, Suqi Cheng, Shuaiqiang Wang, Hechang Chen, Dawei Yin, and Yi Chang. 2021. Enhanced Doubly Robust Learning for Debiasing Post-click Conversion Rate Estimation. In The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[11] Chaoyang He, Songze Li, Jinhyun So, Xiao Zeng, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, et al. 2020. Fedml: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518 (2020).
[12] Yuanqin He, Yan Kang, Jiahuan Luo, Lixin Fan, and Qiang Yang. 2022. A hybrid self-supervised learning framework for vertical federated learning. arXiv preprint arXiv:2208.08934 (2022).
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[14] Kaggle. [n.d.]. Avazu dataset. https://www.kaggle.com/c/avazu-ctr-prediction.
[15] Kaggle. [n.d.]. Give Me Some Credit dataset. https://www.kaggle.com/c/GiveMeSomeCredit.
[16] Yan Kang, Yang Liu, and Xinle Liang. 2022. FedCVT: Semi-supervised vertical federated learning with cross-view training. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 4 (2022), 1–16.
[17] Vladimir Kolesnikov, Ranjit Kumaresan, Mike Rosulek, and Ni Trieu. 2016. Efficient batched oblivious PRF with applications to private set intersection. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 818–829.
[18] Criteo Labs. [n.d.]. Criteo dataset. https://labs.criteo.com/2014/02/download-kaggle-display-advertising-challenge-dataset/.
[19] Fan Lai, Yinwei Dai, Sanjay Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha Madhyastha, and Mosharaf Chowdhury. 2022. Fedscale: Benchmarking model and system performance of federated learning at scale. In International Conference on Machine Learning. PMLR, 11814–11827.
[20] Oscar Li, Jiankai Sun, Xin Yang, Weihao Gao, Hongyi Zhang, Junyuan Xie, Virginia Smith, and Chong Wang. 2022. Label Leakage and Protection in Two-party Split Learning. In International Conference on Learning Representations.
[21] Wenjie Li, Qiaolin Xia, Hao Cheng, Kouyin Xue, and Shu-Tao Xia. 2022. Vertical semi-federated learning for efficient online advertising. arXiv preprint arXiv:2209.15635 (2022).
[22] Wenjie Li, Qiaolin Xia, Junfeng Deng, Hao Cheng, Jiangming Liu, Kouying Xue, Yong Cheng, and Shu-Tao Xia. 2022. Semi-supervised cross-silo advertising with partial knowledge transfer. International Workshop on Trustworthy Federated Learning in Conjunction with IJCAI 2022 (FL-IJCAI’22) (2022).
[23] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. 2022. Diffusion-LM Improves Controllable Text Generation. In Advances in Neural Information Processing Systems.
[24] Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018. Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval.
[25] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282.
[26] P. Mooney. [n.d.]. Breast histopathology images. https://www.kaggle.com/datasets/paultimothymooney/breast-histopathology-images.
[27] Feiyang Pan, Xiang Ao, Pingzhong Tang, Min Lu, Dapeng Liu, Lei Xiao, and Qing He. 2020. Field-aware calibration: A simple and empirically strong method for reliable probabilistic predictions. In Proceedings of The Web Conference 2020. 729–739.
[28] PASCAL-Challenge-2008. [n.d.]. epsilon dataset. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
[29] Soumik Rakshit. [n.d.]. Yahoo answers dataset. https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset.
[30] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
[31] Jiankai Sun, Xin Yang, Yuanshun Yao, and Chong Wang. 2022. Label Leakage and Protection from Forward Embedding in Vertical Federated Learning. arXiv preprint arXiv:2203.01451 (2022).
[32] Jiankai Sun, Yuanshun Yao, Weihao Gao, Junyuan Xie, and Chong Wang. 2021. Defending against reconstruction attack in vertical federated learning. arXiv preprint arXiv:2107.09898 (2021).
[33] Jean Ogier du Terrail, Samy-Safwan Ayed, Edwige Cyffers, Felix Grimberg, Chaoyang He, Regis Loeb, Paul Mangold, Tanguy Marchand, Othmane Marfoq, Erum Mushtaq, et al. 2022. FLamby: Datasets and Benchmarks for Cross-Silo Federated Learning in Realistic Healthcare Settings. arXiv preprint arXiv:2210.04620 (2022).
[34] Praneeth Vepakomma, Otkrist Gupta, Abhimanyu Dubey, and Ramesh Raskar. 2018. Reducing leakage in distributed deep learning for sensitive health data. arXiv preprint arXiv:1812.00564 2 (2018).
[35] Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. 2018. Split learning for health: Distributed deep learning without sharing raw patient data. ICLR 2019 Workshop on AI for social good (2018).
[36] Zirui Wang, Yulia Tsvetkov, Orhan Firat, and Yuan Cao. 2021. Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models. In International Conference on Learning Representations.
[37] Penghui Wei, Weimin Zhang, Ruijie Hou, Jinquan Liu, Shaoguo Liu, Liang Wang, and Bo Zheng. 2022. Posterior Probability Matters: Doubly-Adaptive Calibration for Neural Predictions in Online Advertising. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2645–2649.
[38] Penghui Wei, Weimin Zhang, Zixuan Xu, Shaoguo Liu, Kuang-chih Lee, and Bo Zheng. 2021. AutoHERI: Automated Hierarchical Representation Integration for Post-Click Conversion Rate Estimation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3528–3532.
[39] Zixuan Xu, Penghui Wei, Weimin Zhang, Shaoguo Liu, Liang Wang, and Bo Zheng. 2022. UKD: Debiasing Conversion Rate Estimation via Uncertainty-regularized Knowledge Distillation. In Proceedings of The Web Conference 2022.
[40] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–19.
[41] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems 33 (2020), 5824–5836.
[42] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations.
[43] Yifei Zhang and Hao Zhu. 2020. Additively homomorphical encryption based deep neural network for asymmetrically collaborative machine learning. arXiv preprint (2020).