Here’s a breakdown of the paper’s key aspects:
1. Motivation:
- Traditional LLM alignment methods heavily rely on human preference data and computationally expensive fine-tuning, limiting scalability.
- Recent research suggests that alignment might simply be revealing knowledge already present in pretrained models.
2. AlignEZ Approach:
- Self-Generated Preference Data:
- The base LLM is prompted to generate its own preference data by describing characteristics of helpful and harmful responses.
- Using these characteristics, the LLM then generates contrasting pairs of responses, simulating preference comparisons (a prompting sketch follows this list).
- Identifying Preference Directions:
- The self-generated preference pairs are used to identify directions in the LLM’s embedding space that correspond to helpful and harmful attributes.
- Two methods are explored:
- SVD-Based Identification: Applies singular value decomposition (SVD) to the embedding matrix of the preference data and takes the top singular vector as the preference direction (see the SVD sketch after this list).
- CCS-Based Identification: Trains a Contrast-Consistent Search (CCS) probe on the self-generated data to find directions that maximally separate helpful from harmful attributes.
- Representation Editing:
- During inference, the LLM’s embeddings are modified by:
- Boosting components aligned with the helpful direction.
- Neutralizing components aligned with the harmful direction.
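To make the self-generation step concrete, here is a minimal Python sketch of how such a prompting loop could look. The `generate_text` helper, the prompt wording, and the returned keys are illustrative assumptions, not the paper's exact prompts or code.

```python
# Illustrative sketch of step 1 (self-generated preference pairs).
# `generate_text(prompt) -> str` is an assumed wrapper around the base LLM's sampling call.

def build_preference_pair(generate_text, query: str) -> dict:
    # Ask the base model what a good vs. bad answer to this query looks like.
    helpful_traits = generate_text(
        f"Question: {query}\nList the characteristics of a helpful, honest answer."
    )
    harmful_traits = generate_text(
        f"Question: {query}\nList the characteristics of an unhelpful or harmful answer."
    )
    # Re-ask the same query, conditioning on each description, to get a contrastive pair.
    chosen = generate_text(
        f"Question: {query}\nWrite an answer with these characteristics:\n{helpful_traits}"
    )
    rejected = generate_text(
        f"Question: {query}\nWrite an answer with these characteristics:\n{harmful_traits}"
    )
    return {"prompt": query, "chosen": chosen, "rejected": rejected}
```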
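And a minimal NumPy sketch of the SVD step, assuming hidden-state embeddings for the preferred and dispreferred response of each pair have already been collected at one layer; taking the top singular vector of the pairwise differences is one plausible reading, not necessarily the paper's exact recipe.

```python
import numpy as np

def preference_direction_svd(chosen_embs: np.ndarray, rejected_embs: np.ndarray) -> np.ndarray:
    """Top singular direction of the helpful-minus-harmful embedding differences.

    chosen_embs, rejected_embs: arrays of shape (n_pairs, hidden_dim) holding the
    hidden state of the preferred and dispreferred response in each pair.
    """
    # Each row captures what separates a helpful response from its harmful counterpart.
    diffs = chosen_embs - rejected_embs
    # Top right-singular vector = dominant direction shared across the pairs.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    return direction / np.linalg.norm(direction)
```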
3. Experiments and Results:
- AlignEZ significantly reduces the performance gap between base and traditionally aligned models by an average of 31.6% across various datasets and model architectures.
- It effectively expedites more expensive alignment methods like DPO by improving models trained with limited ground-truth data.
4. Key Findings:
- Self-alignment is achievable to a significant degree without external data or fine-tuning.
- AlignEZ offers a cost-effective way to improve LLM alignment, potentially enabling real-time personalization and fine-grained control.
5. Limitations and Future Work:
- The quality of self-generated preference data influences AlignEZ’s effectiveness.
- Further research is needed to explore its applicability to more complex alignment tasks and different data modalities.
In conclusion, AlignEZ presents a promising step towards free self-alignment, offering a cost-effective and potentially scalable approach to aligning LLMs with human preferences.
Free Self-Alignment: Making Language Models Understand You Better?
Large language models (LLMs) are changing our world, but they still have problems: they sometimes produce inaccurate, unhelpful, or biased content. To address this, researchers have long worked to align LLMs with human values and preferences.
Traditional alignment methods typically require large amounts of annotated data and substantial compute, which is a major barrier for many researchers and developers. Is there a cheaper, more convenient way to align models?
AlignEZ: Nearly Free Alignment
Recently, researchers at the University of Wisconsin-Madison proposed a new method called AlignEZ that makes LLM self-alignment nearly free. Its core idea is to use preference data generated by the LLM itself to edit the model's internal representations, steering it toward outputs that better match human expectations.
How Does Self-Alignment Work?
The AlignEZ pipeline has three main steps:
- Generate preference data: The base LLM first produces its own preference data. Given a query, it is asked to describe the characteristics an ideal answer and an undesirable answer should have; it is then asked the same query again and prompted to write answers matching each set of characteristics. The result is a collection of self-generated preference pairs.
- Identify preference directions: These pairs are then used to find directions in the LLM's internal representation space that correspond to human preferences, using two methods:
- Singular value decomposition (SVD): SVD extracts the dominant directions in the space of preference-pair embeddings, which tend to track what humans prefer.
- Contrast-consistent search (CCS): CCS instead learns a hyperplane in the representation space that separates ideal answers from undesirable ones (see the probe sketch after this list).
- Edit internal representations: Finally, the identified directions are used to modify the LLM's hidden states: boosting components along the preferred direction and suppressing components along the undesirable direction steers the model toward outputs that better match human expectations (a hook-based sketch follows this list).
Experimental Results: Clear Performance Gains
The researchers evaluated AlignEZ on six datasets and three LLM architectures. The results show that AlignEZ substantially narrows the performance gap between a base LLM and its aligned counterpart, by 31.6% on average.
More importantly, AlignEZ can also accelerate more expensive alignment methods such as DPO: the researchers found it improves DPO models trained with only a small amount of annotated data.
Looking Ahead: More Precise, More Personalized Alignment
AlignEZ opens up new possibilities for LLM alignment. The researchers hope to refine it further so it can identify human preferences more precisely and support more personalized alignment.
Summary
AlignEZ is a novel LLM self-alignment method that uses the model's own self-generated preference data to achieve nearly free alignment. Experiments show it substantially improves base-model performance and accelerates more expensive alignment methods. AlignEZ opens new possibilities for LLM alignment and lays the groundwork for more precise, more personalized alignment techniques.