Is Free Self-Alignment Possible?

This paper investigates the possibility of aligning large language models (LLMs) without the need for human-annotated data or expensive fine-tuning. The authors propose AlignEZ, a novel method that leverages self-generated preference data and representation editing to achieve nearly cost-free alignment.

Here’s a breakdown of the paper’s key aspects:

1. Motivation:

Traditional LLM alignment methods rely heavily on human preference data and computationally expensive fine-tuning, which limits scalability.
Recent research suggests that alignment might simply be revealing knowledge already present in pretrained models.

2. AlignEZ Approach:

Self-Generated Preference Data:
- The base LLM is prompted to generate its own preference data by describing characteristics of helpful and harmful responses.
- Using these characteristics, the LLM generates pairs of responses, simulating preference comparisons.
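
The two prompting rounds above can be sketched as follows. This is a minimal illustration, not the authors’ code: the prompt templates, the `build_preference_pairs` name, and the `generate` callable (any function that maps a prompt string to a completion) are all assumptions made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates; the paper's exact wording is not given here.
CHARACTERISTICS_PROMPT = (
    "Question: {query}\n"
    "List characteristics of a {kind} response to this question."
)
RESPONSE_PROMPT = (
    "Question: {query}\n"
    "Write a response that has these characteristics: {traits}"
)


def build_preference_pairs(
    queries: List[str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """For each query, ask the model for traits of helpful and harmful
    answers, then ask it for one answer conditioned on each set of traits."""
    pairs = []
    for query in queries:
        sample = {"query": query}
        for kind in ("helpful", "harmful"):
            # Round 1: the model describes what such a response looks like.
            traits = generate(CHARACTERISTICS_PROMPT.format(query=query, kind=kind))
            # Round 2: the model writes a response with those traits.
            sample[kind] = generate(RESPONSE_PROMPT.format(query=query, traits=traits))
        pairs.append(sample)
    return pairs
```

Each returned dictionary holds one query with a `helpful` and a `harmful` response, which is the shape the later direction-finding step needs. No human labels enter the loop; the only input is the list of queries.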
Identifying Preference Directions:
- The self-generated preference pairs are used to identify directions in the LLM’s embedding space that correspond to helpful and harmful attributes.
- Two methods are explored:
  - SVD-Based Identification: Applies Singular Value Decomposition (SVD) to the embedding matrix of the preference data and takes the top singular vector as the preference direction.
  - CCS-Based Identification: Trains a Contrast-Consistent Search (CCS) probe on the self-generated data to find a direction that separates helpful from harmful attributes.
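
As a rough sketch of the SVD variant, the snippet below extracts the top right singular vector from a matrix of embeddings using NumPy. The function name, the choice not to center the matrix, and the sign convention are assumptions made for illustration; the summary does not specify these details.

```python
import numpy as np


def svd_direction(embeddings: np.ndarray) -> np.ndarray:
    """Return the unit-norm top right singular vector of an
    (n_samples, dim) embedding matrix."""
    # Rows of vt are right singular vectors, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    direction = vt[0]
    # Singular vectors are only defined up to sign; as an assumed convention,
    # orient the direction so it points toward the mean embedding.
    if direction @ embeddings.mean(axis=0) < 0:
        direction = -direction
    return direction
```

Under this reading, the function would be called once on embeddings of the helpful samples and once on embeddings of the harmful samples, giving one direction per attribute.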
Representation Editing:
- During inference, the LLM’s embeddings are modified by:
  - Boosting components aligned with the helpful direction.
  - Neutralizing components aligned with the harmful direction.
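
One way to picture the edit is as two vector projections: remove the component along the harmful direction, then amplify the component along the helpful direction. The sketch below assumes exactly that; the function name, the `alpha` scaling factor, and the order of the two operations are illustrative choices, not details confirmed by the summary.

```python
import numpy as np


def edit_representation(
    x: np.ndarray,
    helpful_dir: np.ndarray,
    harmful_dir: np.ndarray,
    alpha: float = 1.0,
) -> np.ndarray:
    """Edit a batch of embeddings of shape (n, dim)."""
    h = helpful_dir / np.linalg.norm(helpful_dir)
    r = harmful_dir / np.linalg.norm(harmful_dir)
    # Neutralize: subtract each row's projection onto the harmful direction.
    x = x - (x @ r)[:, None] * r
    # Boost: add alpha times each row's projection onto the helpful direction.
    x = x + alpha * (x @ h)[:, None] * h
    return x
```

If the two directions are not orthogonal, the boost step reintroduces a small harmful component; an implementation that cares about this could orthogonalize the directions first.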

3. Experiments and Results:

AlignEZ narrows the performance gap between base models and their traditionally aligned counterparts by 31.6% on average, across six datasets and three model architectures.
It also expedites more expensive alignment methods such as DPO by improving models trained with only a small amount of ground-truth preference data.
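
To make the “gap reduction” figure concrete, one plausible reading is the fraction of the base-to-aligned score difference that the edited model recovers. The helper below encodes that reading; the formula and the function name are assumptions, since the summary does not define the metric, and the numbers in the usage note are made up purely to show the arithmetic.

```python
def gap_closed(base: float, edited: float, aligned: float) -> float:
    """Fraction of the gap between a base model's score and an aligned
    model's score that the edited model recovers."""
    gap = aligned - base
    if gap == 0:
        raise ValueError("base and aligned scores are equal; the gap is undefined")
    return (edited - base) / gap
```

For example, with illustrative scores of 10 for the base model, 13 for the edited model, and 20 for the aligned model, `gap_closed(10, 13, 20)` evaluates to 3 / 10, meaning 30% of the gap is closed.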

4. Key Findings:

Self-alignment is achievable to a significant degree without external data or fine-tuning.
AlignEZ offers a cost-effective way to improve LLM alignment, potentially enabling real-time personalization and fine-grained control.

5. Limitations and Future Work:

The quality of self-generated preference data influences AlignEZ’s effectiveness.
Further research is needed to explore its applicability to more complex alignment tasks and different data modalities.

In conclusion, AlignEZ presents a promising step towards free self-alignment, offering a cost-effective and potentially scalable approach to aligning LLMs with human preferences.

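Free Self-Alignment: Can Language Models Understand You Better?

Large language models (LLMs) are changing our world, but they also have problems. For example, they sometimes generate inaccurate, unfriendly, or biased content. To address this, researchers have been working to align LLMs so that they better match human values and preferences.

Traditional alignment methods usually require large amounts of annotated data and substantial compute, which is a major hurdle for many researchers and developers. Is there a cheaper, more convenient way to align a model?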

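AlignEZ: Nearly Free Alignment

Researchers at the University of Wisconsin-Madison recently proposed a method called AlignEZ that achieves nearly free LLM self-alignment. The core idea is to use preference data generated by the LLM itself to modify its internal representations, steering the model toward outputs that better match human expectations.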

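How Does Self-Alignment Work?

The AlignEZ workflow has three main steps:

Generating preference data: The researchers first have the LLM generate its own preference data. They pose questions to the LLM and ask it to describe the characteristics that an ideal answer and an undesirable answer should have. They then pose the same questions again and ask the LLM to generate different answers based on those characteristics. This yields preference pairs generated by the LLM itself.
Identifying preference directions: Next, the researchers use these preference pairs to identify directions in the LLM’s internal representation space that relate to human preferences. They use two methods:
- Singular Value Decomposition (SVD): SVD finds the dominant directions in the LLM’s internal representation space, which typically relate to human preferences.
- Contrast-Consistent Search (CCS): CCS finds a hyperplane in the LLM’s internal representation space that separates ideal answers from undesirable ones.
Editing internal representations: Finally, the researchers use the identified preference directions to modify the LLM’s internal representations. They strengthen the directions associated with human preferences and suppress those associated with undesirable characteristics, guiding the LLM toward outputs that better match human expectations.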
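
For readers unfamiliar with CCS, the sketch below shows the unsupervised objective from the original Contrast-Consistent Search formulation, applied to a linear probe over paired embeddings. How AlignEZ configures its probe is not described here, so the function names and the omission of the usual per-class input normalization are simplifying assumptions for illustration.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def ccs_loss(theta: np.ndarray, bias: float, pos: np.ndarray, neg: np.ndarray) -> float:
    """CCS objective for a linear probe.

    pos and neg are (n, dim) embeddings of the two members of each
    contrast pair; theta is the (dim,) probe weight vector.
    """
    p_pos = sigmoid(pos @ theta + bias)
    p_neg = sigmoid(neg @ theta + bias)
    # Consistency: the two probabilities of a pair should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))
```

Minimizing this loss over `theta` and `bias` yields a hyperplane; under the reading above, its normal vector `theta` would play the role of the preference direction.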

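Experimental Results: A Significant Performance Gain

The researchers tested AlignEZ on six datasets and three LLM architectures. The results show that AlignEZ significantly narrows the performance gap between a base LLM and its aligned version, by 31.6% on average.

More importantly, AlignEZ can also speed up more expensive alignment methods such as DPO. The researchers found that AlignEZ improves the performance of DPO models trained with only a small amount of annotated data.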

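Outlook: More Precise, More Personalized Alignment

AlignEZ opens up new possibilities for LLM alignment. The researchers hope to improve AlignEZ further so that it identifies human preferences more precisely and supports more personalized alignment.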

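Summary

AlignEZ is a novel LLM self-alignment method that uses preference data generated by the LLM itself to achieve nearly free alignment. Its experimental results show that it can significantly improve LLM performance and speed up more expensive alignment methods. It lays the groundwork for more precise and more personalized LLM alignment techniques.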
