In recent years, foundation models have achieved remarkable success across many domains, yet their black-box nature poses major challenges for debugging, monitoring, controlling, and trusting them. Concept-based explanation is an emerging approach that seeks to explain model behavior in terms of individual concepts, such as object attributes (e.g., stripes) or sentiment in language (e.g., happiness). These concepts are derived by decomposing the representations a model has learned into a set of concept vectors. For example, a model's embedding of a dog image can be decomposed into the sum of concept vectors representing its fur, snout, and tail.
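To make the decomposition concrete, here is a minimal NumPy sketch. The concept vectors and coefficients are hypothetical stand-ins rather than values from the paper; it shows only that an embedding built as a weighted sum of concept directions can be decomposed back into per-concept coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                     # embedding dimension (e.g., CLIP features)
C = rng.normal(size=(3, d))                 # hypothetical concept vectors: fur, snout, tail
C /= np.linalg.norm(C, axis=1, keepdims=True)

# A "dog image" embedding modeled as a weighted sum of its concepts.
z = 0.7 * C[0] + 0.2 * C[1] + 0.4 * C[2]

# Decompose: recover the per-concept coefficients w such that z ≈ Cᵀw.
w, *_ = np.linalg.lstsq(C.T, z, rcond=None)
print(w.round(2))                           # ≈ [0.7, 0.2, 0.4]
```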
Limitations of existing methods
Existing methods based on techniques such as PCA or KMeans extract vector representations of basic concepts well. For example, Figure 1 shows images from the CUB dataset whose CLIP embeddings exhibit concepts learned by PCA. These techniques correctly extract representations of concepts such as "white bird" and "small bird"; however, adding those two representations together does not yield a representation of the concept "small white bird".
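The extraction-and-check procedure described here can be sketched as follows. The data is a random stand-in for CLIP embeddings of CUB images, and the component indices for "white bird" and "small bird" are hypothetical; the sketch shows only the mechanics of reading PCA directions as concept vectors and testing whether their sum aligns with a joint concept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))          # stand-in for CLIP embeddings of CUB images

# PCA-style concept extraction: read each principal direction as a concept vector.
concepts = PCA(n_components=20).fit(X).components_   # (20, 512), unit-norm rows
v_white, v_small = concepts[0], concepts[1]          # hypothetical component indices

v_sum = v_white + v_small
v_sum /= np.linalg.norm(v_sum)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compositionality check: compare the summed direction against the mean
# embedding of images that actually show small white birds (here, an
# arbitrary subset standing in for that group).
joint = X[:50].mean(axis=0)
print(cosine(v_sum, joint))   # stays low when the concepts do not compose
```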
Why concept compositionality matters
Concept compositionality is essential for several use cases:
- Explaining model predictions: concepts can be composed to explain a model's predictions.
- Editing model behavior: compositional concepts enable fine-grained edits to model behavior, such as improving a large language model's truthfulness without affecting its other behaviors (see the sketch after this list).
- Training on new tasks: a model can be trained to compose basic concepts for a new task, such as classifying birds from concepts like beak shape, wing color, and habitat.
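As a concrete illustration of the editing use case, the sketch below shifts a hidden representation along a unit-norm concept direction. The "truthfulness" direction and the scale alpha are hypothetical; the point is that adding alpha * v nudges only the behavior captured by v, since components orthogonal to v are left untouched.

```python
import numpy as np

def edit_along_concept(h, v, alpha=2.0):
    """Shift representation h along concept direction v by strength alpha."""
    v = v / np.linalg.norm(v)    # unit-norm concept direction
    return h + alpha * v         # components orthogonal to v are unchanged

# Hypothetical usage: push an LLM hidden state along a "truthfulness" direction.
rng = np.random.default_rng(0)
h, v_truthful = rng.normal(size=(2, 4096))
h_edited = edit_along_concept(h, v_truthful, alpha=1.5)
```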
Evaluating concept compositionality
To evaluate concept compositionality, we first validate the compositionality of ground-truth concept representations in a controlled setting. We observe that concepts can be grouped into attributes, where each attribute contains the concepts describing some shared property, such as an object's color or shape. Concepts from different attributes (e.g., blue and cube) can be composed, while concepts from the same attribute (e.g., red and green) cannot. We also observe that concepts from different attributes are approximately orthogonal, whereas concepts from the same attribute are not.
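This observation suggests a simple diagnostic, sketched below: the mean absolute cosine similarity between concept vectors drawn from two attribute groups. The attribute groupings and vectors are assumed inputs; for ground-truth concepts, cross-attribute pairs should score near zero while distinct same-attribute pairs should not.

```python
import numpy as np

def mean_abs_cosine(A, B):
    """Mean |cosine similarity| over all pairs of rows from A and B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float(np.abs(A @ B.T).mean())

# colors, shapes: row-stacked concept vectors for two attributes (assumed given).
# mean_abs_cosine(colors, shapes)         -> near 0 for ground-truth concepts
# mean_abs_cosine(colors[:1], colors[1:]) -> typically far from 0
```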
Compositional Concept Extraction (CCE)
To extract compositional concepts, we propose CCE. Its key idea is to search for an entire subspace of concepts at once rather than for individual concepts, which lets CCE enforce the properties of compositional concepts identified above. The algorithm consists of the following steps (a minimal sketch follows the list):
- LearnSubspace: optimize a subspace such that the data projected into it cluster well around a set of fixed cluster centers.
- LearnConcepts: run spherical k-means in the learned subspace to identify concepts.
- Iterate: alternate the LearnSubspace and LearnConcepts steps until convergence.
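Below is a minimal sketch of this alternation, not the authors' implementation: the subspace is a d×m matrix P with orthonormal columns, LearnConcepts is a plain spherical k-means over the projected data, and LearnSubspace minimizes the mean cosine distance from each projected point to its fixed assigned centroid as a stand-in for the paper's clustering objective, re-orthonormalizing P by QR after each gradient step. The hyperparameters (m, k, rounds) are placeholders.

```python
import numpy as np
import torch

def learn_concepts(Z, k, iters=50, seed=0):
    """Spherical k-means: cluster the unit-normalized projections Z and
    return unit-norm centroids (the concepts) plus cluster assignments."""
    rng = np.random.default_rng(seed)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    C = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(Z @ C.T, axis=1)            # nearest centroid by cosine
        for j in range(k):
            members = Z[assign == j]
            if len(members) > 0:
                C[j] = members.sum(0) / np.linalg.norm(members.sum(0))
    return C, assign

def learn_subspace(X, C, assign, m, steps=100, lr=1e-2):
    """Optimize a d×m orthonormal projection P so that projected, normalized
    points lie close (in cosine distance) to their fixed assigned centroids."""
    Xt = torch.tensor(X, dtype=torch.float32)
    Ct = torch.tensor(C[assign], dtype=torch.float32)  # per-point target centroid
    P = torch.linalg.qr(torch.randn(X.shape[1], m))[0].requires_grad_(True)
    opt = torch.optim.Adam([P], lr=lr)
    for _ in range(steps):
        Z = torch.nn.functional.normalize(Xt @ P, dim=1)
        loss = (1.0 - (Z * Ct).sum(dim=1)).mean()      # mean cosine distance
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                          # re-orthonormalize P
            P.copy_(torch.linalg.qr(P)[0])
    return P.detach().numpy().astype(X.dtype)

def cce(X, k, m, rounds=5):
    """Alternate LearnConcepts / LearnSubspace, then map the concepts back
    into the full embedding space."""
    P = np.linalg.qr(np.random.default_rng(0).normal(size=(X.shape[1], m)))[0]
    for _ in range(rounds):
        C, assign = learn_concepts(X @ P, k)
        P = learn_subspace(X, C, assign, m)
    C, _ = learn_concepts(X @ P, k)
    return C @ P.T                                     # (k, d) concept vectors
```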
Experimental results
We conduct extensive experiments on vision and language datasets, which show that:
- In controlled settings, CCE composes concepts more effectively than existing methods.
- On real-world data, CCE discovers novel, meaningful compositional concepts.
- The compositional concepts extracted by CCE improve downstream task performance.
Conclusion
This paper studies concept-based explanations of foundation models through the lens of compositionality. We validate that ground-truth concepts extracted from these models are compositional, while existing unsupervised concept extraction methods generally do not guarantee compositionality. To address this, we first identify two salient properties of compositional concept representations and then design CCE, a new concept extraction method that respects these properties by design. Through extensive experiments on vision and language datasets, we show that CCE not only learns compositional concepts but also improves downstream task performance.