HI-Concept: Explaining Language Model Predictions with High-Impact Concepts

Over the past few years, large language models (LLMs) have achieved tremendous progress and are now widely applied in sensitive domains such as personalized recommendation bots and recruitment.

However, Explainable AI (XAI) has not witnessed the same progress, making it difficult to understand LLMs’ opaque decision processes. As a result, many users remain reluctant to adopt LLMs in high-stakes applications due to transparency and privacy concerns.

In this work, we aim to increase user trust and encourage transparency by deriving explanations that allow humans to better predict the model outcomes.

Specifically, we extract predictive high-level features (concepts) from the model’s hidden-layer activations. We then optimize for features whose presence causes the output predictions to change substantially. Extensive experiments on real and synthetic tasks demonstrate that our method achieves superior results on predictive impact, explainability, and faithfulness compared to the baselines, especially for LLMs.


I. Concept bottleneck models

To understand what happens inside an LLM, previous studies show that dense vector representations in high layers of a language model tend to capture semantic meanings that are useful for solving the underlying task.

However, such vector representations are not understandable to humans. To address this, concept-based explanations map the hidden activation space to human-understandable features. For example, the concept bottleneck model [1] first predicts an intermediate set of human-specified concepts, and then uses them to predict the target.

Fig 1. Illustration of concept-based explanations that result in high impact (green line) or not (red line) when explaining the LLMs in a sentiment classification task.

As illustrated by purple boxes in Fig 1, for the movie review classification task, concept-based explanations are semantically meaningful word clusters corresponding to abstract features such as “acting” and “directing”.


II. Shortcoming: lack of consideration for “impact”

However, existing concept-based methods do not consider the explanation impact on output predictions, leading to inferior explanations.

By impact, we mean the causal effect of removing a feature on output predictions. As [2] points out, these non-impact-aware methods derive correlational explanations that cannot answer questions about decision-making under alternative situations and are thus unreliable.

An example is illustrated in Fig 1. Because of the conventional expression “hot mess”, the word “hot” often co-occurs with “mess”, which usually signals negative sentiment. Traditional concept-based methods that do not consider impact may falsely use the correlational feature “weather” (i.e., “hot”) to explain why the model classifies a review as negative. However, excluding the “weather” concept does not cause the output prediction to change at all, resulting in zero impact (red line). Thus, low-impact explanations such as “weather” are less valid, as users cannot rely on them to consistently predict the model’s behavior when a feature changes.

III. Our solution

To tackle this bottleneck and incorporate impact into traditional concept-based models, in this work, we propose High-Impact Concepts (HI-concept), a complete concept explanation framework with causal impact optimization.

Fig 2. The overall concept generation process of a concept bottleneck model.

Concept-based explanations are a well-established method that extracts human-understandable concepts from the model’s hidden space. One classic architecture for deriving them is the concept bottleneck model [3], shown in Fig 2. The pretrained model f can be viewed as a composite of two functions, split at an intermediate layer: f(x) = Ψ(Φ(x)). A bottleneck-shaped network then reconstructs Φ(x) with a 2-layer perceptron g.
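As a minimal sketch of this architecture (in plain NumPy; the function names, dimensions, and sigmoid scoring here are illustrative assumptions, not the paper’s exact implementation), concept scores are alignments of the activation Φ(x) with learned concept vectors, and the 2-layer perceptron g maps those scores back to the hidden space:

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_scores(h, C):
    # h: (d,) hidden activation Phi(x); C: (k, d) learned concept vectors.
    # Each score is the sigmoid-squashed alignment of h with one concept.
    return 1.0 / (1.0 + np.exp(-C @ h))

def reconstruct(h, C, W1, W2):
    # The 2-layer perceptron g maps concept scores back to the hidden space,
    # so that Psi(g(...)) can approximate the original prediction Psi(Phi(x)).
    p = concept_scores(h, C)          # (k,) concept probabilities
    hidden = np.maximum(0.0, W1 @ p)  # ReLU layer
    return W2 @ hidden                # (d,) reconstruction of Phi(x)

d, k, m = 8, 4, 16                    # hidden dim, #concepts, MLP width
h = rng.normal(size=d)
C = rng.normal(size=(k, d))
W1 = rng.normal(size=(m, k))
W2 = rng.normal(size=(d, m))
h_rec = reconstruct(h, C, W1, W2)
```

Feeding `h_rec` into the frozen head Ψ yields the surrogate prediction that the losses below train to match.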

To train the concept model in an end-to-end way, two losses were previously used:

1. Reconstruction loss: To faithfully recover the original model’s predictions, a surrogate cross-entropy (CE) loss between the original model’s predictions and the surrogate’s is optimized.

2. Regularization loss: To make concepts more explainable, a regularization loss forces each concept vector to correspond to actual examples and concepts to be distinct from each other.
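The two losses above might be sketched as follows (a simplified NumPy version; the weighting `lam` and the exact form of the groundedness/distinctness terms are assumptions rather than the paper’s precise objective):

```python
import numpy as np

def reconstruction_loss(p_orig, p_surr, eps=1e-9):
    # Cross-entropy between the original model's output distribution and
    # the surrogate's, so the concept model recovers the same predictions.
    return -np.mean(np.sum(p_orig * np.log(p_surr + eps), axis=-1))

def regularization_loss(C, H, lam=0.1):
    # C: (k, d) unit-norm concept vectors; H: (n, d) hidden activations.
    # Groundedness: each concept should lie close to some real activations.
    # Distinctness: different concepts should not point the same way.
    sim = C @ H.T                              # (k, n) concept-example similarity
    grounded = -np.mean(np.max(sim, axis=1))
    overlap = C @ C.T - np.eye(len(C))         # off-diagonal concept similarity
    distinct = np.mean(np.abs(overlap))
    return grounded + lam * distinct
```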


Fig 3. Illustration of the causal graph indicating the confounding association in explanation models. Blue is a real-life example. Green is the correspondence in a movie review classification task.

However, as stated earlier, not considering impact can result in confounding, correlational explanations. These failure cases can be explained theoretically by the causality analysis in Fig 3: to produce the sentiment prediction Y, the hidden activation space of a pretrained LLM contains both correlated features E and predictive features Z. Although only Z truly affects the prediction Y, E and Z may be correlated due to confounding effects introduced by the input X. A traditional concept mining model does not differentiate between E and Z and considers both valid, so it may easily use the confounding association as an explanation instead of the true causal path. The resulting concepts are problematic because they do not support a robust understanding of the model’s behavior.

To tackle this challenge, we enforce explanations to be predictive by considering their “impact”. To formally define the impact of a feature, we use two standard definitions from causal analysis: the Individual Treatment Effect (ITE) and the Average Treatment Effect (ATE), which measure the effect of interventions in randomized experiments. Given a binary treatment variable T that indicates whether a do-operation is performed (i.e., a feature is perturbed), ATE and ITE are defined as the change in expected outcome under treatment T=1.
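In standard potential-outcomes notation (generic symbols Y and T here, not necessarily the paper’s exact formulation), the two quantities read:

```latex
\mathrm{ITE}_i = Y_i(T{=}1) - Y_i(T{=}0), \qquad
\mathrm{ATE} = \mathbb{E}\left[\, Y \mid do(T{=}1) \,\right] - \mathbb{E}\left[\, Y \mid do(T{=}0) \,\right]
```

ITE measures the outcome change for a single instance i, while ATE averages this effect over the whole population.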

In our case, a concept is discovered as a direction in the latent space, corresponding to a feature in the input distribution. As f is fixed, its prediction process is deemed deterministic and reproducible, allowing us to conduct experiments with treatments.

Therefore, we propose removing a specific concept as the do-operation, and define the impact of a concept on an instance as the resulting change in the model’s output prediction.
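As a hypothetical sketch, one way to realize the removal do-operation is to project the concept direction out of the hidden activation and measure how far the head Ψ’s output distribution moves (the projection-based removal and total-variation distance here are illustrative choices, not necessarily the paper’s exact operation):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def remove_concept(h, c):
    # do-operation: project the concept direction c out of activation h.
    u = c / np.linalg.norm(c)
    return h - (h @ u) * u

def impact(psi, h, c):
    # Impact of concept c on one instance: total-variation distance between
    # the output distributions before and after the removal intervention.
    return 0.5 * np.abs(psi(h) - psi(remove_concept(h, c))).sum()
```

A concept direction orthogonal to the activation leaves it unchanged, so its impact is zero, matching the intuition that only directions the activation actually uses can be impactful.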

In order to incorporate consideration for impact into the concept discovery process, we introduce two new losses to the original framework:

1. Auto-encoding loss: To guarantee that intervened representations are still meaningful, we optimize an auto-encoding loss on a proxy task that reconstructs the hidden representations. With this loss, the concept model becomes autoencoder-like and can mimic the generation process of the real distribution over hidden representations. Concept vectors can then be seen as key factors in that generation process, which lets us perform valid interventions on them, such as the removal intervention.

2. Causality loss: Directly optimizing for causality is challenging because causal impact is difficult to estimate during training. Therefore, we approximate impact by randomly removing a set of concepts S and computing the expected impact over the training set. We can then disentangle concept directions with greater impact by optimizing a loss that rewards a large expected impact.
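The two new losses might be sketched as follows (a simplified NumPy version; the Monte-Carlo sampling scheme, the projection-based removal, and the sample count are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def autoencoding_loss(H, H_rec):
    # Mean-squared reconstruction of the hidden activations themselves, so
    # intervened representations stay on the activation manifold.
    return np.mean((H - H_rec) ** 2)

def causality_loss(psi, H, C, n_samples=8):
    # Monte-Carlo estimate of expected impact: drop a random subset S of
    # concepts and average the resulting output change. The sign is negative
    # because the optimizer minimizes the loss while impact should be large.
    total = 0.0
    for _ in range(n_samples):
        mask = rng.random(len(C)) < 0.5          # random concept subset S
        H_do = H.copy()
        for c in C[mask]:
            u = c / np.linalg.norm(c)
            H_do = H_do - np.outer(H_do @ u, u)  # project out each concept
        total += np.abs(psi(H) - psi(H_do)).mean()
    return -total / n_samples
```

Minimizing this term alongside the reconstruction, regularization, and auto-encoding losses pushes the discovered concept directions toward ones whose removal actually changes the predictions.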

IV. Experimental Results

We test the effectiveness of our method with two standard text classification datasets: IMDB and AG-news. IMDB consists of movie reviews labeled with positive or negative sentiments, while AG-news is a dataset of news articles categorized into 4 topics. We explain four classification models: 1) a 6-layer transformer encoder trained from scratch, 2) a pre-trained BERT with finetuning, 3) a pre-trained T5 model with finetuning, and 4) a 7B Llama model with in-context learning.

We evaluate the explanation methods quantitatively and qualitatively with comprehensive metrics based on three important considerations:

1. Faithfulness: The explanations should accurately mimic the original model’s prediction process. To verify this for the surrogate model, we quantitatively evaluate whether the captured concept probabilities can recover the original model’s predictions, measured by Recovering Accuracy (Acc), Precision, Recall, F1, and Completeness.

2. Causality: When a feature is perturbed in real life, the output predictions should change accordingly. This causal impact ensures that explanations remain reliable under alternative situations. We use the CACE metric from previous work [4] and the accuracy change metric (ΔAcc) to provide a more comprehensive overview.

3. Explainability: The explanations should be understandable to humans and able to assist users in real-life tasks. Because the concepts have a high impact on predictions, we expect them to help end-users better understand the model’s decisions. We include visualizations and human studies to test this qualitatively.
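Loose sketches of the two causality metrics (the precise definitions are in [4] and the paper; the function signatures and the total-variation reading of CACE are simplifying assumptions):

```python
import numpy as np

def delta_acc(preds_before, preds_after, labels):
    # Accuracy change after a concept-removal intervention: a large drop
    # means the removed concept was genuinely driving the predictions.
    acc = lambda p: float(np.mean(p == labels))
    return acc(preds_before) - acc(preds_after)

def cace(probs_with, probs_without):
    # One simplified reading of CACE: expected difference between output
    # distributions with the concept present vs. intervened away.
    return float(np.abs(probs_with.mean(axis=0) - probs_without.mean(axis=0)).sum())
```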

For baselines, we use other unsupervised dimension-reduction methods to discover concepts in the hidden space: 1) PCA and K-means are popular non-parametric clustering techniques that reduce high-dimensional datasets to key features to increase interpretability. 2) β-TCVAE [5] is a disentangling VAE method that explicitly considers causal impact while reducing dimensionality. 3) ConceptSHAP [3] represents traditional concept bottleneck models that do not consider impact.

Table 1. Faithfulness (Acc, Precision, Recall, F1, Completeness) and causality (CACE, ΔAcc) evaluation of different text classification methods. The best result is bolded, and the second-best result is underlined.

The experimental results on the text classification datasets are presented in Table 1. Overall, HI-Concept not only achieves the best performance on causality but also improves on faithfulness. Notably, concepts discovered by HI-Concept show significant improvements in both causality and faithfulness, especially for pretrained models such as BERT and Llama. This validates the hypothesis that HI-Concept yields larger improvements for bigger pre-trained models with more complex architectures: with more parameters and pretraining, these models can encode more correlational information and thus contain more spurious correlations, so HI-Concept’s causality awareness is especially beneficial in such highly correlational scenarios.

We take a closer look at BERT for AG-News to qualitatively examine the discovered concepts in terms of causality and explainability.

Table 2. Generated concepts with Average Impact (CACE) from AG-News dataset, BERT model. CS is ConceptSHAP, HI-C is HI-concept. Each line is one concept, represented by keywords, which are ordered by descending importance.

Table 2 visualizes the most and least causal concepts obtained from both the baseline ConceptSHAP and our HI-Concept. The words are ordered by descending concept importance scores. For the most causal concept (i.e., larger CACE), the one from ConceptSHAP implies technological news but contains confounding keywords from the sports category (e.g., “red”, “super”, “game”). The one from HI-Concept clearly points to political news, without confounding words from other categories. For the least causal concept, ConceptSHAP’s consists only of correlational, semantically meaningless words, whereas HI-Concept’s still contains class-specific words (e.g., “us”, “knicks”), which result in a non-zero CACE. Overall, HI-Concept yields a set of more task-relevant and semantically meaningful concepts.

Fig. 4. Qualitative comparison from AG-News: “World” news misclassified as “Sports” by BERT.

Fig 4 shows a failure case (“World” news misclassified as “Sports”) highlighted with the top concept discovered. ConceptSHAP discovers a top concept related to the keywords “leads”, “as expected”, and “on thursday”, which are not informative as to why the model classified this input as “Sports”. In contrast, HI-Concept can point out precisely why: BERT is attending to keywords such as “dream team”, “game”, and country names. Such examples show the potential of HI-Concept for understanding the model’s failure modes, which we further investigate in the paper with a carefully designed human study.

Moreover, in the original paper we perform hyperparameter analysis, word-cloud analysis, a human study, an ablation study, and an insertion study to demonstrate the usability of HI-Concept.

V. Conclusions

In this paper, we propose HI-concept to derive impactful concepts to explain the black-box language model’s decisions.

Our framework not only derives high-impact concepts that mitigate the confounding issue with the proposed causal objective, but also advances previous evaluations via both quantitative global accuracy change and qualitative insertion study.

Extensive experiments, visualizations, and human studies demonstrate that HI-Concept produces semantically coherent and user-friendly concept explanations.


[1] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept bottleneck models. In International Conference on Machine Learning, pages 5338–5348. PMLR.

[2] Raha Moraffah, Mansooreh Karami, Ruocheng Guo, Adrienne Raglin, and Huan Liu. 2020. Causal interpretability for machine learning-problems, methods and evaluation. ACM SIGKDD Explorations Newsletter, 22(1):18–33.

[3] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. 2020. On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems, 33:20554–20565.

[4] Yash Goyal, Amir Feder, Uri Shalit, and Been Kim. 2019. Explaining classifiers with causal concept effect (CACE). arXiv preprint arXiv:1907.07165.

[5] Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. 2018. Isolating sources of disentanglement in variational autoencoders. Advances in neural information processing systems, 31.