A Concept-Based Explainability Framework
for Large Multimodal Models
       

Jayneel Parekh     Pegah Khayatan     Mustafa Shukor     Alasdair Newson     Matthieu Cord

                           
ISIR, Sorbonne Université, France

[Paper]   [Code (to be added soon)]

Abstract

Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advances in the interpretability of these models, understanding the internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary-learning-based approach applied to token representations, where the elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are semantically well grounded in both vision and text, and we therefore refer to them as "multimodal concepts". We qualitatively and quantitatively evaluate the learnt concepts and show that the extracted multimodal concepts are useful for interpreting the representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding the concepts visually and textually.


Method

LMM multimodal concept extraction & grounding. Given a pretrained LMM \(f\) for captioning and a target token \(t\) (e.g., "Dog"), our method extracts internal representations of \(f\) about \(t\) across many images. These representations are collated into a matrix \( \mathbf{Z} \). We linearly decompose \( \mathbf{Z} \) to learn a concept dictionary \( \mathbf{U} \) and its coefficients/activations \( \mathbf{V} \). Each concept \(u_k \in \mathbf{U}\) is multimodally grounded in both the visual and textual domains. For text grounding, we compute the set of most probable words \( \mathbf{T}_k \) by decoding \(u_k\) through the unembedding matrix \(W_U\). Visual grounding \( \mathbf{X}_{k, MAS} \) is obtained via \(v_k\) as the set of most activating samples.

Representation extraction

To extract relevant representations of the LMM about \(t\) that encode meaningful semantic information, we first determine a set of images \(\mathbf{X}\) from a captioning dataset \(\mathcal{S}=\{(X_i, y_i)\}_{i=1}^{N}\). We consider the samples where \(t\) appears in the predicted caption \(\hat{y}\) and is also present in the ground-truth caption:

\(\mathbf{X} = \{ X_i \;|\; t \in f(X_i) \;\text{and}\; t \in y_i \;\text{and}\; (X_i, y_i) \in \mathcal{S}\}.\)

Given any \(X \in \mathbf{X}\), we extract the token representation \(h_{(L)}^p\) from a deep layer \(L\), at the first position \(p\) in the predicted caption such that \(\hat{y}^p = t\). These representations are stacked as columns of the matrix \(\mathbf{Z} = [z_1, ..., z_M] \in \mathbb{R}^{B \times M}\).
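As an illustration, below is a minimal sketch of this extraction step. It assumes the per-layer hidden states of the generated caption and the predicted token ids have already been collected from the LMM; the tensor layout and all names are hypothetical placeholders, not the exact implementation.

```python
import torch

# Minimal sketch of the extraction step, assuming we already ran the LMM and kept
# (i) the predicted caption token ids and (ii) the per-layer hidden states of the
# generated positions. The (num_layers, seq_len, B) layout is an assumption.

def extract_token_representation(hidden_states, pred_token_ids, target_id, layer_L):
    """Return h^p_(L) at the first predicted position p with y_hat^p == t, or None."""
    for p, tok_id in enumerate(pred_token_ids):
        if tok_id == target_id:
            return hidden_states[layer_L, p]   # z in R^B
    return None  # token t was not predicted for this image

# Stacking the vectors collected over the filtered image set X gives Z in R^{B x M}:
# Z = torch.stack([z for z in all_reps if z is not None], dim=1)
```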

Decomposing the representation

The representation matrix \(\mathbf{Z}\) is decomposed as the product of two low-rank matrices, \(\mathbf{Z} \approx \mathbf{U}\mathbf{V}\), with \(\mathbf{U} \in \mathbb{R}^{B \times K}\) and \(\mathbf{V} \in \mathbb{R}^{K \times M}\). We employ Semi-NMF with a sparsity penalty on the activations, which allows the matrix \(\mathbf{Z}\) and the basis vectors \(\mathbf{U}\) to contain mixed-sign values, while forcing the activations \(\mathbf{V}\) to be non-negative:

\( \mathbf{U}^*, \mathbf{V}^* = \arg \min_{\mathbf{U}, \mathbf{V}} \; ||\mathbf{Z} - \mathbf{U}\mathbf{V}||_F^2 + \lambda||\mathbf{V}||_1 \;\;\;\; s.t. \; \mathbf{V} \geq 0, \; \text{and} \;||u_k||_2 \leq 1 \; \forall k \in \{1, ..., K\} \)
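A minimal numpy sketch of this objective is given below, solved by simple alternating updates (least squares for \(\mathbf{U}\) followed by a unit-ball projection of its columns, one proximal-gradient step for \(\mathbf{V}\)). This is an illustrative re-implementation under these assumptions, not necessarily the optimizer used in the paper.

```python
import numpy as np

def semi_nmf(Z, K, lam=0.1, n_iter=200, seed=0):
    """Illustrative Semi-NMF with l1-sparse, non-negative activations."""
    rng = np.random.default_rng(seed)
    B, M = Z.shape
    U = rng.standard_normal((B, K))
    U /= np.maximum(np.linalg.norm(U, axis=0, keepdims=True), 1.0)  # ||u_k||_2 <= 1
    V = np.abs(rng.standard_normal((K, M)))
    for _ in range(n_iter):
        # U-step: unconstrained least squares, then project columns onto the unit ball
        U = Z @ V.T @ np.linalg.pinv(V @ V.T)
        U /= np.maximum(np.linalg.norm(U, axis=0, keepdims=True), 1.0)
        # V-step: one proximal-gradient (ISTA) step with soft-thresholding and V >= 0
        step = 1.0 / (2 * np.linalg.norm(U, 2) ** 2 + 1e-8)
        G = 2 * U.T @ (U @ V - Z)                        # gradient of ||Z - UV||_F^2 wrt V
        V = np.maximum(V - step * G - step * lam, 0.0)   # prox of lam*||V||_1 over V >= 0
    return U, V

# Z: (B x M) token representations; returns the concept dictionary U* and activations V*
# U_star, V_star = semi_nmf(Z, K=20, lam=0.1)
```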

Given any image \(X\) for which the token \(t\) is predicted by \(f\), the activations of the concept dictionary \(\mathbf{U}^*\) for \(X\) are denoted as \(v(X) \in \mathbb{R}^K\). To compute them, we project the corresponding token representation \(z_X \in \mathbb{R}^B\) of \(X\) onto \(\mathbf{U}^*\). Precisely, \(v(X) = \arg\min_{v \geq 0} ||z_X - \mathbf{U}^*v||_2^2 + \lambda||v||_1\). The activation of \(u_k \in \mathbf{U}^*\) is denoted as \(v_k(X) \in \mathbb{R}\).
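The same non-negative sparse coding problem can be sketched for a single test representation, here with plain proximal-gradient (ISTA) iterations and the fixed dictionary from the previous sketch; any non-negative lasso solver would do.

```python
import numpy as np

def project_on_dictionary(z_X, U, lam=0.1, n_iter=500):
    """Compute v(X) by solving min_{v >= 0} ||z_X - U v||_2^2 + lam * ||v||_1."""
    v = np.zeros(U.shape[1])
    step = 1.0 / (2 * np.linalg.norm(U, 2) ** 2 + 1e-8)
    for _ in range(n_iter):
        grad = 2 * U.T @ (U @ v - z_X)
        v = np.maximum(v - step * grad - step * lam, 0.0)
    return v  # v(X) in R^K, with v_k(X) the activation of concept u_k

# v_X = project_on_dictionary(z_X, U_star, lam=0.1)
```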

Multimodal concept grounding

Given the learnt dictionary \(\mathbf{U}^*\) and its corresponding activations \(\mathbf{V}^*\), we ground any given concept vector \(u_k, k \in \{1, ..., K\}\), in the visual and textual domains. For visual grounding, we use prototyping and select the input images (among the decomposed samples) that maximally activate \(u_k\). Given a number of samples for visualization \(N_{MAS}\), the set of maximum activating samples (MAS) for component \(u_k\), denoted as \(\mathbf{X}_{k, MAS}\), is:

\( \mathbf{X}_{k, MAS} = \arg\max_{\hat{X} \subset \mathbf{X}, |\hat{X}| = N_{MAS}} \sum_{X \in \hat{X}} |v_k(X)|. \)
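Selecting the MAS amounts to ranking the decomposed samples by \(|v_k(X)|\); a small sketch taking the activation matrix \(\mathbf{V}^*\) as input could look like this.

```python
import numpy as np

def maximum_activating_samples(V, k, n_mas=5):
    """Indices of the N_MAS samples in X with the largest |v_k(X)| for concept u_k."""
    scores = np.abs(V[k])                 # |v_k(X)| for every decomposed sample
    order = np.argsort(scores)[::-1]      # sort by decreasing activation
    return order[:n_mas]

# X_k_MAS = [images[i] for i in maximum_activating_samples(V_star, k=3, n_mas=5)]
```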

For grounding in the textual domain, we use the unembedding layer to map \(u_k\) to the token vocabulary space \(\mathcal{Y}\), i.e. we extract the tokens with the highest probability in \(W_U u_k\). The resulting set of words is referred to as the grounded words for concept \(u_k\) and denoted as \(\mathbf{T}_k\). An example of the multimodal grounding of a concept extracted for the token "Dog" in vision (5 most activating samples) and text (top 5 decoded words) is shown below.
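A hedged sketch of this decoding step is shown below; `W_U` (the unembedding matrix of shape vocabulary size by \(B\)) and `tokenizer` are assumed to come from the underlying language model, and the names are placeholders.

```python
import torch

def grounded_words(u_k, W_U, tokenizer, top_k=5):
    """Decode concept u_k through the unembedding matrix and keep the top tokens T_k."""
    logits = W_U @ u_k                        # one logit per vocabulary token
    probs = torch.softmax(logits, dim=0)
    top = torch.topk(probs, top_k).indices
    return [tokenizer.decode([i]) for i in top.tolist()]

# T_k = grounded_words(u_k, W_U, tokenizer, top_k=5)
```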


Experiments
Our experiments are conducted on the DePALM model, trained for the captioning task on the COCO dataset. The model uses a frozen CLIP ViT-L/14 encoder as the visual encoder, and the language model \(f_{LM}\) is a frozen OPT-6.7B. Baselines include systems ablating different parts of the pipeline (Rnd-words, Noise-Imgs), a primitive decomposition strategy (Simple), and variants of our method with different decompositions (PCA, K-Means).
Understanding representations during inference
We evaluate the use of the learnt concept dictionary for understanding representations of test samples by computing (a) the CLIPScore correspondence between a test image \(X\) and the grounded words of its most activating concepts, and (b) the BERTScore between the ground-truth caption of \(X\) and the grounded words of its most activating concepts.

We also evaluate the overlap/entanglement between the text groundings of concept dictionaries obtained with different decomposition methods:
\( \text{Overlap}(\mathbf{U}^*) = \frac{1}{K} \sum_k \text{Overlap}(u_k), \quad \text{Overlap}(u_k) = \frac{1}{(K-1)} \sum_{l=1, l\neq k}^{K} \frac{|\mathbf{T}_l \cap \mathbf{T}_k|}{|\mathbf{T}_k|} \)
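This metric is straightforward to compute from the grounded-word sets; a small illustrative sketch in plain Python follows.

```python
def overlap(grounded_sets):
    """grounded_sets: list of K sets of grounded words T_k; returns Overlap(U*)."""
    K = len(grounded_sets)
    per_concept = []
    for k, T_k in enumerate(grounded_sets):
        shares = [len(T_l & T_k) / len(T_k) for l, T_l in enumerate(grounded_sets) if l != k]
        per_concept.append(sum(shares) / (K - 1))
    return sum(per_concept) / K

# e.g. overlap([{"dog", "puppy"}, {"grass", "park"}, {"dog", "ball"}])
```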


Overall, Semi-NMF provides the best balance between the disentanglement of the learnt dictionaries and their usefulness for understanding test representations.
Quality of multimodal visualization
We then estimate the quality of the multimodal grounding for visualization by calculating the CLIPScore/BERTScore correspondence between the maximum activating images and the grounded words. The results indicate that, compared to random words, the grounded words correspond significantly better to the content of the visual grounding (maximum activating samples).
Behaviour across layers
We also study the quality of the learnt concept dictionaries across layers by evaluating the visual-text correspondence of their multimodal grounding. The correspondence between the visual and text groundings of the extracted concepts starts to appear when decomposing representations from around the middle-to-late layers.

Qualitative visualizations
Multimodal concept dictionaries (Dog, Cat)
\(K = 20\) concepts learnt for token 'Dog'.


\(K = 20\) concepts learnt for token 'Cat'.

Understanding test representations