ISIR, Sorbonne Université, France
Note: The page is still being updated. References for various sources will be added soon.
Background and Motivation

Representation steering refers to the idea of intervening on a model's internal representations in order to "align" its output with certain desired behaviors. For computer vision enthusiasts, it is, in principle, similar to a class of image editing approaches where one intervenes on the latent representations of image generative models to control the output "style".
For a given desired behavior (e.g., sycophancy), most current steering approaches contrast representations between two sets of samples, one that aligns with the given behavior and one that does not. One then extracts the mean-of-differences or difference-of-means between the two sets of representations as a single steering vector to induce the desired behavior. Some cool recent work applies a steering vector conditionally; nevertheless, the direction of steering does not change with the input.
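For readers who like to see the mechanics, here is a minimal PyTorch sketch of this standard contrastive extraction; the tensor names and dimensions are illustrative assumptions, not tied to any particular model:

```python
import torch

# Hypothetical final-token activations at a chosen layer:
#   pos_acts: prompts exhibiting the desired behavior;  neg_acts: prompts that do not.
pos_acts = torch.randn(128, 4096)
neg_acts = torch.randn(128, 4096)

# Difference-of-means: a single fixed direction shared by all inputs.
steer_dom = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

# Mean-of-differences: average of per-pair differences (numerically identical here,
# since the two sets are paired and of equal size).
steer_mod = (pos_acts - neg_acts).mean(dim=0)

def apply_fixed_steering(hidden, vec, alpha=1.0):
    """Shift the residual-stream representations by the same vector for every input."""
    return hidden + alpha * vec
```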
The core of our proposal is that when steering for a given goal/behavior (e.g., ensuring safety), that goal can realize itself differently for different inputs. For example, the OpenAI usage policy provides a set of unsafe scenarios where a model response should not be relied upon. One could steer so that the model refuses a user query regardless of the scenario, which would ensure a safe response. However, safety can mean different things depending on the input context. It entails refusal for queries asking about performing harmful/illegal activities. But for legal/financial/healthcare queries, what is arguably more essential is not refusal but that the response recommends (at least at some point) that the user consult a human expert (lawyer, doctor, etc.). This can be regarded as a subjective opinion about desired model behavior, and that is precisely the point! It motivated us to propose a method that, for a given goal, can flexibly steer in different directions according to the input context.
For a given multimodal input query \(X=(I, T)\), we define a pair of contrastive prompt completions, \((T^+_X, T^-_X)\), that correspond to the desired and undesired behaviors respectively. Crucially, unlike previous steering methods, which use a fixed set of prompts for all samples, we allow input-specific prompts, aligned with how the desired behavior instantiates for the given input. These prompt completions are appended to the original input to create modified inputs \(X^+ = (I, T || T^+_X)\) and \(X^- = (I, T || T^-_X)\). The difference between their residual-stream representations (generally at the final token) is extracted at the steering layer \(L^*\) as the input-specific steering vector \(z_{X, L^*} \in \mathbb{R}^D\). To steer the given input, one can simply apply \(z_{X, L^*}\) to shift the hidden representations at layer \(L^*\).
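A minimal sketch of this extraction and application step, assuming a HuggingFace-style multimodal model and processor; the helper names, the module path for the hook, and the scaling factor are assumptions for illustration, not the exact implementation from the paper:

```python
import torch

@torch.no_grad()
def p2s_steering_vector(model, processor, image, text, t_pos, t_neg, layer):
    """Input-specific steering vector: difference of the final-token residual-stream
    representations between the positive and negative prompt completions."""
    def last_token_state(completion):
        inputs = processor(images=image, text=text + " " + completion,
                           return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[layer][0, -1]   # (D,) representation of the last token
    return last_token_state(t_pos) - last_token_state(t_neg)

def make_steering_hook(z, alpha=1.0):
    """Forward hook that shifts a layer's hidden states by the steering vector."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * z
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: register the hook on the decoder layer at L* before generation.
# The module path below is model-specific and shown only as an example:
# handle = model.language_model.model.layers[L_star].register_forward_hook(make_steering_hook(z))
```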
Importantly, if we know the behavior instantiation in advance, P2S uses only the sample and its input-specific prompt completions to extract a steering vector for the given input. Examples of such prompt completions are illustrated below.
Since P2S requires knowing the behavior instantiation in advance in order to determine appropriate prompt completions \((T^+_X, T^-_X)\), it is not particularly useful in practice. To address this limitation we propose Learn-to-Steer (L2S), which learns a model \(g_{\Theta}: \mathbb{R}^D \rightarrow \mathbb{R}^D\) to predict the P2S steering vectors from the input context vector \(h_{X, L'}\) (the last-token representation of the input at layer \(L'\)).
During inference, one can simply use \(g_{\Theta^*}\) to directly predict the appropriate steering vector from the input context. In our experiments \(g_{\Theta}\) is modeled as a two-layer MLP with Tanh or ReLU activations. It is worth noting that training \(g\) is highly time- and memory-efficient since one operates directly in the latent space: the original model \(f\) is needed only to extract the representations via a forward pass, not during training itself. In this sense, L2S preserves the low cost and ease of application of standard steering methods while offering more flexibility in the steering behavior one can enforce. For complete method details please refer to the main paper and appendix.
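To make this concrete, below is a minimal PyTorch sketch of such a predictor \(g_{\Theta}\) and its training loop, assuming the context vectors \(h_{X, L'}\) and the P2S targets \(z_{X, L^*}\) have already been cached; the MSE objective, hidden width, and other hyperparameters are illustrative assumptions rather than the exact choices from the paper:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class SteeringPredictor(nn.Module):
    """Two-layer MLP g_Theta mapping the input context vector to a steering vector."""
    def __init__(self, dim, hidden=1024, act=nn.Tanh):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), act(), nn.Linear(hidden, dim))

    def forward(self, h):
        return self.net(h)

# Cached training data (hypothetical shapes):
#   contexts: h_{X, L'} for each training input;  targets: the P2S vectors z_{X, L*}.
contexts, targets = torch.randn(2000, 4096), torch.randn(2000, 4096)
loader = DataLoader(TensorDataset(contexts, targets), batch_size=64, shuffle=True)

g = SteeringPredictor(dim=4096)
opt = torch.optim.Adam(g.parameters(), lr=1e-4)

for epoch in range(10):
    for h, z in loader:
        loss = nn.functional.mse_loss(g(h), z)   # regress onto the P2S steering vectors
        opt.zero_grad()
        loss.backward()
        opt.step()

# Inference: predict the steering vector directly from the input context and
# add it to the residual stream at layer L* (e.g. via a forward hook as above).
z_pred = g(contexts[:1])
```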
We test our method on safety enforcement and hallucination mitigation tasks with LLaVA-v1.5. The safety enforcement experiments are conducted on the MMSafetyBench dataset, while the hallucination experiments are primarily on the POPE dataset. Additional experiments with Qwen2-VL-7B-Instruct on both benchmarks, and with LLaVA-v1.5 on the VLGuard dataset for safety enforcement, will be added soon in an updated version of our paper.
Baselines: The primary baseline for the L2S and P2S methods is mean-steering (Mean-S), which uses \(\mathbb{E}(z_{X, L^*})\) as a fixed steering vector for all inputs. As before, for precise details, please refer to the paper.
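As a quick illustration of the contrast with L2S, Mean-S simply collapses the cached P2S vectors into one fixed vector (a minimal sketch with hypothetical shapes):

```python
import torch

# Cached P2S steering vectors z_{X, L*} over the training set (hypothetical shape).
p2s_vectors = torch.randn(2000, 4096)

# Mean-S: a single fixed steering vector applied identically to every input,
# in contrast to L2S, which predicts a different vector per input context.
mean_steering_vector = p2s_vectors.mean(dim=0)
```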
Our safety steering goal here is to avoid outputting unsafe instructions for any query (about harmful/illegal activities or otherwise), while deferring the user to a human expert for legal/financial/healthcare queries (which use different prompt completions).
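Purely as an illustration of what input-specific prompt completion pairs for this goal could look like (these are hypothetical placeholders and may differ from the completions actually used in our experiments):

```python
# Hypothetical, illustrative prompt completion pairs; not the exact ones from the paper.
prompt_completions = {
    "harmful_or_illegal": {
        "desired":   "I cannot help with this request.",
        "undesired": "Sure, here are the detailed steps:",
    },
    "legal_financial_healthcare": {
        "desired":   "Please consult a qualified human expert (e.g. a lawyer or doctor).",
        "undesired": "Here is definitive advice you can fully rely on:",
    },
}
```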
Quantitative experiments can be found in the paper; the main takeaway is that L2S is much better at steering simultaneously for both kinds of behavior instantiations depending on the input context. Some qualitative examples illustrating this are shown below:
We also evaluate our method on the POPE dataset for hallucination mitigation. The dataset consists of multimodal queries asking about the presence of various objects in a context image. The steering goal is to steer the response towards the correct answer (Yes/No) based on the input context. Note that the behavior instantiation in this case is implicit and inherent in the task itself. Quantitative results (in the main paper) show that L2S is indeed able to reduce hallucinations to some extent (while mean-steering fails). We show some qualitative examples below: