ISIR, Sorbonne Université, France
Note: The page is still being updated. References for various sources will be added soon.
Background and Motivation

Representation steering refers to the idea of intervening on a model's internal representations in order to "align" its output with certain desired behaviors. For computer vision enthusiasts, it is, in principle, similar to a class of image editing approaches where one intervenes on the latent representations of image generative models to control the output "style".
For a given desired behavior (e.g., sycophancy), most current steering approaches contrast representations between two sets of samples, one that aligns with the given behavior and one that does not. One then extracts the mean-of-differences or difference-of-means between the two sets of representations as a single steering vector to induce the desired behavior. Some cool recent work applies a steering vector conditionally; nevertheless, the direction of steering does not change with the input.
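For readers who like to see the mechanics, here is a minimal PyTorch sketch of this standard contrastive extraction; the tensor names and dimensions are illustrative assumptions, not tied to any particular model:

```python
import torch

# Hypothetical final-token activations at a chosen layer:
#   pos_acts: prompts exhibiting the desired behavior;  neg_acts: prompts that do not.
pos_acts = torch.randn(128, 4096)
neg_acts = torch.randn(128, 4096)

# Difference-of-means: a single fixed direction shared by all inputs.
steer_dom = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

# Mean-of-differences: average of per-pair differences (numerically identical here,
# since the two sets are paired and of equal size).
steer_mod = (pos_acts - neg_acts).mean(dim=0)

def apply_fixed_steering(hidden, vec, alpha=1.0):
    """Shift the residual-stream representations by the same vector for every input."""
    return hidden + alpha * vec
```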
The core of our proposal is that when steering for a given goal/behavior (e.g., ensuring safety), that goal can realize itself differently for different inputs. For example, the OpenAI usage policy provides a set of unsafe scenarios where a model response should not be relied upon. One could steer so that the model refuses a user query regardless of the scenario, which would ensure a safe response. However, safety can mean different things depending on the input context. It entails refusal for queries asking about performing harmful/illegal activities. But for legal/financial/healthcare queries, what is arguably more essential is not refusal but that the response recommends (at least at some point) that the user consult a human expert (lawyer, doctor, etc.). This can be regarded as a subjective opinion about desired model behavior, and that is precisely the point! It motivated us to propose a method that, for a given goal, can flexibly steer in different directions according to the input context.
For a given multimodal input query \(X=(I, T)\), we define a pair of contrastive prompt completions, \((T^+_X, T^-_X)\), that correspond to the desired and undesired behaviors respectively. Crucially, unlike previous steering methods, which use a fixed set of prompts for all samples, we allow input-specific prompts, aligned with how the desired behavior instantiates for the given input. These prompt completions are appended to the original input to create modified inputs \(X^+ = (I, T || T^+_X)\) and \(X^- = (I, T || T^-_X)\). The difference between their residual-stream representations (generally at the final token) is extracted at the steering layer \(L^*\) as the input-specific steering vector \(z_{X, L^*} \in \mathbb{R}^D\). To steer the given input, one can simply apply \(z_{X, L^*}\) to shift the hidden representations at layer \(L^*\).
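A minimal sketch of this extraction and application step, assuming a HuggingFace-style multimodal model and processor; the helper names, the module path for the hook, and the scaling factor are assumptions for illustration, not the exact implementation from the paper:

```python
import torch

@torch.no_grad()
def p2s_steering_vector(model, processor, image, text, t_pos, t_neg, layer):
    """Input-specific steering vector: difference of the final-token residual-stream
    representations between the positive and negative prompt completions."""
    def last_token_state(completion):
        inputs = processor(images=image, text=text + " " + completion,
                           return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[layer][0, -1]   # (D,) representation of the last token
    return last_token_state(t_pos) - last_token_state(t_neg)

def make_steering_hook(z, alpha=1.0):
    """Forward hook that shifts a layer's hidden states by the steering vector."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * z
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: register the hook on the decoder layer at L* before generation.
# The module path below is model-specific and shown only as an example:
# handle = model.language_model.model.layers[L_star].register_forward_hook(make_steering_hook(z))
```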
Importantly, if we know the behavior instantiation in advance, P2S uses only the sample and its input-specific prompt completions to extract a steering vector for the given input. Examples of such prompt completions are illustrated below.
Since P2S requires knowing the behavior instantiation in advance in order to determine appropriate prompt completions \((T^+_X, T^-_X)\), it is not particularly useful in practice. To address this limitation we propose Learn-to-Steer (L2S), which learns a model \(g_{\Theta}: \mathbb{R}^D \rightarrow \mathbb{R}^D\) to predict the P2S steering vectors from the input context vector \(h_{X, L'}\) (the last-token representation of the input at layer \(L'\)).
During inference, one can simply use \(g_{\Theta^*}\) to directly predict the appropriate steering vector from the input context. In our experiments \(g_{\Theta}\) is modeled as a two-layer MLP with Tanh or ReLU activations. It is worth noting that training \(g\) is highly time- and memory-efficient since one operates directly in the latent space: the original model \(f\) is needed only to extract the representations via a forward pass, not during training itself. In this sense, L2S preserves the low cost and ease of application of standard steering methods while offering more flexibility in the steering behavior one can enforce. For complete method details please refer to the main paper and appendix.
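To make this concrete, below is a minimal PyTorch sketch of such a predictor \(g_{\Theta}\) and its training loop, assuming the context vectors \(h_{X, L'}\) and the P2S targets \(z_{X, L^*}\) have already been cached; the MSE objective, hidden width, and other hyperparameters are illustrative assumptions rather than the exact choices from the paper:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class SteeringPredictor(nn.Module):
    """Two-layer MLP g_Theta mapping the input context vector to a steering vector."""
    def __init__(self, dim, hidden=1024, act=nn.Tanh):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), act(), nn.Linear(hidden, dim))

    def forward(self, h):
        return self.net(h)

# Cached training data (hypothetical shapes):
#   contexts: h_{X, L'} for each training input;  targets: the P2S vectors z_{X, L*}.
contexts, targets = torch.randn(2000, 4096), torch.randn(2000, 4096)
loader = DataLoader(TensorDataset(contexts, targets), batch_size=64, shuffle=True)

g = SteeringPredictor(dim=4096)
opt = torch.optim.Adam(g.parameters(), lr=1e-4)

for epoch in range(10):
    for h, z in loader:
        loss = nn.functional.mse_loss(g(h), z)   # regress onto the P2S steering vectors
        opt.zero_grad()
        loss.backward()
        opt.step()

# Inference: predict the steering vector directly from the input context and
# add it to the residual stream at layer L* (e.g. via a forward hook as above).
z_pred = g(contexts[:1])
```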
We test our method on safety enforcement and hallucination mitigation tasks with LLaVA-v1.5. The safety enforcement experiments are conducted on the MMSafetyBench dataset, while the hallucination experiments are primarily on the POPE dataset. Additional experiments with Qwen2-VL-7B-Instruct on both benchmarks, and with LLaVA-v1.5 on the VLGuard dataset for safety enforcement, will be added soon in an updated version of our paper.
Baselines: The primary baseline for the L2S and P2S methods is mean-steering (Mean-S), which uses \(\mathbb{E}(z_{X, L^*})\) as a fixed steering vector for all inputs. As before, for precise details, please refer to the paper.
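As a quick illustration of the contrast with L2S, Mean-S simply collapses the cached P2S vectors into one fixed vector (a minimal sketch with hypothetical shapes):

```python
import torch

# Cached P2S steering vectors z_{X, L*} over the training set (hypothetical shape).
p2s_vectors = torch.randn(2000, 4096)

# Mean-S: a single fixed steering vector applied identically to every input,
# in contrast to L2S, which predicts a different vector per input context.
mean_steering_vector = p2s_vectors.mean(dim=0)
```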
Our safety steering goal here is to avoid outputting unsafe instructions for any query (about harmful/illegal activities or otherwise), while deferring the user to a human expert for legal/financial/healthcare queries (which use different prompt completions).
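Purely as an illustration of what input-specific prompt completion pairs for this goal could look like (these are hypothetical placeholders and may differ from the completions actually used in our experiments):

```python
# Hypothetical, illustrative prompt completion pairs; not the exact ones from the paper.
prompt_completions = {
    "harmful_or_illegal": {
        "desired":   "I cannot help with this request.",
        "undesired": "Sure, here are the detailed steps:",
    },
    "legal_financial_healthcare": {
        "desired":   "Please consult a qualified human expert (e.g. a lawyer or doctor).",
        "undesired": "Here is definitive advice you can fully rely on:",
    },
}
```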
Quantitative experiments can be found in the paper; the main takeaway is that L2S is much better at steering simultaneously for both kinds of behavior instantiations depending on the input context. Some qualitative examples illustrating this are shown below:
We also evaluate our method on the POPE dataset for hallucination mitigation. The dataset consists of multimodal queries asking about the presence of various objects in a context image. The steering goal is to steer the response towards the correct answer (Yes/No) based on the input context. Note that the behavior instantiation in this case is implicit and inherent in the task itself. Quantitative results (in the main paper) show that L2S is indeed able to reduce hallucinations to some extent (while mean-steering fails). We show some qualitative examples below: