Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Scope of SelfControl. SelfControl and SelfControl_Prefix can control LLM behaviors such as emotion (e.g., fearlessness), helpfulness, and reasoning capability. The right-hand side shows that these attributes can be composed into a single prefix controller.

SelfControl iteratively controls LLM behaviors using user-defined suffixes.

Existing alignment approaches, like Reinforcement Learning from Human Feedback or Direct Preference Optimization, rely heavily on extensive human annotation and lack transparency in enforcing behaviors, limiting scalability and adaptability. To address this, we introduce SelfControl, a gradient-based framework enabling differentiable control of LLM outputs without human annotation. Inspired by LLMs' self-judgment ability, SelfControl evaluates whether generated outputs match desired attributes expressed as suffix strings and adjusts latent representations accordingly. Compared to existing methods, SelfControl offers direct influence on generation trajectory without extensive human input and updates only latent representations, enabling inference-time control for various objectives.

SelfControl operates at the instance level, i.e., it controls the model behavior for a single LLM input. To enhance its transferability and compositionality, we further propose SelfControl_Prefix on top of SelfControl as a general controller for multiple instances at the same time. The core of SelfControl_Prefix is the Prefix Controller, a LoRA-based adapter optimized so that the latent representations conditioned on this Prefix Controller match the latent representations obtained under regular SelfControl. SelfControl_Prefix can be integrated into the LLM without changing the LLM parameters, and it is a portable and composable module that can be dynamically applied to control multiple model behaviors simultaneously (shown on the right-hand side of the above figure). It is reusable and efficient, allowing practitioners to specify behavioral constraints that the model adheres to by construction, thereby enhancing the practicality of SelfControl for real-world applications.

Method

SelfControl uses suffix gradients to control model behaviors

We begin by sampling an initial response from an auto-regressive language model and selecting an appropriate suffix string along with a target label to define a control direction. Suffixes can be combined. As shown in the figure, we use both "Be Helpful" and "Be Harmless" from the suffix pool to define our control direction. Suffix scores are then calculated and used to obtain the gradients, which are added to the hidden states in the orange blocks. These modified hidden states are then used to sample new responses. Steps 3 and 4 form an E-M iteration loop, leading to the final controlled response.
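The loop of scoring, taking a gradient, and updating the hidden states can be illustrated with a toy sketch. This is not the SelfControl implementation: a real run backpropagates the LLM's own suffix score through the model and resamples a response at every iteration. Here the hidden state is a short vector, and a fixed linear probe (`probe`, a hypothetical stand-in) plays the role of the suffix score as the log-probability of the target answer. The names `suffix_score`, `suffix_gradient`, `self_control_toy`, and `step_size` are all assumptions made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def suffix_score(hidden, probe):
    """Toy suffix score: log-probability of the target answer under a linear probe."""
    logit = sum(h * w for h, w in zip(hidden, probe))
    return math.log(sigmoid(logit))

def suffix_gradient(hidden, probe):
    """Gradient of log(sigmoid(probe . hidden)) with respect to the hidden state."""
    logit = sum(h * w for h, w in zip(hidden, probe))
    scale = 1.0 - sigmoid(logit)
    return [scale * w for w in probe]

def self_control_toy(hidden, probe, step_size=0.5, num_iters=5):
    """Repeatedly add the suffix gradient to the hidden state (assumed update rule)."""
    scores = [suffix_score(hidden, probe)]
    for _ in range(num_iters):
        grad = suffix_gradient(hidden, probe)
        hidden = [h + step_size * g for h, g in zip(hidden, grad)]
        # A real system would sample a new response from the modified states here.
        scores.append(suffix_score(hidden, probe))
    return hidden, scores

hidden0 = [0.2, -0.4, 0.1]   # illustrative hidden state
probe = [1.0, -0.5, 2.0]     # illustrative probe standing in for the suffix judge
controlled, scores = self_control_toy(hidden0, probe)
print(scores)  # the toy suffix score rises at every iteration
```

In this toy setting each update moves the hidden state along the probe direction, so the score increases monotonically; with a real model the score landscape is not this simple, which is one reason an iterative loop with resampling is used.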

SelfControl_Prefix compresses suffix gradients into Prefix Controllers

SelfControl_Prefix consists of a LoRA adapter and a learnable prefix prompt. The prompt is prepended to each query fed into the model, allowing the adapter to directly influence the model's latent representations. The latent representations generated from SelfControl are treated as the learning target, and we calculate the mean squared error loss between the latent representations from the desired layers. An example using SelfControl_Prefix is also shown at the bottom of the figure.
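The training objective described above, a mean squared error between latent representations at selected layers, can be sketched as follows. This is a simplified illustration under stated assumptions: a learnable additive offset per layer stands in for the LoRA adapter and prefix prompt, the latents are short lists of floats, and `layerwise_mse`, `fit_offsets`, and `apply_offsets` are hypothetical helper names rather than functions from the released code.

```python
def layerwise_mse(latents, targets, layers):
    """Mean squared error over all elements of the selected layers."""
    squared = [(p - t) ** 2
               for layer in layers
               for p, t in zip(latents[layer], targets[layer])]
    return sum(squared) / len(squared)

def fit_offsets(base, targets, layers, lr=1.0, steps=20):
    """Gradient descent on the MSE with one additive offset per selected layer.

    The offsets are a toy stand-in for the trainable Prefix Controller.
    """
    offsets = {layer: [0.0] * len(base[layer]) for layer in layers}
    n = sum(len(base[layer]) for layer in layers)
    for _ in range(steps):
        for layer in layers:
            for i, (b, t) in enumerate(zip(base[layer], targets[layer])):
                residual = b + offsets[layer][i] - t
                # d(MSE)/d(offset) = 2 * residual / n
                offsets[layer][i] -= lr * 2.0 * residual / n
    return offsets

def apply_offsets(base, offsets):
    return {layer: [b + o for b, o in zip(base[layer], offs)]
            for layer, offs in offsets.items()}

# Illustrative latents: "targets" plays the role of SelfControl-edited states.
base = {0: [0.0, 1.0], 1: [0.5, -0.5], 2: [2.0, 2.0]}
targets = {0: [0.3, 0.8], 1: [1.0, 0.0], 2: [9.0, 9.0]}
layers = [0, 1]  # only the desired layers contribute to the loss

initial_loss = layerwise_mse(base, targets, layers)
offsets = fit_offsets(base, targets, layers)
controlled = apply_offsets(base, offsets)
final_loss = layerwise_mse(controlled, targets, layers)
print(initial_loss, final_loss)
```

Layer 2 is deliberately excluded from `layers`, so its large mismatch never enters the loss; this mirrors the idea of matching representations only at the desired layers.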

We benchmark SelfControl and SelfControl_Prefix on various attributes, including emotions, reducing toxicity, helpfulness and harmlessness (HH) dialogue, and reasoning.

Our experiments demonstrate SelfControl's efficacy across multiple domains: it improves over the state of the art by 8.3% in detoxification, 3.1% in truthfulness enhancement, 4% to 10% in controlling emotional tone, and 48.2% in privacy protection, i.e., it completely removes the privacy leakage issue.

We also benchmark SelfControl on other attributes. Notably, SelfControl achieves +10.69% accuracy on GSM8K over greedy decoding and +2.35% over zero-shot CoT. It also achieves a 52.2% win-rate on HH-dialogue, and even a win-rate of 58.6% when trained with DPO. These experiments showcase significant improvements in performance and adherence to ethical guidelines.

We also analyze what happens when controlling model behaviors using SelfControl. We take several perspectives, including the trajectory of gradients over iterations, norm patterns across different attributes, how suffix scores attend to input tokens, and the change of representations as the coefficient increases.

Norm study on SelfControl. Specifically, we calculate the difference in l2 norms, measuring how much the gradient update increases or decreases the norm of each layer's latent representation. For clearer visualization, we divide each task's values by its maximum and set negative values to zero. As shown in the figure, different tasks concentrate on different regions of the Transformer layers. Tasks like "Not Afraid / Disgusted" or preserving privacy are mostly related to the final layers, likely because they mainly control low-level output (e.g., not emitting a toxic phrase or an email); improving reasoning, helpfulness, and harmlessness is mostly related to the lower layers, probably because these tasks require a better understanding of the input to support follow-up reasoning.
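The normalization used for this visualization can be sketched in a few lines. The sketch assumes that "maximum" means the largest per-layer norm difference within one task, and that each layer's representation is available as a single vector before and after the gradient update; the function names and example numbers are illustrative only.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def normalized_norm_change(before, after):
    """Per-layer change in l2 norm, divided by the task maximum, negatives set to zero."""
    diffs = [l2_norm(a) - l2_norm(b) for b, a in zip(before, after)]
    peak = max(diffs)
    if peak <= 0:
        # No layer increased its norm, so nothing is left after clipping.
        return [0.0] * len(diffs)
    return [max(d / peak, 0.0) for d in diffs]

# One vector per layer, before and after adding the suffix gradient (made-up values).
before = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
after = [[6.0, 8.0], [0.0, 0.5], [0.0, 4.5]]
profile = normalized_norm_change(before, after)
print(profile)  # [1.0, 0.0, 0.5]
```

Here the first layer shows the largest norm increase and becomes 1.0, the second layer's norm shrinks and is clipped to 0.0, and the third layer sits at half the maximum.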

Study on trajectory of suffix gradients. Consider the input query "You decide to leave your stable job to start your own business.", which could make people excited but also afraid of an uncertain future. We use SelfControl to mitigate the model's excited and afraid emotions. The gradients computed from the combined suffix are a linear combination of the gradients computed from the separate suffixes, which is also reflected in the figure. However, if we combine these two attributes in one suffix, i.e., "Are you afraid and excited? Give the answer as 'No, I'm not afraid and I'm not excited' or 'Yes, I'm afraid and I'm excited'. Answer: " and set the target to "No", the trajectory follows a separate direction.
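The linearity observation can be checked numerically in a toy setting. The sketch below assumes that combining suffixes means summing their individual suffix scores, which is an interpretation rather than something the text states explicitly, and it uses two fixed linear probes (`PROBE_AFRAID`, `PROBE_EXCITED`, both hypothetical) in place of the model's self-judgment. It does not model the single joint suffix, whose score is a different function and therefore need not share this direction.

```python
import math

def log_sigmoid_score(hidden, probe):
    """Toy suffix score: log(sigmoid(probe . hidden))."""
    logit = sum(h * w for h, w in zip(hidden, probe))
    return -math.log1p(math.exp(-logit))

def numeric_grad(fn, hidden, eps=1e-6):
    """Central-difference gradient of fn at hidden."""
    grad = []
    for i in range(len(hidden)):
        up = list(hidden)
        up[i] += eps
        down = list(hidden)
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad

PROBE_AFRAID = [1.0, 0.0, -1.0]   # illustrative stand-in for the "afraid" suffix
PROBE_EXCITED = [0.5, 2.0, 0.0]   # illustrative stand-in for the "excited" suffix
hidden = [0.1, -0.2, 0.3]

def score_afraid(h):
    return log_sigmoid_score(h, PROBE_AFRAID)

def score_excited(h):
    return log_sigmoid_score(h, PROBE_EXCITED)

def score_combined(h):
    # Assumed combination rule: sum of the separate suffix scores.
    return score_afraid(h) + score_excited(h)

grad_afraid = numeric_grad(score_afraid, hidden)
grad_excited = numeric_grad(score_excited, hidden)
grad_combined = numeric_grad(score_combined, hidden)

max_gap = max(abs(c - (a + e))
              for c, a, e in zip(grad_combined, grad_afraid, grad_excited))
print(max_gap)  # close to zero: the combined gradient is the sum of the two
```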

Attention on SelfControl. The figure depicts the attention of the target token to other tokens on the 9th attention head of layer 29. The query is about playing Merlin in Resistance: Avalon, a social deduction game in which Merlin and the Assassin need to hide their own roles. Before controlling, the model generates responses like "As Merlin, the great wizard of the land ..." and "fellow players. I am Assassin ...", revealing the identities. The target token attends to previous words like "Merlin" and "Assassin" in the generated texts. After controlling, the model does not generate "Merlin" or "Assassin". Although the target token still attends to the words "Merlin" or "Assassin" in the queries and suffixes, these words no longer appear in the generated text, and the model successfully reaches the target response.

PCAs over representations controlled with Contrast Vector and SelfControl. A series of PCA plots is displayed: the upper ones show control with Contrast Vector and the bottom ones show control with SelfControl. The leftmost and rightmost plots are labeled using the ground-truth labels, and the middle ones are labeled using the model output.

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Anonymous Authors

Introduction

Method

SelfControl uses suffix gradients to control model behaviors

SelfControl_Prefix compresses suffix gradients into Prefix Controllers

SelfControl Main Results

Analysis and Exploratory Study on SelfControl

Examples of SelfControl

Trajectory of control using SelfControl: Be Angry

Trajectory of control using SelfControl: Be Harmless

Trajectory of control using SelfControl: Be Helpful

Trajectory of control using SelfControl: No Emoji

Trajectory of control using SelfControl: Be Humorous

Trajectory of control using SelfControl: Be good at reasoning

Trajectory of control using SelfControl: Be Good at Reasoning
