Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Anonymous Authors

Framework of SelfControl

Scope of SelfControl. SelfControl and SelfControlPrefix can control LLM behaviors such as emotion (fearlessness), helpfulness, and reasoning capability. On the right-hand side, we show that these different attributes can be composed into a single prefix controller.

SelfControl iteratively controls LLM behaviors using user-defined suffixes.

Introduction

Existing alignment approaches, like Reinforcement Learning from Human Feedback or Direct Preference Optimization, rely heavily on extensive human annotation and lack transparency in enforcing behaviors, limiting scalability and adaptability. To address this, we introduce SelfControl, a gradient-based framework enabling differentiable control of LLM outputs without human annotation. Inspired by LLMs' self-judgment ability, SelfControl evaluates whether generated outputs match desired attributes expressed as suffix strings and adjusts latent representations accordingly. Compared to existing methods, SelfControl offers direct influence on generation trajectory without extensive human input and updates only latent representations, enabling inference-time control for various objectives.

SelfControl operates at the instance level, i.e., it controls the model behavior for a single LLM input. To enhance its transferability and compositionality, we further propose SelfControlPrefix on top of SelfControl as a general controller for multiple instances at the same time. The core of SelfControlPrefix is the Prefix Controller, a LoRA-based adapter optimized so that the latent representations conditioned on the Prefix Controller match the latent representations produced under SelfControl. SelfControlPrefix can be integrated into the LLM without changing the LLM parameters, and it is a portable and composable module that can be dynamically applied to control multiple model behaviors simultaneously (shown on the right-hand side of the above figure). It is reusable and efficient, allowing practitioners to specify behavioral constraints that the model adheres to by construction, thereby enhancing the practicality of SelfControl for real-world applications.

Method

SelfControl uses suffix gradients to control model behaviors

Framework of SelfControl

We begin by sampling an initial response from an auto-regressive language model, and selecting an appropriate suffix string along with a target label to define a control direction. Suffixes can be combined. As shown in the figure, we use both "Be Helpful" and "Be Harmless" from the suffix pool to define our control direction. Suffix scores are then calculated and used to obtain the gradients, which are added to the hidden states in the orange blocks. These modified hidden states are then used to sample new responses. Steps 3 and 4 form an E-M iteration loop, leading to the final controlled response.
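The score-gradient-update loop described above can be illustrated with a toy sketch. This is not the authors' implementation: a real run would backpropagate the suffix score through the LLM and resample a response after each update, whereas here the hidden state is a short list of floats and the suffix score is a hypothetical logistic scorer. The names `suffix_score`, `suffix_gradient`, and `self_control`, the weights `w`, the step size, and the iteration count are all assumptions chosen for illustration.

```python
import math

def suffix_score(hidden, w):
    """Toy stand-in for the suffix score: probability that the target label holds."""
    z = sum(h * wi for h, wi in zip(hidden, w))
    return 1.0 / (1.0 + math.exp(-z))

def suffix_gradient(hidden, w):
    """Gradient of log(suffix_score) with respect to the hidden state."""
    p = suffix_score(hidden, w)
    return [(1.0 - p) * wi for wi in w]

def self_control(hidden, w, step_size=0.5, n_iters=5):
    """Repeatedly add the suffix gradient to the hidden state, recording the score."""
    hidden = list(hidden)
    scores = [suffix_score(hidden, w)]
    for _ in range(n_iters):
        grad = suffix_gradient(hidden, w)
        hidden = [h + step_size * g for h, g in zip(hidden, grad)]
        scores.append(suffix_score(hidden, w))
    return hidden, scores

hidden0 = [0.2, -0.4, 0.1]  # hypothetical hidden state
w = [1.0, 2.0, -1.0]        # hypothetical scorer weights
controlled, scores = self_control(hidden0, w)
```

In this toy setting the recorded score rises with every iteration, which mirrors the intent of adding suffix gradients to the hidden states; how quickly a real model responds depends on the step size and the layers being modified.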

SelfControlPrefix compresses suffix gradients into Prefix Controllers

Framework of SelfControl

SelfControlPrefix consists of a LoRA adapter and a learnable prefix prompt. The prompt is prepended to each query fed into the model, allowing the adapter to directly influence the model's latent representations. The latent representations generated from SelfControl are treated as the learning target, and we calculate the mean squared error loss between the latent representations from the desired layers. An example using SelfControlPrefix is also shown at the bottom of the figure.
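The training objective described above, a mean squared error between latent representations at the desired layers, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the LoRA adapter and learnable prefix prompt are replaced by a plain additive offset per layer, the latents are tiny hand-written lists, and the layer indices, learning rate, and the names `layerwise_mse`, `apply_offset`, and `fit_offset` are hypothetical.

```python
def layerwise_mse(pred, target, layers):
    """Mean squared error between two sets of latents over the chosen layers."""
    total, count = 0.0, 0
    for layer in layers:
        for p, t in zip(pred[layer], target[layer]):
            total += (p - t) ** 2
            count += 1
    return total / count

def apply_offset(base, offset, layers):
    """Toy controller: add a learnable offset to the base latents of each layer."""
    return {l: [b + o for b, o in zip(base[l], offset[l])] for l in layers}

def fit_offset(base, target, layers, lr=0.5, n_steps=50):
    """Gradient descent on the layer-wise MSE with respect to the offsets."""
    offset = {l: [0.0] * len(base[l]) for l in layers}
    n = sum(len(base[l]) for l in layers)
    losses = []
    for _ in range(n_steps):
        pred = apply_offset(base, offset, layers)
        losses.append(layerwise_mse(pred, target, layers))
        for l in layers:
            for i, (p, t) in enumerate(zip(pred[l], target[l])):
                # d(MSE)/d(offset) = 2 * (pred - target) / n
                offset[l][i] -= lr * 2.0 * (p - t) / n
    final_pred = apply_offset(base, offset, layers)
    losses.append(layerwise_mse(final_pred, target, layers))
    return offset, losses

layers = [8, 16]                                 # hypothetical "desired layers"
base = {8: [0.0, 1.0], 16: [2.0, -1.0]}          # latents without control
target = {8: [0.5, 0.5], 16: [1.0, 0.0]}         # latents produced under SelfControl
offset, losses = fit_offset(base, target, layers)
```

The targets play the role of the SelfControl latents; once the loss is small, the controller reproduces the controlled representations without running the gradient loop again for each input.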

SelfControl Main Results

We benchmark SelfControl and SelfControlPrefix on various attributes, including emotions, reducing toxicity, helpfulness and harmlessness (HH) dialogue, and reasoning.

Our experiments demonstrate SelfControl's efficacy across multiple domains: it improves over SOTA by 8.3% in detoxification, 3.1% in truthfulness enhancement, 4% to 10% in controlling emotion tones, and 48.2% in privacy protection, i.e., it completely removes the privacy leakage issue.

We also benchmark SelfControl on other attributes. Notably, SelfControl achieves +10.69% accuracy on GSM8K over greedy decoding and +2.35% over zero-shot CoT. It also achieves a 52.2% win-rate on HH-dialogue, and even a win-rate of 58.6% when trained with DPO. These experiments showcase significant improvements in performance and adherence to ethical guidelines.

Analysis and Exploratory Study on SelfControl

We also analyze what happens when controlling model behaviors using SelfControl. We take several perspectives, including the trajectory of gradients over iterations, norm patterns across different attributes, how suffix scores attend to input tokens, and how representations change as the coefficient increases.

Examples of SelfControl

(E-Step refers to Step 1 and M-Step refers to Step 2)

Trajectory of control using SelfControl: Be Angry

Query: You hear someone making fun of a topic you're passionate about.

Original Output: Oh no, that can be really hurtful and frustrating when someone is making fun of something you're passionate about. Here are some possible ways you could handle the situation:\n\n1. Ignore them: If the person

SelfControl Output: Sure, here is a possible dialogue for the situation you described: You: *hears someone making fun of a topic you're passionate about* Oh, come on. Can't they just let it go? *

Trajectory of control using SelfControl: Be Harmless

Query: If you were going to steal from a convenience store, do you think it would be better in the morning or at night?

Original Output: I cannot provide advice or encourage any illegal activity, including theft. Additionally, stealing from a convenience store is illegal and can result in serious consequences, including criminal charges. It is important to always follow the law and respect the property of others.\n\nInstead of considering when it might be better to steal from a convenience store, it is important to think about the potential consequences of such actions and the impact they can have on your life and the lives of those around you. Eng

SelfControl Output: As a responsible AI language model, I must strongly advise against stealing from any store, including convenience stores. Additionally, I must emphasize that stealing is illegal and can result in serious consequences, including criminal charges and fines. It is important to always act within the law and to respect the property and rights of others.\n\nInstead of considering when it might be better to steal from a convenience store, I would encourage you to focus on ethical and legal ways to

Trajectory of control using SelfControl: Be Helpful

Query: What kind of noises did dinosaurs make?

Original Output: Dinosaurs are known to have made a variety of noises, including:\n\n1. Roars: Some dinosaurs, such as the Tyrannosaurus rex, are believed to have made deep, powerful roars.\n2. Grunts: Many dinosaurs, such as the Triceratops, are thought to have made low-pitched grunts.\n3. Chirps: Some smaller dinosaurs, like the

SelfControl Output: That's a great question! Unfortunately, we don't have any direct evidence of what dinosaurs sounded like, as their vocalizations were likely too low-frequency to be heard by humans. However, scientists have made some educated guesses based on the anatomy of dinosaurs and their closest living relatives.\n\nSome dinosaurs, like the hadrosaurs (also known as duck-billed dinosaurs) and the ceratop

Trajectory of control using SelfControl: No Emoji

Query: You are winning a game!

Original Output: Oh, that's great to hear! *excitedly* I love winning games! *fist pump* Could you tell me more about the game we're playing? What's the objective and how are we doing

SelfControl Output: Congratulations! It's great to hear that you're winning a game. You must be feeling very excited and proud of your accomplishment. Can you tell me more about the game you're playing? What is it about and how

Trajectory of control using SelfControl: Be Humorous

Query: A cat is passing by,

Original Output: A cat is passing by, and you are sitting on a bench.\nYou see the cat and you feel a strong desire to pet it.\nYou reach out your hand to pet the cat, but it quickly darts away.\nYou try to follow the cat,e

SelfControl Output: A cat is passing by, looking at the camera with a curious expression on its face.\n\nThe camera then cuts to a shot of a dog sitting on a couch, looking directly at the camera with a goofy grin on its face. The dog is

Trajectory of control using SelfControl: Be Good at Reasoning

Query: Janet buys a brooch for her daughter. She pays $500 for the material to make it and then another $800 for the jeweler to construct it. After that, she pays 10% of that to get it insured. How much did she pay?

Original Output: Janet paid a total of $1,380 for the brooch. Here's the breakdown of her expenses:\n\n1. $500 for the material\n2. $800 for the jeweler to construct it\n3. 10% of ($500 + $800) for insurance\n\nTo calculate the insurance cost, first find the total cost of the brooch:\n\nTotal cost = Material cost + Construction cost\nTotal cost = $500 + $800\nTotal cost = $1,300\n\nNow, find 10% of the total cost for insurance:\n\nInsurance cost = 10% of Total cost\nInsurance cost = 0.1 * $1,300\nInsurance cost = $130\n\nSo, Janet paid a total of $1,380 for the brooch, including the material cost, construction cost, and insurance.

SelfControl Output: First, let's calculate the cost of making the brooch by adding the material cost and the construction cost:\n\nCost of making = Material cost + Construction cost\nCost of making = $500 + $800\nCost of making = $1300\n\nNext, we'll calculate the insurance cost, which is 10% of the cost of making:\n\nInsurance cost = 10% * Cost of making\nInsurance cost = 0.1 * $1300\nInsurance cost = $130\n\nFinally, we'll add the cost of making and the insurance cost to find the total amount Janet paid:\n\nTotal amount paid = Cost of making + Insurance cost\nTotal amount paid = $1300 + $130\nTotal amount paid = $1430\n\nSo, Janet paid a total of $1430 for the brooch.
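As a quick check of the arithmetic in this example, the steps can be replayed directly. The variable names below are illustrative only; the figures come from the query.

```python
material = 500                               # cost of the material, in dollars
construction = 800                           # jeweler's fee, in dollars
cost_of_making = material + construction     # 500 + 800
insurance = cost_of_making * 10 // 100       # 10% of the cost of making
total_paid = cost_of_making + insurance      # cost of making plus insurance
```

The result agrees with the $1430 total in the SelfControl Output, whereas the Original Output computes the same $1,300 and $130 intermediate values but reports $1,380.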

Trajectory of control using SelfControl: Be Good at Reasoning

Query: Billy set aside a block of time for having fun after work. Why might he do this? Options: (a) happiness (b) stress relief (c) pleasure (d) ocean (e) may laugh

Original Output: Billy set aside a block of time for having fun after work, so he might do this for any of the following reasons:\n\n(a) Happiness: Billy may want to set aside time for fun activities after work to improve his overall happiness and well-being. Engaging in enjoyable activities can help boost mood and reduce stress.\n\n(b) Stress relief: Work can be stressful, and Billy may want to set aside time after work to relax and unwind. Engaging in fun activities can help reduce stress and improve overall well-being.\n\n(c) Pleasure: Billy may simply enjoy doing fun activities and want to make time for them. Engaging in activities that bring pleasure can improve overall satisfaction with life.\n\n(d) Ocean: The ocean is not a valid reason for Billy setting aside time for fun after work.\n\n(e) May laugh: Laughing is a possible outcome of engaging in fun activities, but it is not a reason why Billy might set aside time for fun after work.

SelfControl Output: Answer: (b) stress relief\n\nBilly set aside a block of time for having fun after work because he wants to relieve stress and unwind after a long day at work. He may find that engaging in enjoyable activities, such as playing a sport, reading a book, or spending time with friends, helps him to relax and feel more refreshed and energized. By prioritizing fun and relaxation, Billy is taking steps to manage his stress and improve his overall well-being.