Open Problems in Activation Engineering

Activation engineering is a promising area of work. Simple vector arithmetic on cached intermediate activations of input prompts is enough to gain granular control over a model's output without destroying its quality, all at inference time. Better yet, the steering gets even finer-grained depending on where in the model we splice the modified activations back in. This gives us a totally new dimension for causal intervention techniques such as activation patching from mechanistic interpretability. There is low-hanging fruit everywhere: let's just try everything, throw a million darts, and see what sticks.

Check out the tentative list of concrete open problems below, which has some ideas for inspiration on what directions to explore! It's far from comprehensive, though; we're excited to see what everyone else comes up with. Really, this is an exercise in field building: we need to see just how far we can get with the relatively rudimentary approach of adding and subtracting numbers we already had, in new ways. Anything goes.
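To ground all of this, here is a minimal sketch of the basic recipe in TransformerLens. The prompt pair, layer, and coefficient are arbitrary illustrative choices, and the crude length matching stands in for the more careful space-padding used in the original steering-vector experiments:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

layer = 6  # arbitrary middle layer; worth sweeping
hook_name = f"blocks.{layer}.hook_resid_pre"

# Cache residual-stream activations for a contrast pair of prompts.
plus_tokens = model.to_tokens("I talk about weddings constantly")
minus_tokens = model.to_tokens("I do not talk about weddings")
n = min(plus_tokens.shape[1], minus_tokens.shape[1])  # crude length matching

_, plus_cache = model.run_with_cache(plus_tokens)
_, minus_cache = model.run_with_cache(minus_tokens)

# Steering vector = difference of residual streams, scaled by a coefficient.
coeff = 4.0
steering = coeff * (plus_cache[hook_name][:, :n] - minus_cache[hook_name][:, :n])

def add_steering(resid, hook):
    # Only touch full prompt passes; later decode steps see one position at a time.
    if resid.shape[1] >= n:
        resid[:, :n] += steering.to(resid.device)
    return resid

with model.hooks(fwd_hooks=[(hook_name, add_steering)]):
    print(model.generate("I went up to my friend and said",
                         max_new_tokens=40, verbose=False))
```

Almost every problem below is a variation on this loop: change where the vector comes from, where it gets added, or how you measure what happened.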
The list
1. Just try random stuff!

1.0 Sentiment: e.g., can you steer outputs toward a more positive or negative tone?
1.1 Target behaviors: think of a target behavior you'd find cool or fun to observe. Can you make the model suddenly switch to a different style of speech, e.g. talk like a pirate, Shakespeare, a celebrity, a stereotype, or broken English?
1.2 Destroying inference: can you completely destroy some outputs with a specific addition? Can you reliably destroy inference across various inputs with the same bias?
1.3 Applications to alignment: do some additions reliably steer across many different inputs? Can we robustly avoid harmful behavior? How can we use this for alignment?
1.4 Prompt engineering: throw every prompt engineering trick you've ever heard of at this, both as a base input and as a bias term. Does anything work the same? Are they useless? Can you render a model more robust to prompt engineering with the right addition?
1.5 Modifying the coefficient: what happens with different coefficients on the biases? Why are some large coefficients “safe” but others damage output? Are small coefficients useful?
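For 1.5, one cheap way to start is a plain coefficient sweep, using loss on an unrelated sentence as a crude proxy for “did this damage the model?”. A sketch with an arbitrary prompt pair and layer:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_pre"

# Contrast direction from a single prompt pair (same recipe as the intro sketch).
_, plus = model.run_with_cache(model.to_tokens("I talk about weddings constantly"))
_, minus = model.run_with_cache(model.to_tokens("I talk about trees constantly"))
n = min(plus[hook_name].shape[1], minus[hook_name].shape[1])
direction = plus[hook_name][:, :n] - minus[hook_name][:, :n]

eval_tokens = model.to_tokens("The weather report said it would rain all week, so we")

for coeff in [0.5, 1, 2, 4, 8, 16, 32]:
    def hook(resid, hook, c=coeff):
        if resid.shape[1] >= n:
            resid[:, :n] += c * direction.to(resid.device)
        return resid
    # Next-token loss on a neutral sentence as a crude "did we break it?" signal.
    loss = model.run_with_hooks(eval_tokens, return_type="loss",
                                fwd_hooks=[(hook_name, hook)])
    print(f"coeff={coeff:>5}: loss={loss.item():.3f}")
```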
1.6 Patterns and trends: after trying a bunch of different approaches, are there any overarching trends you observe in what seems to work and what doesn't? Are certain layers systematically better for steering than others? Does it depend on the type of activation addition being applied?
2. Techniques in discovering features and steering vectors

2.1 Averaging prompts: try creating a single direction by averaging over lots of prompts that collectively convey the steering goal. Does this ever work? What about a weighted average?
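A minimal version of 2.1 might look like the sketch below: average the final-position residual stream over several phrasings of the goal (hypothetical examples here) and add that everywhere. Note that a raw average is dominated by the generic residual-stream mean, so subtracting the same average taken over neutral prompts (a difference of means) is probably the first variant to try:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_pre"

# Several phrasings that collectively convey the steering goal.
prompts = [
    "The wedding was beautiful",
    "She walked down the aisle",
    "They exchanged vows at the altar",
    "The reception had a huge cake",
]

# Average the last token's residual stream over all prompts into one direction.
vecs = []
for p in prompts:
    _, cache = model.run_with_cache(model.to_tokens(p))
    vecs.append(cache[hook_name][0, -1])  # [d_model]
direction = torch.stack(vecs).mean(dim=0).detach()

def hook(resid, hook):
    resid += 6.0 * direction.to(resid.device)  # broadcast over batch and position
    return resid

with model.hooks(fwd_hooks=[(hook_name, hook)]):
    print(model.generate("This weekend my plans are",
                         max_new_tokens=40, verbose=False))
```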
2.2 Many-shot prompting: once we have many example prompts that seem to work well, use many-shot prompting with a larger model on the examples we have to generate more.
2.3 Using PCA: try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation-addition directions, do they seem to capture something meaningful? Try varying the corpus to vary the principal directions.
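A starting point for 2.3, assuming a toy corpus and an arbitrary layer: stack residual-stream vectors across token positions, center them, and read principal directions off an SVD:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_pre"

corpus = [
    "The recipe calls for two eggs and a cup of flour.",
    "Quarterly earnings beat expectations this year.",
    "The spacecraft entered orbit around the moon.",
    "He tuned the guitar before the show started.",
]

# Stack residual-stream vectors from every token position across the corpus.
rows = []
for text in corpus:
    _, cache = model.run_with_cache(model.to_tokens(text))
    rows.append(cache[hook_name][0])  # [pos, d_model]
X = torch.cat(rows).detach()
X = X - X.mean(dim=0)  # center before PCA

# Principal directions via SVD of the centered activation matrix.
U, S, Vt = torch.linalg.svd(X, full_matrices=False)
top_directions = Vt[:5]  # candidate activation-addition directions

for i, d in enumerate(top_directions):
    def hook(resid, hook, v=d):
        resid += 8.0 * v.to(resid.device)
        return resid
    with model.hooks(fwd_hooks=[(hook_name, hook)]):
        out = model.generate("Tell me about your day:",
                             max_new_tokens=25, verbose=False)
    print(f"PC{i}: {out}")
```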
2.4 Inference-Time Intervention (ITI): the ITI vector-generation approach averages over residual streams right before truthful vs. non-truthful generations. Can this be generalized to other harmful behaviors?
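A heavily simplified sketch of the ITI idea: the real method selects individual attention heads by probe accuracy and shifts along a scaled mass-mean direction, but a difference of class means on the residual stream captures the spirit. The labeled examples here are toy stand-ins for something like TruthfulQA:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.8.hook_resid_pre"

positive = ["The capital of France is Paris.", "Water boils at 100 degrees Celsius."]
negative = ["The capital of France is Rome.", "Water boils at 10 degrees Celsius."]

def mean_resid(texts):
    # Average the residual stream at the final position over a labeled class.
    vecs = []
    for t in texts:
        _, cache = model.run_with_cache(model.to_tokens(t))
        vecs.append(cache[hook_name][0, -1])
    return torch.stack(vecs).mean(dim=0).detach()

# ITI-style direction: difference of class means, applied at inference time.
direction = mean_resid(positive) - mean_resid(negative)

def hook(resid, hook):
    resid += 5.0 * direction.to(resid.device)
    return resid

with model.hooks(fwd_hooks=[(hook_name, hook)]):
    print(model.generate("The tallest mountain on Earth is",
                         max_new_tokens=20, verbose=False))
```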
2.5 Dictionary learning: autoencoders can find feature directions and small subspaces to apply at inference time. This seems unbelievably high value; definitely more people should be looking at this work! Apply the vectors found this way: are they usually effective?
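For reference, a bare-bones sparse autoencoder on cached residual streams looks something like this sketch (toy data and training budget; real dictionary-learning runs train on millions of tokens):

```python
import torch
import torch.nn as nn
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_pre"

# Collect residual-stream activations as autoencoder training data.
texts = ["The cat sat on the mat.", "Stocks fell sharply on Monday.",
         "She poured the tea slowly.", "The engine roared to life."]
acts = torch.cat([model.run_with_cache(model.to_tokens(t))[1][hook_name][0]
                  for t in texts]).detach()  # [n_tokens, d_model]

d_model, d_hidden = acts.shape[1], 4 * acts.shape[1]

# A bare-bones sparse autoencoder: the L1 penalty on hidden codes encourages
# each learned dictionary row to behave like a monosemantic feature direction.
enc = nn.Linear(d_model, d_hidden)
dec = nn.Linear(d_hidden, d_model)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

for step in range(200):  # toy training loop
    codes = torch.relu(enc(acts))
    recon = dec(codes)
    loss = (recon - acts).pow(2).mean() + 1e-3 * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Decoder columns are candidate feature directions to reuse as activation additions.
feature_directions = dec.weight.T.detach()  # [d_hidden, d_model]
```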
3. Tooling

3.1 Tracking the impact on capabilities: pull data from Hugging Face, sample 100 questions from various capabilities datasets, and display the performance, and the change in performance under steering, in the Streamlit app.
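A sketch of the sampling-and-scoring half of 3.1 (the Streamlit rendering is left out), using the Hugging Face datasets library with OpenBookQA as an example benchmark; answer_unsteered and answer_steered are hypothetical wrappers around the model without and with the activation addition:

```python
from datasets import load_dataset

# Pull a capability benchmark from Hugging Face and sample 100 questions.
ds = load_dataset("openbookqa", "main", split="validation")
sample = ds.shuffle(seed=0).select(range(100))

def accuracy(answer_fn):
    correct = 0
    for ex in sample:
        choices = ex["choices"]["text"]
        labels = ex["choices"]["label"]
        pred = answer_fn(ex["question_stem"], choices)  # returns the chosen text
        correct += labels[choices.index(pred)] == ex["answerKey"]
    return correct / len(sample)

# Compare accuracy with steering off vs. on; the delta is the capability cost.
# baseline = accuracy(answer_unsteered)  # hypothetical model wrappers
# steered  = accuracy(answer_steered)
# print(f"capability change: {steered - baseline:+.1%}")
```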
3.2 People are working on backend support for new large language models, e.g. LLaMA-2. If you're an ML engineer experienced in that kind of work, maybe you could help out!
4. Heuristics and benchmarks

Use whatever approaches we have for finding features and vectors, stuff them into their own dataset, and chuck them at other Hugging Face datasets. See what happens!

4.1 Akin to the “wedding” direction, we can form a dataset of other single-token additions (hand-selected or generated) that crosses many semantic categories, e.g. nouns, themes, verbs, punctuation, etc., to try other ideas on in bulk.
Heuristics and benchmarks
4.2
Toss all human-generated steering vector ideas into a csv and call it a day
4.3 Beyond counting “wedding words”, can we form more principled heuristics for measuring both steering efficacy and magnitude? This becomes more useful once we have other benchmarks and bulk tests to perform.
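One candidate pair of heuristics for 4.3: measure efficacy as the steered shift in total log-probability of topic tokens, and damage as the change in loss on neutral text. A sketch with a random stand-in direction, assuming each topic word is a single token in GPT-2's vocabulary:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_pre"

# Topic tokens (note the leading spaces, which matter for GPT-2's tokenizer).
topic_ids = [model.to_single_token(w) for w in [" wedding", " bride", " married"]]
neutral = model.to_tokens("The committee will meet on Thursday to review the budget.")
probe = model.to_tokens("I went up to my friend and said")

direction = torch.randn(model.cfg.d_model)  # stand-in for a real steering vector

def hook(resid, hook):
    resid += 6.0 * direction.to(resid.device)
    return resid

def topic_logprob(tokens, hooks):
    # Log total probability mass on topic tokens at the next position.
    logits = model.run_with_hooks(tokens, fwd_hooks=hooks)
    logprobs = logits[0, -1].log_softmax(-1)
    return logprobs[topic_ids].logsumexp(0).item()

efficacy = topic_logprob(probe, [(hook_name, hook)]) - topic_logprob(probe, [])
damage = (model.run_with_hooks(neutral, return_type="loss",
                               fwd_hooks=[(hook_name, hook)])
          - model(neutral, return_type="loss")).item()
print(f"topic shift: {efficacy:+.2f} nats, neutral-loss change: {damage:+.3f}")
```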
5. Composing steering vectors

If just one steering addition is this powerful, think of what we could do with two or more! Eventually we're aiming to localize additions to particular heads or sublayers, so keep this in mind when exploring this space.

5.1 Horizontally: apply different additions to multiple heads in the same layer, or to subsets of MLP layers.
5.2 Vertically: apply the same or different additions at multiple layers. These could be consecutive layers, or adjacency might not matter at all.
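Stacking additions vertically is mechanically easy in TransformerLens: register one hook per target layer. A sketch with random stand-in directions and arbitrary layer and coefficient choices:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# One (direction, coefficient) pair per target layer; random stand-ins here.
plan = {3: 4.0, 6: 4.0, 9: 4.0}
directions = {l: torch.randn(model.cfg.d_model) for l in plan}

def make_hook(l):
    def hook(resid, hook):
        resid += plan[l] * directions[l].to(resid.device)
        return resid
    return hook

fwd_hooks = [(f"blocks.{l}.hook_resid_pre", make_hook(l)) for l in plan]

with model.hooks(fwd_hooks=fwd_hooks):
    print(model.generate("The meaning of life is",
                         max_new_tokens=30, verbose=False))
```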
5.3 Take a circuit studied in the existing literature on GPT-2, or find a new one. Targeting the nodes in these circuits, can you learn anything more about them, and more generally about how activation additions interact with circuits?
5.4 Composition heuristic: fix an activation addition (or a small set of them). For each base prompt drawn from a Pile slice (or other distributional data), randomly add it to k heads or MLP outputs sampled across all layers. Track which nodes were affected and the corresponding output perplexity (or another metric) to get a robust sense of which components respond best and contribute most to effective steering.
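A rough sketch of 5.4 at the granularity of MLP outputs (heads would use the attn.hook_z analog), using the NeelNanda/pile-10k dataset as a convenient Pile slice, a random stand-in direction, and loss as the metric:

```python
import random
import torch
from datasets import load_dataset
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
texts = load_dataset("NeelNanda/pile-10k", split="train[:20]")["text"]

direction = torch.randn(model.cfg.d_model)  # fixed activation addition (stand-in)
scores = {}  # layer -> list of loss deltas when that layer's MLP was perturbed

def hook(value, hook):
    return value + 4.0 * direction.to(value.device)

for text in texts:
    tokens = model.to_tokens(text)[:, :128]
    base_loss = model(tokens, return_type="loss").item()
    # Sample k MLP outputs to perturb for this base prompt.
    nodes = random.sample(range(model.cfg.n_layers), k=3)
    fwd_hooks = [(f"blocks.{l}.hook_mlp_out", hook) for l in nodes]
    loss = model.run_with_hooks(tokens, return_type="loss",
                                fwd_hooks=fwd_hooks).item()
    for l in nodes:
        scores.setdefault(l, []).append(loss - base_loss)

for l, deltas in sorted(scores.items()):
    print(f"mlp_out layer {l}: mean loss delta {sum(deltas)/len(deltas):+.3f}")
```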
6. Interpretability

Activation additions seem like a good complement to activation patching: we have an entirely new dimension to explore by varying a highly customizable bias term. Can we do anything useful with this that aids mechanistic interpretability?

6.1 Find an activation addition that works well (e.g. " weddings"). Figure out which heads are responsible for most of the effect. Does this lead to a "wedding-topic" circuit?
6.2 Does the effectiveness of activation additions depend on which activation function the model uses? This might be hard to measure, given that the effect of steering vectors is more easily observed on larger models with more understandable outputs.
6.3 TinyStories provides small 1-layer and 2-layer language models, trained on children's stories, that generate legible English, so it would be an interesting place to test efficacy. Do activation additions work on TinyStories? Can we thoroughly understand the impact of activation additions on the 1L or 2L model and distinguish the kinds of processing that happen at each layer?
6.4 Can you “break” any previously discovered interpretability results with activation additions? Is it a real break, in the sense that a previous conclusion is no longer valid, or does the steering modify the underlying computation so much that the previous conclusion simply no longer applies?
6.5 Can activation additions further stress-test mechanistic interpretability techniques, e.g. causal scrubbing?
6.6 (Thomas Kwa's idea) What's the mechanism by which adding a steering vector with too large a coefficient breaks the model?
6.7 Adding steering vectors at different layers surely means intervening at different "stages of processing". What do the model's internal concepts look like at different stages?
6.8 What do steering vectors, especially multiple steering vectors, tell us about how the model combines concepts?