Think of target behaviors you think would be cool or fun to observe
1.1 Style switching
Can you make the model suddenly switch to a different style of speech, e.g. talking like a pirate, Shakespeare, a celebrity, a stereotyped character, or in broken English?
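A minimal sketch of the intervention itself, in the style of the post's ActAdd recipe using TransformerLens. The layer, coefficient, and contrast pair are illustrative assumptions, not tuned values:

```python
# Hypothetical ActAdd-style sketch: steer gpt2-xl toward pirate speech by
# adding the difference of two prompts' residual streams at one layer.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
LAYER, COEFF = 6, 5.0  # arbitrary choices to experiment with
hook_name = f"blocks.{LAYER}.hook_resid_pre"

def resid_at_layer(prompt: str) -> torch.Tensor:
    """Residual stream entering LAYER, shape [pos, d_model]."""
    _, cache = model.run_with_cache(prompt)
    return cache[hook_name][0].detach()

# Contrast pair; truncated to the shorter prompt for simplicity
# (the post pads the shorter prompt instead).
plus = resid_at_layer("I talk like a pirate")
minus = resid_at_layer("I talk normally")
n = min(plus.shape[0], minus.shape[0])
steering = COEFF * (plus[:n] - minus[:n])

def add_steering(resid, hook):
    # Only modify full prompt passes; generate() later feeds single new
    # tokens through this hook once the KV cache is warm.
    if resid.shape[1] >= n:
        resid[:, :n] += steering
    return resid

with model.hooks(fwd_hooks=[(hook_name, add_steering)]):
    print(model.generate("The weather today is", max_new_tokens=40))
```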
1.2 Destroying inference
Can you completely destroy some outputs with a specific addition? Can you reliably destroy inference across various inputs with the same bias?
1.3 Applications to alignment
Do some additions reliably steer across many different inputs? Can we robustly avoid harmful behavior? How can we use this for alignment?
1.4 Prompt engineering
Throw every prompt-engineering trick you've ever heard of at this, both as a base input and as a bias term. Does anything work the same? Does anything become useless? Can you render a model more robust to prompt engineering with the right addition?
1.5 Modifying the coefficient
What happens with different coefficients on the biases? Why are some large coefficients "safe" while others damage the output? Are small coefficients useful?
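A sketch of a coefficient sweep; the grid, layer, and prompts are arbitrary assumptions to vary:

```python
# Hypothetical coefficient sweep: at what magnitude does steering flip
# from "safe" to output-destroying?
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
hook_name = "blocks.6.hook_resid_pre"

def contrast_direction(plus: str, minus: str) -> torch.Tensor:
    _, cp = model.run_with_cache(plus)
    _, cm = model.run_with_cache(minus)
    n = min(cp[hook_name].shape[1], cm[hook_name].shape[1])
    return (cp[hook_name][0, :n] - cm[hook_name][0, :n]).detach()

direction = contrast_direction("I talk about weddings constantly",
                               "I talk about the weather")

for coeff in [0.5, 1, 2, 4, 8, 32, 128]:
    def steer_hook(resid, hook, vec=coeff * direction):
        if resid.shape[1] >= vec.shape[0]:
            resid[:, : vec.shape[0]] += vec
        return resid
    with model.hooks(fwd_hooks=[(hook_name, steer_hook)]):
        out = model.generate("I went up to my friend and said",
                             max_new_tokens=30, verbose=False)
    print(f"coeff={coeff}: {out}")
```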
1.6 Patterns and trends
After trying a bunch of different approaches, are there any overarching trends in what seems to work and what doesn't? Are certain layers systematically better for steering than others? Does it depend on the type of activation addition being applied?
Techniques in discovering features and steering vectors
2.1 Averaging prompts
Try creating a single direction by averaging many prompts that collectively convey the steering goal. Does this ever work? What about a weighted average?
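A sketch of the averaging idea, assuming last-token residuals at an arbitrary layer as the summary of each prompt:

```python
# Hypothetical prompt-averaging sketch: one direction from many prompts
# that collectively convey the steering goal.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_pre"
prompts = [
    "I love talking about weddings",
    "The bride and groom exchanged vows",
    "The reception had a huge wedding cake",
]

acts = []
for p in prompts:
    _, cache = model.run_with_cache(p)
    acts.append(cache[hook_name][0, -1].detach())  # last-token residual
mean_direction = torch.stack(acts).mean(0)

# The weighted variant is the same idea with nonuniform weights:
weights = torch.tensor([0.5, 0.3, 0.2])
weighted_direction = (torch.stack(acts) * weights[:, None]).sum(0)
```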
2.2 Many-shot prompting
Once we have many example prompts that seem to work well, use many-shot prompting with a larger model to generate more.
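A sketch of the bootstrapping step; the seed prompts are placeholders, and the larger generator model is left abstract:

```python
# Hypothetical many-shot bootstrap: format known-good steering prompts as
# a list-completion prompt for a larger model to extend.
known_good = [
    "I talk about weddings constantly",
    "Weddings are my favorite topic",
    "I love planning weddings",
]
many_shot = ("Prompts that make a language model talk about weddings:\n"
             + "\n".join(f"- {p}" for p in known_good)
             + "\n-")
# Feed `many_shot` to a larger model and harvest each generated "- ..."
# line as a new candidate steering prompt to test.
print(many_shot)
```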
2.3 Using PCA
Try decomposing the residual stream activations over a batch of inputs (e.g. with PCA). Used as activation-addition directions, do the principal directions seem to capture something meaningful? Try varying the corpus to vary the principal directions.
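A sketch using scikit-learn's PCA over last-token residuals; the corpus is a placeholder, and a real run wants far more inputs than components:

```python
# Hypothetical PCA decomposition of residual-stream activations.
import numpy as np
from sklearn.decomposition import PCA
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_pre"
corpus = ["The wedding was beautiful", "Stocks fell sharply today",
          "The recipe calls for two eggs", "He sailed across the bay",
          "The senator gave a speech", "Rain is expected tomorrow"]

rows = []
for text in corpus:
    _, cache = model.run_with_cache(text)
    rows.append(cache[hook_name][0, -1].detach().cpu().numpy())

pca = PCA(n_components=2).fit(np.stack(rows))
directions = pca.components_  # [2, d_model]: candidate addition directions
print(pca.explained_variance_ratio_)
# Swap in a different corpus to vary which principal directions appear.
```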
2.4 Inference-Time Intervention paper
The ITI vector-generation approach averages over residual streams right before truthful vs. untruthful generations. Can this be generalized to other harmful behaviors?
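A sketch of the mean-difference step. Note the hedges: ITI itself learns per-head probes and intervenes per attention head, while this collapses everything to one residual-stream vector, and the labeled pairs are toy placeholders for a dataset like TruthfulQA:

```python
# Hypothetical ITI-style direction: mean residual before truthful
# generations minus mean residual before untruthful ones.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.8.hook_resid_post"

truthful = ["The capital of France is Paris.",
            "Water freezes at 0 degrees Celsius."]
untruthful = ["The capital of France is London.",
              "Water freezes at 50 degrees Celsius."]

def mean_last_token(texts):
    acts = []
    for t in texts:
        _, cache = model.run_with_cache(t)
        acts.append(cache[hook_name][0, -1].detach())
    return torch.stack(acts).mean(0)

iti_direction = mean_last_token(truthful) - mean_last_token(untruthful)
# For other harmful behaviors, swap in harmful/harmless text pairs.
```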
2.5 Dictionary learning
Autoencoders find feature directions and small subspaces at inference time. This seems unbelievably high-value; definitely more people should be looking at this work! Apply the vectors found this way: are they usually effective?
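A sketch of the dictionary-learning setup: a small sparse autoencoder over cached residuals, with the sizes, the L1 penalty, and the random stand-in data all placeholder assumptions:

```python
# Hypothetical sparse autoencoder: decoder rows become candidate feature
# directions for steering.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f

d_model, d_hidden = 768, 768 * 8
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, d_model)  # stand-in for cached residual streams

for step in range(200):
    x = acts[torch.randint(0, len(acts), (256,))]
    x_hat, f = sae(x)
    # Reconstruction loss plus an L1 sparsity penalty on the features.
    loss = (x_hat - x).pow(2).mean() + 1e-3 * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

feature_directions = sae.dec.weight.T.detach()  # [d_hidden, d_model]
```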
Tracking the impact on capabilities
3.1 Pull data from Hugging Face, sample 100 questions from various capabilities datasets, and render them in the Streamlit app, displaying the performance and the change in performance.
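A sketch of the scoring half (the Streamlit rendering is omitted); TruthfulQA's multiple-choice config is just one example dataset, and choices are scored by mean per-token log-probability:

```python
# Hypothetical capabilities check: multiple-choice accuracy with and
# without a steering hook.
from datasets import load_dataset
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
ds = load_dataset("truthful_qa", "multiple_choice", split="validation")
sample = ds.shuffle(seed=0).select(range(100))

def mc_accuracy(fwd_hooks=()):
    correct = 0
    for row in sample:
        choices = row["mc1_targets"]["choices"]
        answer = row["mc1_targets"]["labels"].index(1)
        scores = []
        with model.hooks(fwd_hooks=list(fwd_hooks)):
            for c in choices:
                # Lower loss = higher average log-prob of question + answer.
                scores.append(-model(row["question"] + " " + c,
                                     return_type="loss").item())
        correct += int(max(range(len(scores)),
                           key=scores.__getitem__) == answer)
    return correct / len(sample)

print("baseline accuracy:", mc_accuracy())
# print("steered accuracy:", mc_accuracy([(hook_name, steering_hook)]))
```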
3.2 People are working on backend support for new large language models, e.g. LLaMA-2! If you're an ML engineer experienced in that kind of work, maybe you could help out?
Heuristics and benchmarks
Use whatever approaches we have for finding features and vectors, collect them into their own dataset, and throw them at other Hugging Face datasets. See what happens!
4.1 Akin to the “wedding” direction, we can form a dataset of other single-token additions (hand-selected or generated) that crosses many semantic categories (nouns, themes, verbs, punctuation, etc.) to try other ideas on in bulk.
4.2 Toss all human-generated steering-vector ideas into a CSV and call it a day.
4.3 Beyond counting “wedding words”, can we form more principled heuristics for measuring both steering efficacy and steering magnitude? This becomes more useful once we have other benchmarks and bulk tests to perform.
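A sketch of two such heuristics, with the word list, neutral text, and metric choices all placeholder assumptions: efficacy as topic-word counting, and magnitude as the loss penalty the hook imposes on unrelated text:

```python
# Hypothetical steering heuristics: efficacy vs. collateral damage.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
TOPIC_WORDS = {"wedding", "weddings", "bride", "groom", "marriage", "marry"}

def efficacy(completion: str) -> int:
    """Count of topic words: the 'wedding words' baseline heuristic."""
    return sum(w.strip(".,!?").lower() in TOPIC_WORDS
               for w in completion.split())

def damage(fwd_hooks, neutral="The train left the station on time.") -> float:
    """Extra loss (nats/token) the steering hook causes on neutral text."""
    base = model(neutral, return_type="loss").item()
    with model.hooks(fwd_hooks=fwd_hooks):
        steered = model(neutral, return_type="loss").item()
    return steered - base
```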
Composing steering vectors
If just one steering addition is so powerful, think of what we could do with two or more! Eventually we're aiming to localize additions to certain heads/sublayers, so keep this in mind when exploring this space.
5.1 Horizontally: apply different additions to multiple heads in the same layer, or to subsets of MLP layers.
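A sketch of the horizontal case via the per-head attention-output hook; the layer, head indices, and random stand-in vectors are assumptions:

```python
# Hypothetical per-head additions within a single layer.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 6
hook_name = f"blocks.{LAYER}.attn.hook_z"  # [batch, pos, n_heads, d_head]
head_vectors = {
    3: torch.randn(model.cfg.d_head),  # stand-ins for learned directions
    7: torch.randn(model.cfg.d_head),
}

def per_head_hook(z, hook):
    # Add each head's vector to that head's output at every position.
    for head, vec in head_vectors.items():
        z[:, :, head, :] += vec
    return z

with model.hooks(fwd_hooks=[(hook_name, per_head_hook)]):
    logits = model("The party was", return_type="logits")
```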
5.2 Vertically: apply the same or different additions at multiple layers. These could be consecutive layers, or consecutiveness might not be a consideration at all.
5.3 Take a circuit studied in the existing literature on GPT-2, or find another one. Targeting the nodes in these circuits, can you learn anything more about them, and more generally about how activation additions interact with circuits?
5.4 Composition heuristic: with a fixed activation addition (or a small set of them), randomly add it to k heads or MLP outputs sampled across all layers for each base prompt from a Pile slice (or other distributional data). Record which nodes were affected and the corresponding output perplexity (or another metric) to get a robust sense of which components respond better and contribute more to effective steering.
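A sketch of the heuristic, with toy prompts standing in for a Pile slice, a random vector for the fixed addition, and loss in place of perplexity:

```python
# Hypothetical composition heuristic: per prompt, hit k random heads with
# a fixed addition and record the resulting loss per head.
import random
from collections import defaultdict
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
k = 5
vec = torch.randn(model.cfg.d_head)
all_heads = [(l, h) for l in range(model.cfg.n_layers)
             for h in range(model.cfg.n_heads)]
losses = defaultdict(list)
prompts = ["I went to the store", "The meeting starts at noon"]

for prompt in prompts:
    chosen = random.sample(all_heads, k)
    hooks = []
    for layer, head in chosen:
        def hit_head(z, hook, h=head):
            z[:, :, h, :] += vec
            return z
        hooks.append((f"blocks.{layer}.attn.hook_z", hit_head))
    with model.hooks(fwd_hooks=hooks):
        loss = model(prompt, return_type="loss").item()
    for node in chosen:
        losses[node].append(loss)

# Heads with low average loss tolerate (or exploit) the addition best.
```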
Aiding mechanistic interpretability
Activation additions seem like a good complement to activation patching: we have an entirely new dimension to explore by varying a highly customizable bias term. Can we do anything useful with this that aids mechanistic interpretability?
6.1 Find an activation addition that works well (e.g. " weddings"). Figure out which heads are responsible for most of the effect. Does this lead to a "wedding-topic" circuit?
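A sketch of one attribution approach, assuming a working steering hook already exists (left as an empty placeholder here): knock out one head at a time and watch the topic logit:

```python
# Hypothetical head attribution: which heads carry the steering effect?
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "I went up to my friend and said"
topic_id = model.to_single_token(" wedding")
steering_hooks = []  # plug the working activation-addition hook in here

def topic_logit(extra_hooks):
    with model.hooks(fwd_hooks=steering_hooks + extra_hooks):
        return model(prompt, return_type="logits")[0, -1, topic_id].item()

base = topic_logit([])
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        def zero_head(z, hook, h=head):
            z[:, :, h, :] = 0
            return z
        drop = base - topic_logit([(f"blocks.{layer}.attn.hook_z",
                                    zero_head)])
        if drop > 0.5:  # arbitrary threshold for "responsible" heads
            print(f"L{layer}H{head} drop {drop:.2f}")
```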
6.2 Does the effectiveness of activation additions depend on which activation function is used? This might be hard to measure, given that the effect of steering vectors is more easily observed on larger models with more understandable outputs.
6.3 TinyStories has small 1L and 2L language models trained on children's stories that generate legible English, which makes it an interesting place to test efficacy. Do activation additions work on TinyStories? Can we thoroughly understand the impact of activation additions on the 1L or 2L model, and distinguish the kinds of processing that happen at each layer?
6.4 Can you “break” any previously discovered results in interpretability with activation additions? Is this a real break, in the sense that a previous conclusion is no longer valid, or does the steering modify the underlying computation so significantly that the previous conclusion simply no longer applies?
6.5 Can activation additions further stress-test mechanistic interpretability techniques, e.g. causal scrubbing?
Thomas Kwa’s ideas
6.6 What's the mechanism by which adding a steering vector with too large a coefficient breaks the model?
6.7 Adding steering vectors at different layers surely means you are intervening at different "stages of processing". What do the model's internal concepts look like at different stages?
6.8 What do steering vectors, especially multiple steering vectors, tell us about how the model combines concepts?