Tooling and Automation

Please see Neel's post for a more detailed description of these problems.
Breaking current techniques
6.1
Try to find concrete edge cases where a technique breaks. Start with a misleading example in a real model, or train a toy model that contains one.
Breaking current techniques
6.7
Find edge cases where ablations break. (Start with backup name movers in the IOI circuit, where we know zero ablation breaks.)
ROME activation patching
6.15
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). How do results change when you do single layers?
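A minimal sketch of 6.15's single-layer variant, assuming TransformerLens; gpt2-small, the prompt pair, and the answer tokens are placeholder assumptions (ROME uses factual-recall prompts, so check that your clean and corrupted prompts tokenize to the same length before patching):
```python
# Patch a single MLP layer's output from the clean run into the corrupted run
# and measure the logit difference, one layer at a time.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")  # placeholder model
clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupted_prompt = "When John and Mary went to the store, Mary gave a drink to"
answer_id = model.to_single_token(" Mary")
wrong_id = model.to_single_token(" John")

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)
assert clean_tokens.shape == corrupted_tokens.shape  # patching assumes aligned tokens

_, clean_cache = model.run_with_cache(clean_tokens)

def logit_diff(logits):
    final = logits[0, -1]
    return (final[answer_id] - final[wrong_id]).item()

def patch_mlp_out(mlp_out, hook):
    # Overwrite this layer's MLP output with the clean run's activations.
    mlp_out[:] = clean_cache[hook.name]
    return mlp_out

for layer in range(model.cfg.n_layers):
    patched_logits = model.run_with_hooks(
        corrupted_tokens,
        fwd_hooks=[(utils.get_act_name("mlp_out", layer), patch_mlp_out)],
    )
    print(f"layer {layer}: logit diff {logit_diff(patched_logits):+.3f}")
```
Comparing these per-layer logit differences against the 10-layer window results from the paper is one way to see what the wide window hides.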
ROME activation patching
6.16
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). Can you get anywhere when patching specific neurons?
Automatically find circuits
6.21
Automate ways to find translation heads. (Bonus: Add to TransformerLens!)
Refine max activating dataset examples
6.36
Using 6.28: Find the minimal example that activates a neuron by truncating the text. How often does this work?
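A minimal sketch of the truncation idea, assuming TransformerLens and gpt2-small; the layer, neuron index, and text are placeholders, and measuring the activation at the final token is a simplifying assumption (you may care about whichever position originally fired):
```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
layer, neuron = 5, 1234                                # placeholder neuron
text = "a max activating dataset example goes here"    # placeholder text

def final_token_activation(tokens):
    # Post-nonlinearity MLP activation of the chosen neuron at the last position.
    _, cache = model.run_with_cache(tokens)
    return cache[utils.get_act_name("post", layer)][0, -1, neuron].item()

tokens = model.to_tokens(text)
full_act = final_token_activation(tokens)
n = tokens.shape[1]
for keep in range(n - 1, 0, -1):  # number of non-BOS tokens kept from the end
    truncated = torch.cat([tokens[:, :1], tokens[:, n - keep:]], dim=1)  # keep BOS
    act = final_token_activation(truncated)
    print(f"kept last {keep:3d} tokens: {act:.3f} (full: {full_act:.3f})")
```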
Refine max activating dataset examples
6.37
Using 6.28: Can you replicate the results of the interpretability illusion for Neel's toy models by finding neurons that seem monosemantic on Python code or C4 (web text) alone, but are polysemantic when the datasets are combined?
Breaking current techniques
6.2
Break direct logit attribution - start by looking at GPT-Neo Small where the logit lens (precursor to direct logit attribution) seems to work badly, but works well if you include the final layer and the unembed.
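A minimal sketch of per-component direct logit attribution to poke at, assuming TransformerLens; gpt2-small, the prompt, and the answer token are placeholders (swap in GPT-Neo Small, where the problem suggests the technique behaves badly):
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")  # swap in GPT-Neo Small here
prompt = "The Eiffel Tower is in the city of"
answer = " Paris"

tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)
answer_id = model.to_single_token(answer)

# Decompose the final residual stream into per-component contributions, apply the
# final LayerNorm scale, then project onto the answer token's unembedding direction.
resid_stack, labels = cache.decompose_resid(layer=-1, return_labels=True)
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1)
attributions = resid_stack[:, 0, -1, :] @ model.W_U[:, answer_id]

for label, attr in zip(labels, attributions.tolist()):
    print(f"{label}: {attr:+.3f}")
```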
Breaking current techniques
6.4
Find edge cases where linearising LayerNorm breaks. See some work by Eric Winsor at Conjecture.
Breaking current techniques
6.5
Find edge cases where activation patching breaks. (It should break when you patch one variable but the model's behaviour depends on multiple variables jointly.)
Breaking current techniques
6.8
Can you find places where one ablation (zero, mean, random) breaks but the others don't?
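A minimal sketch comparing the three ablations on a single head, assuming TransformerLens; the model, prompt, and head are placeholders, the mean is taken over positions of this one prompt as a crude stand-in for a dataset mean, and the random ablation is Gaussian noise scaled to the clean std (one simple choice):
```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
text = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(text)
layer, head = 9, 9  # placeholder head - pick one from a circuit you understand

_, cache = model.run_with_cache(tokens)
z_name = utils.get_act_name("z", layer)
clean_z = cache[z_name]

def make_ablation(kind):
    def hook(z, hook):
        if kind == "zero":
            z[:, :, head, :] = 0.0
        elif kind == "mean":
            # Mean over positions of this prompt (a proper version would average over a dataset).
            z[:, :, head, :] = clean_z[:, :, head, :].mean(dim=1, keepdim=True)
        elif kind == "random":
            z[:, :, head, :] = torch.randn_like(z[:, :, head, :]) * clean_z[:, :, head, :].std()
        return z
    return hook

baseline_loss = model(tokens, return_type="loss").item()
for kind in ["zero", "mean", "random"]:
    loss = model.run_with_hooks(
        tokens, return_type="loss", fwd_hooks=[(z_name, make_ablation(kind))]
    ).item()
    print(f"{kind} ablation: loss {loss:.4f} (baseline {baseline_loss:.4f})")
```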
Breaking current techniques
6.9
Find edge cases where composition scores break. (They don't work well for the IOI circuit)
Breaking current techniques
6.10
Find edge cases where eigenvalue copying scores break.
6.12
Try looking for composition on a specific input. Decompose the residual stream into the sum of outputs of previous heads, then decompose query, key, value into sums of terms from each previous head. Are any larger than the others / matter more if you ablate them / etc?
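A minimal sketch for the query part of this, assuming TransformerLens; gpt2-small, the prompt, the downstream head, and the position are placeholders, and the key and value sides can be handled the same way with W_K and W_V:
```python
# How much does each earlier head's output contribute to one later head's
# query vector on a specific prompt?
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
_, cache = model.run_with_cache(tokens)

later_layer, later_head = 9, 9  # placeholder downstream head
pos = -1                        # query position of interest

# Per-head outputs in the residual stream from all layers before later_layer.
head_results, labels = cache.stack_head_results(layer=later_layer, return_labels=True)
head_results = cache.apply_ln_to_stack(head_results, layer=later_layer)

W_Q = model.W_Q[later_layer, later_head]            # [d_model, d_head]
query_contribs = head_results[:, 0, pos, :] @ W_Q   # [n_components, d_head]
norms = query_contribs.norm(dim=-1)

for label, n in sorted(zip(labels, norms.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{label}: {n:.3f}")
```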
6.14
Compare causal tracing to activation patching. Do they give the same outputs? Can you find situations where one breaks and the other doesn't? (Try IOI task or factual recall task)
ROME activation patching
6.17
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). Can you get anywhere when patching some set of neurons? (E.g., the neurons that activate the most within the 10 layers?)
Automatically find circuits
6.22
Automate ways to find few shot learning heads. (Bonus: Add to TransformerLens!)
Automatically find circuits
6.23
Can you find an automated way to detect pointer arithmetic based induction heads vs. classic induction heads?
Automatically find circuits
6.24
Can you find an automated way to detect the heads used in the IOI Circuit? (S-inhibition, name mover, negative name mover, backup name mover)
Automatically find circuits
6.25
Can you automate detection of the heads used in factual recall to move information about the fact to the final token? (Try activation patching)
Automatically find circuits
6.26
(Infrastructure) Combine some of the head detectors from 6.18-6.25 to make a "wiki" for a range of models, with information and scores for each head for how it falls into different categories. MVP: a pandas DataFrame with a row for each head and a column for each metric (see the sketch below).
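A minimal sketch of that MVP, assuming TransformerLens and pandas; the detector functions are placeholders to be replaced with real scorers from 6.18-6.25:
```python
import pandas as pd
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

def induction_score(model, layer, head):
    return 0.0  # placeholder: plug in a real detector here

def previous_token_score(model, layer, head):
    return 0.0  # placeholder

detectors = {
    "induction": induction_score,
    "previous_token": previous_token_score,
}

rows = []
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        row = {"layer": layer, "head": head}
        for name, fn in detectors.items():
            row[name] = fn(model, layer, head)
        rows.append(row)

head_wiki = pd.DataFrame(rows).set_index(["layer", "head"])
print(head_wiki.head())
```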
Refine max activating dataset examples
6.30
Using 6.28: Corrupt different token embeddings in a sequence to see which matter.
Refine max activating dataset examples
6.31
Using 6.28: Compare to randomly chosen directions in neuron activation space to see how clustered/monosemantic things seem.
Refine max activating dataset examples
6.32
Using 6.28: Validate these by comparing to direct effect of neuron on the logits, or output vocab logits most boosted by that neuron.
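A minimal sketch of the direct-effect check, assuming TransformerLens; gpt2-small and the (layer, neuron) pair are placeholders, and the final LayerNorm scale is ignored for simplicity:
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
layer, neuron = 5, 1234  # placeholder neuron

# The neuron's output direction in the residual stream, projected through the
# unembedding (ignoring the final LayerNorm scale).
logit_effect = model.W_out[layer, neuron] @ model.W_U  # [d_vocab]
top = torch.topk(logit_effect, 10)

for val, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.tokenizer.decode([idx])!r}: {val:+.3f}")
```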
Refine max activating dataset examples
6.33
Using 6.28: Use a model like GPT-3 to find similar text to an existing example and see if they also activate the neuron. Bonus: Use them to replace specific tokens.
Refine max activating dataset examples
6.34
Using 6.28: Look at dataset examples at different quantiles for neuron activations (25%, 50%, 75%, 90%, 95%). Does that change anything?
Refine max activating dataset examples
6.38
Using 6.28: In SoLU models, compare max activating results for pre-SoLU, post-SoLU, and post LayerNorm activations. ('pre', 'mid', 'post' in TransformerLens). How consistent are they? Does one seem more principled?
Interpreting models with LLMs
6.39
Can GPT-3 figure out trends in max activating examples for a neuron?
Interpreting models with LLMs
6.40
Can you use GPT-3 to generate counterfactual prompts with lined-up tokens to do activation patching on novel problems? (E.g., "John gave a bottle of milk to -> Mary" vs. "Mary gave a bottle of milk to -> John")
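A minimal sanity check before using any generated pair, assuming TransformerLens; the prompts here are the example above and would come from the LLM in practice:
```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
clean = "John gave a bottle of milk to"
corrupted = "Mary gave a bottle of milk to"  # LLM-generated counterfactual

clean_toks = model.to_str_tokens(clean)
corr_toks = model.to_str_tokens(corrupted)
# Patching needs token-for-token alignment, differing only where intended.
assert len(clean_toks) == len(corr_toks), "prompts must tokenize to the same length"
diffs = [(i, a, b) for i, (a, b) in enumerate(zip(clean_toks, corr_toks)) if a != b]
print(diffs)
```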
Apply techniques from non-mechanistic interpretability
6.42
How well does feature attribution work on circuits we understand?
6.48
Resolve some of the open issues/feature requests for TransformerLens.
Taking the "diff" of two models
6.50
Using 6.49, run it on a bunch of text and look at the biggest per-token log prob difference.
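A minimal sketch of the per-token comparison, assuming TransformerLens; the two model names and the text are placeholders, and it assumes both models share a tokenizer:
```python
import torch
from transformer_lens import HookedTransformer

model_a = HookedTransformer.from_pretrained("gpt2-small")   # placeholder pair of models
model_b = HookedTransformer.from_pretrained("gpt2-medium")
text = "Some text to compare the two models on."            # placeholder text
tokens = model_a.to_tokens(text)

def per_token_log_probs(model, tokens):
    # Log prob the model assigns to the actual next token at every position.
    with torch.no_grad():
        log_probs = model(tokens).log_softmax(dim=-1)
    return log_probs[0, :-1].gather(-1, tokens[0, 1:, None]).squeeze(-1)

diff = per_token_log_probs(model_a, tokens) - per_token_log_probs(model_b, tokens)
str_tokens = model_a.to_str_tokens(text)[1:]  # drop BOS; aligns with predicted tokens
for tok, d in sorted(zip(str_tokens, diff.tolist()), key=lambda x: -abs(x[1]))[:10]:
    print(f"{tok!r}: {d:+.3f}")
```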
Taking the "diff" of two models
6.51
Using 6.49, run them on various benchmarks and compare performance.
Taking the "diff" of two models
6.52
Using 6.49, try "benchmarks" such as the algorithmic tasks (IOI, acronyms, etc.) from Circuits In the Wild.
Taking the "diff" of two models
6.53
Using 6.49, try qualitative exploration, like just generating text from both models and looking for ideas.
Taking the "diff" of two models
6.54
Build tooling to take the diff of two models with the same internal structure. Includes 6.49 but also lets you compare model internals!
Taking the "diff" of two models
6.55
Using 6.54, look for the largest difference in weights.
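A minimal sketch for ranking weight changes, assuming TransformerLens; both model names are placeholders (as written it loads the same checkpoint twice, so swap the second one for e.g. a fine-tuned version of the first):
```python
import torch
from transformer_lens import HookedTransformer

model_a = HookedTransformer.from_pretrained("gpt2-small")
model_b = HookedTransformer.from_pretrained("gpt2-small")  # swap in the other model here

diffs = []
state_a, state_b = model_a.state_dict(), model_b.state_dict()
for name, param_a in state_a.items():
    param_b = state_b[name]
    # Relative Frobenius-norm change, so large and small tensors are comparable.
    rel_change = (param_a - param_b).norm() / (param_a.norm() + 1e-8)
    diffs.append((name, rel_change.item()))

for name, change in sorted(diffs, key=lambda x: -x[1])[:15]:
    print(f"{name}: {change:.4f}")
```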
Taking the "diff" of two models
6.56
Using 6.54, run them on a bunch of text and look for largest difference in activations.
Taking the "diff" of two models
6.57
Using 6.54, look at the direct logit attribution of layers and heads on various texts, and look for the biggest differences.
Taking the "diff" of two models
6.58
Using 6.54, do activation patching on a piece of text where one model does much better than the other - are some parts key to improved performance?
Breaking current techniques
6.3
Can you fix direct logit attribution in GPT-Neo Small, e.g., by finding a linear approximation to the final layer by taking gradients? (Eleuther's tuned lens in #interp-across-depth would be a good place to start.)
Breaking current techniques
6.6
Find edge cases where causal scrubbing breaks.
Breaking current techniques
6.11
Automate ways to identify heads that compose. Start with IOI circuit and the composition scores in A Mathematical Framework.
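A minimal sketch of a K-composition score in the TransformerLens weight convention (matching A Mathematical Framework up to transposes); the head pair is a placeholder, the score ignores LayerNorm, and in practice you would compare against a random-matrix baseline:
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

def k_composition(prev_layer, prev_head, layer, head):
    # Earlier head's OV circuit feeding the later head's key input.
    W_OV = model.W_V[prev_layer, prev_head] @ model.W_O[prev_layer, prev_head]  # [d_model, d_model]
    W_QK = model.W_Q[layer, head] @ model.W_K[layer, head].T                    # [d_model, d_model]
    return ((W_QK @ W_OV.T).norm() / (W_QK.norm() * W_OV.norm())).item()

# Placeholder pair: a candidate previous-token head composing with a candidate induction head.
print(k_composition(prev_layer=4, prev_head=11, layer=5, head=5))
```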
6.13
Can you automate direct path patching as used in the IOI paper?
Automatically find circuits
6.27
Can you automate the detection of something in neuron interpretability? E.g., trigram neurons.
Automatically find circuits
6.28
Find a good equivalent of max activating dataset examples for attention heads. Validate on induction circuits, then IOI. See the post for ideas.
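A minimal sketch of one candidate metric (the norm of the head's output at each position), assuming TransformerLens; the model, head, and texts are placeholders, and choosing the right metric is the actual open problem:
```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
layer, head = 5, 5  # placeholder head
texts = ["placeholder text one", "placeholder text two"]  # swap in a real dataset

records = []
for text in texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    z = cache[utils.get_act_name("z", layer)][0, :, head, :]  # [pos, d_head]
    score, pos = z.norm(dim=-1).max(dim=0)
    records.append((score.item(), text, pos.item()))

for score, text, pos in sorted(records, reverse=True)[:10]:
    print(f"{score:.3f} at position {pos}: {text!r}")
```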
Refine max activating dataset examples
6.29
Refine the max activating dataset examples technique for neuron interpretability to find minimal or diverse examples.
Refine max activating dataset examples
6.35
Using 6.28: (Infrastructure) Add any of 6.29-6.34 to Neuroscope. Email Neel (neelnanda27@gmail.com) for codebase access.
Apply techniques from non-mechanistic interpretability
6.43
Can you use probing to get evidence for or against predictions in Toy Models of Superposition?
Apply techniques from non-mechanistic interpretability
6.44
Pick anything interesting from Rauker et al. and try to apply the techniques to circuits we understand.
6.46
Take existing circuits and explore quantitative ways to verify that each is a true circuit (or to disprove it!). Try causal scrubbing to start.
6.47
Build on Arthur Conmy's work to automatically find circuits via recursive path patching.
Taking the "diff" of two models
6.49
Build tooling to take the "diff" of two models, treating them as black boxes mapping inputs to outputs, so that it works for models with different internal structure.
6.59
We understand how attention is calculated for a head using the QK matrix. This doesn't work for rotary attention. Can you find a principled alternative?
Interpreting models with LLMs
6.41
Choose your own adventure - can you find a way to usefully use an LLM to interpret models?
Apply techniques from non-mechanistic interpretability
6.45
Wiles et al. give an automated set of techniques to analyse bugs in image classification models. Can you get any traction adapting this to language models?