Tooling and Automation

Please see Neel's post for a more detailed description of these problems.
Breaking current techniques
6.1
Try to find concrete edge cases where a technique breaks. Start with a misleading example in a real model, or train a toy model that contains one.
Breaking current techniques
6.7
Find edge cases where ablations break. (Start with backup name movers in the IOI circuit, where we know zero ablation breaks.)
ROME activation patching
6.15
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). How do results change when you do single layers?
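A minimal sketch of 6.15's single-layer variant, assuming TransformerLens; gpt2-small, the prompt pair, and the answer tokens are placeholder assumptions (ROME uses factual-recall prompts, so check that your clean and corrupted prompts tokenize to the same length before patching):
```python
# Patch a single MLP layer's output from the clean run into the corrupted run
# and measure the logit difference, one layer at a time.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")  # placeholder model
clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupted_prompt = "When John and Mary went to the store, Mary gave a drink to"
answer_id = model.to_single_token(" Mary")
wrong_id = model.to_single_token(" John")

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)
assert clean_tokens.shape == corrupted_tokens.shape  # patching assumes aligned tokens

_, clean_cache = model.run_with_cache(clean_tokens)

def logit_diff(logits):
    final = logits[0, -1]
    return (final[answer_id] - final[wrong_id]).item()

def patch_mlp_out(mlp_out, hook):
    # Overwrite this layer's MLP output with the clean run's activations.
    mlp_out[:] = clean_cache[hook.name]
    return mlp_out

for layer in range(model.cfg.n_layers):
    patched_logits = model.run_with_hooks(
        corrupted_tokens,
        fwd_hooks=[(utils.get_act_name("mlp_out", layer), patch_mlp_out)],
    )
    print(f"layer {layer}: logit diff {logit_diff(patched_logits):+.3f}")
```
Comparing these per-layer logit differences against the 10-layer window results from the paper is one way to see what the wide window hides.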
ROME activation patching
6.16
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). Can you get anywhere when patching specific neurons?
Automatically find circuits
6.21
Automate ways to find translation heads. (Bonus: Add to TransformerLens!)
Refine max activating dataset examples
6.36
Using 6.28: Find the minimal example that activates a neuron by truncating the text. How often does this work?
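A minimal sketch of the truncation idea, assuming TransformerLens and gpt2-small; the layer, neuron index, and text are placeholders, and measuring the activation at the final token is a simplifying assumption (you may care about whichever position originally fired):
```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
layer, neuron = 5, 1234                                # placeholder neuron
text = "a max activating dataset example goes here"    # placeholder text

def final_token_activation(tokens):
    # Post-nonlinearity MLP activation of the chosen neuron at the last position.
    _, cache = model.run_with_cache(tokens)
    return cache[utils.get_act_name("post", layer)][0, -1, neuron].item()

tokens = model.to_tokens(text)
full_act = final_token_activation(tokens)
n = tokens.shape[1]
for keep in range(n - 1, 0, -1):  # number of non-BOS tokens kept from the end
    truncated = torch.cat([tokens[:, :1], tokens[:, n - keep:]], dim=1)  # keep BOS
    act = final_token_activation(truncated)
    print(f"kept last {keep:3d} tokens: {act:.3f} (full: {full_act:.3f})")
```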
Refine max activating dataset examples
6.37
Using 6.28: Can you replicate the results of the interpretability illusion for Neel's toy models by finding neurons that seem monosemantic on Python code or C4 (web text) alone, but are polysemantic when the datasets are combined?
Breaking current techniques
6.2
Break direct logit attribution - start by looking at GPT-Neo Small where the logit lens (precursor to direct logit attribution) seems to work badly, but works well if you include the final layer and the unembed.
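A minimal sketch of per-component direct logit attribution to poke at, assuming TransformerLens; gpt2-small, the prompt, and the answer token are placeholders (swap in GPT-Neo Small, where the problem suggests the technique behaves badly):
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")  # swap in GPT-Neo Small here
prompt = "The Eiffel Tower is in the city of"
answer = " Paris"

tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)
answer_id = model.to_single_token(answer)

# Decompose the final residual stream into per-component contributions, apply the
# final LayerNorm scale, then project onto the answer token's unembedding direction.
resid_stack, labels = cache.decompose_resid(layer=-1, return_labels=True)
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1)
attributions = resid_stack[:, 0, -1, :] @ model.W_U[:, answer_id]

for label, attr in zip(labels, attributions.tolist()):
    print(f"{label}: {attr:+.3f}")
```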
Breaking current techniques
6.4
Find edge cases where linearising LayerNorm breaks. See some work by Eric Winsor at Conjecture.
Breaking current techniques
6.5
Find edge cases where activation patching breaks. (It should break when you patch one variable but the model's behaviour depends on multiple variables jointly.)
Breaking current techniques
6.8
Can you find places where one ablation (zero, mean, random) breaks but the others don't?
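A minimal sketch comparing the three ablations on a single head, assuming TransformerLens; the model, prompt, and head are placeholders, the mean is taken over positions of this one prompt as a crude stand-in for a dataset mean, and the random ablation is Gaussian noise scaled to the clean std (one simple choice):
```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
text = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(text)
layer, head = 9, 9  # placeholder head - pick one from a circuit you understand

_, cache = model.run_with_cache(tokens)
z_name = utils.get_act_name("z", layer)
clean_z = cache[z_name]

def make_ablation(kind):
    def hook(z, hook):
        if kind == "zero":
            z[:, :, head, :] = 0.0
        elif kind == "mean":
            # Mean over positions of this prompt (a proper version would average over a dataset).
            z[:, :, head, :] = clean_z[:, :, head, :].mean(dim=1, keepdim=True)
        elif kind == "random":
            z[:, :, head, :] = torch.randn_like(z[:, :, head, :]) * clean_z[:, :, head, :].std()
        return z
    return hook

baseline_loss = model(tokens, return_type="loss").item()
for kind in ["zero", "mean", "random"]:
    loss = model.run_with_hooks(
        tokens, return_type="loss", fwd_hooks=[(z_name, make_ablation(kind))]
    ).item()
    print(f"{kind} ablation: loss {loss:.4f} (baseline {baseline_loss:.4f})")
```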
Breaking current techniques
6.9
Find edge cases where composition scores break. (They don't work well for the IOI circuit)
Breaking current techniques
6.10
Find edge cases where eigenvalue copying scores break.
6.12
Try looking for composition on a specific input. Decompose the residual stream into the sum of outputs of previous heads, then decompose query, key, value into sums of terms from each previous head. Are any larger than the others / matter more if you ablate them / etc?
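A minimal sketch for the query part of this, assuming TransformerLens; gpt2-small, the prompt, the downstream head, and the position are placeholders, and the key and value sides can be handled the same way with W_K and W_V:
```python
# How much does each earlier head's output contribute to one later head's
# query vector on a specific prompt?
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
_, cache = model.run_with_cache(tokens)

later_layer, later_head = 9, 9  # placeholder downstream head
pos = -1                        # query position of interest

# Per-head outputs in the residual stream from all layers before later_layer.
head_results, labels = cache.stack_head_results(layer=later_layer, return_labels=True)
head_results = cache.apply_ln_to_stack(head_results, layer=later_layer)

W_Q = model.W_Q[later_layer, later_head]            # [d_model, d_head]
query_contribs = head_results[:, 0, pos, :] @ W_Q   # [n_components, d_head]
norms = query_contribs.norm(dim=-1)

for label, n in sorted(zip(labels, norms.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{label}: {n:.3f}")
```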
6.14
Compare causal tracing to activation patching. Do they give the same outputs? Can you find situations where one breaks and the other doesn't? (Try IOI task or factual recall task)
ROME activation patching
6.17
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). Can you get anywhere when patching some set of neurons? (E.g., the neurons that activate the most within the 10 layers?)
Automatically find circuits
6.22
Automate ways to find few shot learning heads. (Bonus: Add to TransformerLens!)
Automatically find circuits
6.23
Can you find an automated way to detect pointer arithmetic based induction heads vs. classic induction heads?
Automatically find circuits
6.24
Can you find an automated way to detect the heads used in the IOI Circuit? (S-inhibition, name mover, negative name mover, backup name mover)
Automatically find circuits
6.25
Can you automate detection of the heads used in factual recall to move information about the fact to the final token? (Try activation patching)
Automatically find circuits
6.26
(Infrastructure) Combine some of the head detectors from 6.18-6.25 to make a "wiki" for a range of models, with information and scores for each head for how it falls into different categories. MVP: a pandas DataFrame with a row for each head and a column for each metric (see the sketch below).
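A minimal sketch of that MVP, assuming TransformerLens and pandas; the detector functions are placeholders to be replaced with real scorers from 6.18-6.25:
```python
import pandas as pd
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

def induction_score(model, layer, head):
    return 0.0  # placeholder: plug in a real detector here

def previous_token_score(model, layer, head):
    return 0.0  # placeholder

detectors = {
    "induction": induction_score,
    "previous_token": previous_token_score,
}

rows = []
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        row = {"layer": layer, "head": head}
        for name, fn in detectors.items():
            row[name] = fn(model, layer, head)
        rows.append(row)

head_wiki = pd.DataFrame(rows).set_index(["layer", "head"])
print(head_wiki.head())
```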
Refine max activating dataset examples
6.30
Using 6.28: Corrupt different token embeddings in a sequence to see which matter.
Refine max activating dataset examples
6.31
Using 6.28: Compare to randomly chosen directions in neuron activation space to see how clustered/monosemantic things seem.
Refine max activating dataset examples
6.32
Using 6.28: Validate these by comparing to direct effect of neuron on the logits, or output vocab logits most boosted by that neuron.
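A minimal sketch of the direct-effect check, assuming TransformerLens; gpt2-small and the (layer, neuron) pair are placeholders, and the final LayerNorm scale is ignored for simplicity:
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
layer, neuron = 5, 1234  # placeholder neuron

# The neuron's output direction in the residual stream, projected through the
# unembedding (ignoring the final LayerNorm scale).
logit_effect = model.W_out[layer, neuron] @ model.W_U  # [d_vocab]
top = torch.topk(logit_effect, 10)

for val, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.tokenizer.decode([idx])!r}: {val:+.3f}")
```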
Refine max activating dataset examples
6.33
Using 6.28: Use a model like GPT-3 to find similar text to an existing example and see if they also activate the neuron. Bonus: Use them to replace specific tokens.
Refine max activating dataset examples
6.34
Using 6.28: Look at dataset examples at different quantiles for neuron activations (25%, 50%, 75%, 90%, 95%). Does that change anything?
Refine max activating dataset examples
6.38
Using 6.28: In SoLU models, compare max activating results for pre-SoLU, post-SoLU, and post LayerNorm activations. ('pre', 'mid', 'post' in TransformerLens). How consistent are they? Does one seem more principled?
Interpreting models with LLMs
6.39
Can GPT-3 figure out trends in max activating examples for a neuron?
Interpreting models with LLMs
6.40
Can you use GPT-3 to generate counterfactual prompts with lined-up tokens to do activation patching on novel problems? (E.g., "John gave a bottle of milk to -> Mary" vs. "Mary gave a bottle of milk to -> John")
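A minimal sanity check before using any generated pair, assuming TransformerLens; the prompts here are the example above and would come from the LLM in practice:
```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
clean = "John gave a bottle of milk to"
corrupted = "Mary gave a bottle of milk to"  # LLM-generated counterfactual

clean_toks = model.to_str_tokens(clean)
corr_toks = model.to_str_tokens(corrupted)
# Patching needs token-for-token alignment, differing only where intended.
assert len(clean_toks) == len(corr_toks), "prompts must tokenize to the same length"
diffs = [(i, a, b) for i, (a, b) in enumerate(zip(clean_toks, corr_toks)) if a != b]
print(diffs)
```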
Apply techniques from non-mechanistic interpretability
6.42
How well does feature attribution work on circuits we understand?
6.48
Resolve some of the open issues/feature requests for TransformerLens.
Taking the "diff" of two models
6.50
Using 6.49, run it on a bunch of text and look at the biggest per-token log prob difference.
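A minimal sketch of the per-token comparison, assuming TransformerLens; the two model names and the text are placeholders, and it assumes both models share a tokenizer:
```python
import torch
from transformer_lens import HookedTransformer

model_a = HookedTransformer.from_pretrained("gpt2-small")   # placeholder pair of models
model_b = HookedTransformer.from_pretrained("gpt2-medium")
text = "Some text to compare the two models on."            # placeholder text
tokens = model_a.to_tokens(text)

def per_token_log_probs(model, tokens):
    # Log prob the model assigns to the actual next token at every position.
    with torch.no_grad():
        log_probs = model(tokens).log_softmax(dim=-1)
    return log_probs[0, :-1].gather(-1, tokens[0, 1:, None]).squeeze(-1)

diff = per_token_log_probs(model_a, tokens) - per_token_log_probs(model_b, tokens)
str_tokens = model_a.to_str_tokens(text)[1:]  # drop BOS; aligns with predicted tokens
for tok, d in sorted(zip(str_tokens, diff.tolist()), key=lambda x: -abs(x[1]))[:10]:
    print(f"{tok!r}: {d:+.3f}")
```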
Taking the "diff" of two models
6.51
Using 6.49, run them on various benchmarks and compare performance.
Taking the "diff" of two models
6.52
Using 6.49, try "benchmarks" such as the algorithmic tasks (IOI, acronyms, etc.) from Circuits In the Wild.
Taking the "diff" of two models
6.53
Using 6.49, try qualitative exploration, like just generating text from both models and looking for ideas.
Taking the "diff" of two models
6.54
Build tooling to take the diff of two models with the same internal structure. Includes 6.49 but also lets you compare model internals!
Taking the "diff" of two models
6.55
Using 6.54, look for the largest difference in weights.
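A minimal sketch for ranking weight changes, assuming TransformerLens; both model names are placeholders (as written it loads the same checkpoint twice, so swap the second one for e.g. a fine-tuned version of the first):
```python
import torch
from transformer_lens import HookedTransformer

model_a = HookedTransformer.from_pretrained("gpt2-small")
model_b = HookedTransformer.from_pretrained("gpt2-small")  # swap in the other model here

diffs = []
state_a, state_b = model_a.state_dict(), model_b.state_dict()
for name, param_a in state_a.items():
    param_b = state_b[name]
    # Relative Frobenius-norm change, so large and small tensors are comparable.
    rel_change = (param_a - param_b).norm() / (param_a.norm() + 1e-8)
    diffs.append((name, rel_change.item()))

for name, change in sorted(diffs, key=lambda x: -x[1])[:15]:
    print(f"{name}: {change:.4f}")
```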
Taking the "diff" of two models
6.56
Using 6.54, run them on a bunch of text and look for largest difference in activations.
Taking the "diff" of two models
6.57
Using 6.54, look at the direct logit attribution of layers and heads on various texts, and look for the biggest differences.
Taking the "diff" of two models
6.58
Using 6.54, do activation patching on a piece of text where one model does much better than the other - are some parts key to improved performance?
Breaking current techniques
6.3
Can you fix direct logit attribution in GPT-Neo Small, e.g., by finding a linear approximation to the final layer by taking gradients? (Eleuther's tuned lens in #interp-across-depth would be a good place to start.)
Breaking current techniques
6.6
Find edge cases where causal scrubbing breaks.
Breaking current techniques
6.11
Automate ways to identify heads that compose. Start with IOI circuit and the composition scores in A Mathematical Framework.
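A minimal sketch of a K-composition score in the TransformerLens weight convention (matching A Mathematical Framework up to transposes); the head pair is a placeholder, the score ignores LayerNorm, and in practice you would compare against a random-matrix baseline:
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

def k_composition(prev_layer, prev_head, layer, head):
    # Earlier head's OV circuit feeding the later head's key input.
    W_OV = model.W_V[prev_layer, prev_head] @ model.W_O[prev_layer, prev_head]  # [d_model, d_model]
    W_QK = model.W_Q[layer, head] @ model.W_K[layer, head].T                    # [d_model, d_model]
    return ((W_QK @ W_OV.T).norm() / (W_QK.norm() * W_OV.norm())).item()

# Placeholder pair: a candidate previous-token head composing with a candidate induction head.
print(k_composition(prev_layer=4, prev_head=11, layer=5, head=5))
```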
6.13
Can you automate direct path patching as used in the IOI paper?
Automatically find circuits
6.27
Can you automate the detection of something in neuron interpretability? E.g., trigram neurons.
Automatically find circuits
6.28
Find a good equivalent of max activating dataset examples for attention heads. Validate on induction circuits, then IOI. See the post for ideas.
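A minimal sketch of one candidate metric (the norm of the head's output at each position), assuming TransformerLens; the model, head, and texts are placeholders, and choosing the right metric is the actual open problem:
```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
layer, head = 5, 5  # placeholder head
texts = ["placeholder text one", "placeholder text two"]  # swap in a real dataset

records = []
for text in texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    z = cache[utils.get_act_name("z", layer)][0, :, head, :]  # [pos, d_head]
    score, pos = z.norm(dim=-1).max(dim=0)
    records.append((score.item(), text, pos.item()))

for score, text, pos in sorted(records, reverse=True)[:10]:
    print(f"{score:.3f} at position {pos}: {text!r}")
```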
Refine max activating dataset examples
6.29
Refine the max activating dataset examples technique for neuron interpretability to find minimal or diverse examples.
Refine max activating dataset examples
6.35
Using 6.28: (Infrastructure) Add any of 6.29-6.34 to Neuroscope. Email Neel (neelnanda27@gmail.com) for codebase access.
Apply techniques from non-mechanistic interpretability
6.43
Can you use probing to get evidence for or against predictions in Toy Models of Superposition?
Apply techniques from non-mechanistic interpretability
6.44
Pick anything interesting from Rauker et al. and try to apply the techniques to circuits we understand.
6.46
Take existing circuits and explore quantitative ways to verify that each is a true circuit (or to disprove it!). Try causal scrubbing to start.
6.47
Build on Arthur Conmy's work to automatically find circuits via recursive path patching.
Taking the "diff" of two models
6.49
Build tooling to take the "diff" of two models, treating them as black boxes mapping inputs to outputs, so that it works for models with different internal structure.
6.59
We understand how attention is calculated for a head using the QK matrix. This doesn't work for rotary attention. Can you find a principled alternative?
Interpreting models with LLMs
6.41
Choose your own adventure - can you find a way to usefully use an LLM to interpret models?
Apply techniques from non-mechanistic interpretability
6.45
Wiles et al. give an automated set of techniques to analyse bugs in image classification models. Can you get any traction adapting this to language models?