Difficulty B problems · Open Problems in Mechanistic Interpretability

Understanding neurons

1.1

How far can you get deeply reverse engineering a neuron in a 1L model? 1L is particularly easy since each neuron's output adds directly to the logits.

Understanding neurons

1.2

Find an interesting neuron you think represents a feature. Can you fully reverse engineer which direction should activate that feature, and compare to neuron input direction?

Understanding neurons

1.3

Look for trigram neurons in a 1L model and try to reverse engineer them (e.g., "ice cream -> sundae").

Understanding neurons

1.4

Check out the SoLU paper for more ideas on 1L neurons to find and reverse engineer.

Understanding neurons

1.8

Are there neurons whose behaviour can be matched by a regex or other code? If so, run it on a ton of text and compare the output.
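
One way to make the comparison concrete is to score the regex as a classifier for "neuron fires". This is a minimal sketch, not a prescribed method: the function name, the 0.5 firing threshold, and the toy tokens and activations are all made up for illustration, and real activations would come from running the model on text.

```python
import re

def regex_agreement(tokens, activations, pattern, threshold=0.5):
    """Compare a regex-based prediction with a neuron's recorded activations.

    tokens: list of token strings; activations: one float per token.
    A token counts as "firing" when its activation exceeds `threshold`
    (a hypothetical cutoff; choose one from the activation histogram).
    """
    regex = re.compile(pattern)
    predicted = [bool(regex.fullmatch(tok)) for tok in tokens]
    fired = [act > threshold for act in activations]
    true_pos = sum(p and f for p, f in zip(predicted, fired))
    precision = true_pos / max(sum(predicted), 1)  # regex matches that really fire
    recall = true_pos / max(sum(fired), 1)         # firings the regex explains
    return precision, recall

# Toy data standing in for real model activations.
tokens = [" 1999", " the", " 2004", " cat", " 42"]
activations = [2.1, 0.0, 1.8, 0.1, 0.9]
precision, recall = regex_agreement(tokens, activations, r" \d{4}")
```

High precision with low recall suggests the regex captures only part of what the neuron responds to, which is itself a hint that the neuron may be polysemantic.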

How do larger models differ?

1.9

How do 3-layer and 4-layer attention-only models differ from 2L? (For instance, induction heads only appeared with 2L. Can you find something useful that only appears at 3L or higher?)

How do larger models differ?

1.10

How do 3-layer and 4-layer attention-only models differ from 2L? Look for composition scores - try to identify pairs of heads that compose a lot.
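
As a rough sketch of what a composition score measures, assuming the Frobenius-norm ratio used in A Mathematical Framework for Transformer Circuits: the norm of the product of two heads' matrices, divided by the product of their norms. Which matrices to plug in (for example an earlier head's OV matrix and a later head's QK or OV matrix) depends on the composition type; the 2x2 matrices below are toy stand-ins, not real weights.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def frobenius(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))

def composition_score(w_later, w_earlier):
    """Ratio ||W_later W_earlier||_F / (||W_later||_F ||W_earlier||_F)."""
    product = matmul(w_later, w_earlier)
    return frobenius(product) / (frobenius(w_later) * frobenius(w_earlier))

# Toy matrices: the earlier head writes only to dimension 0.
writes_dim0 = [[1.0, 0.0], [0.0, 0.0]]
reads_dim0 = [[1.0, 0.0], [0.0, 0.0]]   # later head reads dimension 0
reads_dim1 = [[0.0, 0.0], [0.0, 1.0]]   # later head reads dimension 1

aligned = composition_score(reads_dim0, writes_dim0)      # heads share a subspace
orthogonal = composition_score(reads_dim1, writes_dim0)   # heads do not interact
```

Computing this for every (earlier head, later head) pair and sorting gives a shortlist of pairs worth inspecting by hand.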

How do larger models differ?

1.11

How do 3-layer and 4-layer attention-only models differ from 2L? Look for evidence of composition.

How do larger models differ?

1.12

How do 3-layer and 4-layer attention-only models differ from 2L? Ablate a single head and run the model on a lot of text. Look at the change in performance. Do any heads matter a lot that aren't induction heads?

How do larger models differ?

1.13

Look for tasks that an n-layer model can't do, but an n+1-layer model can, and look for a circuit that explains this. (Start by running both models on a bunch of text and look for per-token probability differences)
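
The suggested starting point can be sketched as a ranking of tokens by how much better the deeper model predicts them. The helper name and the log-prob values here are invented for illustration; in practice both lists would be the per-token log probabilities each model assigns to the correct next token on the same text.

```python
def largest_gaps(tokens, logprobs_small, logprobs_large, k=3):
    """Return the k positions where the larger model beats the smaller one most.

    Each result is (token, position, gap), where gap is the larger model's
    log prob minus the smaller model's log prob for the correct token.
    """
    gaps = [
        (large - small, pos, tok)
        for pos, (tok, small, large) in enumerate(
            zip(tokens, logprobs_small, logprobs_large)
        )
    ]
    gaps.sort(reverse=True)
    return [(tok, pos, round(gap, 3)) for gap, pos, tok in gaps[:k]]

# Toy numbers: the second " cat" is only predictable by copying from context.
tokens = ["The", " cat", " sat", " on", " the", " cat"]
logprobs_small = [-3.0, -6.0, -4.0, -1.0, -0.5, -5.5]
logprobs_large = [-3.1, -5.8, -3.9, -1.0, -0.5, -0.7]
top = largest_gaps(tokens, logprobs_small, logprobs_large, k=1)
```

Positions with large gaps that share a pattern (here, a repeated token) are candidates for a task that only the deeper model can do.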

How do larger models differ?

1.14

How do 1L SoLU/GELU models differ from 1L attention-only?

How do larger models differ?

1.15

How do 2L SoLU models differ from 1L?

How do larger models differ?

1.16

How does 1L GELU differ from 1L SoLU?

How do larger models differ?

1.17

Analyse how a larger model "fixes the bugs" of a smaller model.

How do larger models differ?

1.18

Does a 1L MLP transformer fix the skip trigram bugs of a 1L Attn Only model? If so, how?

How do larger models differ?

1.19

Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Try looking at split-token induction, where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens. E.g " Claire" vs. "Cl" "aire"

How do larger models differ?

1.20

Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at misfiring when the previous token appears multiple times with different following tokens

How do larger models differ?

1.21

Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at stopping induction on a token that likely shows the end of a repeated string (e.g, . or ! or ")

How do larger models differ?

1.22

Does a 2L MLP model fix these bugs (1.19-1.21) too?

Circuits in natural language

2.1

Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?

Circuits in natural language

2.2

Continuing sequences that are common in natural language (e.g., "1 2 3 4" -> "5", "Monday\nTuesday\n" -> "Wednesday").

I did some preliminary work on this during a hackathon this July, and found components shared between sequence continuation tasks, such as head 9.1, that were found to output the “next member” of a circuit. The work was rushed and crude, but I am looking to polish and continue it in the future. A link to it can be found here:

https://alignmentjam.com/project/one-is-1-analyzing-activations-of-numerical-words-vs-digits

Pablo Hansen - April 18, 2024

Circuits in natural language

2.3

A harder example would be numbers at the start of lines, like "1. Blah blah blah \n2. Blah blah blah\n"-> "3". Feels like it must be doing something induction-y!

Circuits in natural language

2.4

3 letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> RFU

Circuits in natural language

2.5

Converting names to emails, like "Katy Johnson <" -> "katy_johnson"

Circuits in natural language

2.8

Learning that words after full stops are capital letters.

Circuits in natural language

2.9

Counting objects described in text. (E.g, I picked up an apple, a pear, and an orange. I was holding three fruits.)

Circuits in natural language

2.11

Reverse engineer an induction head in a non-toy model.

Circuits in natural language

2.12

Choosing the right pronouns (E.g, "Lina is a great friend, isn't")

Alana Xiang - 5 May 2023

Circuits in code models

2.14

Closing brackets. Bonus: Tracking correct brackets - [, (, {, etc.

Circuits in code models

2.15

Closing HTML tags

Extensions to IOI paper

2.20

Is there a general pattern for backup-ness? (Follows 2.19)

Manan Suri - 14 July, 2023

Extensions to IOI paper

2.22

Understand IOI in GPT-Neo. Same size but seems to do IOI via MLP composition.

Extensions to IOI paper

2.25

GPT-Neo wasn't trained with dropout - check 2.24 on this.

Extensions to IOI paper

2.26

Reverse engineering L4H11, a really sharp previous token head in GPT-2-small, at the parameter level.

Confusing things

2.29

Why do models have so many induction heads? How do they specialise, and why does the model need so many?

Confusing things

2.30

Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?

Confusing things

2.31

Can we find evidence of the residual stream as shared bandwidth hypothesis?

Confusing things

2.32

Can we find evidence of the residual stream as shared bandwidth hypothesis? In particular, the idea that the model dedicates parameters to memory management and cleaning up memory once it's used. Are there neurons with high negative cosine sim between their input and output weights (so the output erases the input feature)? Do they correspond to cleaning up specific features?
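
A minimal sketch of the cosine-similarity check, assuming each neuron's input and output weights are available as vectors in the residual stream basis. The -0.5 cutoff, the function names, and the toy weights are illustrative assumptions rather than values from any real model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def erasure_candidates(w_in, w_out, cutoff=-0.5):
    """Neurons whose output direction points against their input direction.

    w_in[i] and w_out[i] are the input and output weight vectors of neuron i,
    both of length d_model. Returns (neuron index, cosine sim) pairs below cutoff.
    """
    sims = [cosine(a, b) for a, b in zip(w_in, w_out)]
    return [(i, s) for i, s in enumerate(sims) if s < cutoff]

# Toy weights for three neurons in a 3-dimensional residual stream.
w_in = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
w_out = [[-1.0, 0.0, 0.1], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
candidates = erasure_candidates(w_in, w_out)
```

Neurons that pass this filter are only candidates: checking whether they really clean up a specific feature still requires looking at what activates them.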

Confusing things

2.33

What happens to the memory in an induction circuit? (See 2.32)

Harder problems

3.8

5-digit addition/subtraction.

Harder problems

3.9

Predicting the output of a simple code function. E.g., problems like "a = 1 2 3. a[2] = 4. a -> 1 2 4"

Harder problems

3.10

Graph theory problems like this. Unsure of the correct input format. Try a bunch. See here

Harder problems

3.11

Train a model on multiple algorithmic tasks we understand (like modular addition and subtraction). Compare to a model trained on each task. Does it learn the same circuits? Is there superposition?

Joshua ; jhdhill@uwaterloo.ca ; jan 31 2024

Harder problems

3.12

Train models for automata tasks and interpret them. Do your results match the theory?

Harder problems

3.13

In-Context Linear Regression - the transformer gets a sequence (x_1, y_1, x_2, y_2, ...) where y_i = Ax_i + b. A and b are different for each prompt, and need to be learned in-context. (Code here)
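
To illustrate the data format described above (this is not the linked code), here is a sketch that samples a fresh A and b per prompt and interleaves the x and y vectors. The Gaussian sampling, the dimensions, and the function name are assumptions made for the example.

```python
import random

def make_prompt(n_pairs=4, dim=2, seed=0):
    """Sample one prompt (x_1, y_1, x_2, y_2, ...) with y_i = A x_i + b.

    A and b are drawn once per prompt, so they must be inferred in-context.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    b = [rng.gauss(0, 1) for _ in range(dim)]
    seq = []
    for _ in range(n_pairs):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        y = [sum(a * xj for a, xj in zip(row, x)) + bi for row, bi in zip(A, b)]
        seq.extend([x, y])
    return A, b, seq

A, b, seq = make_prompt(n_pairs=3, dim=2, seed=1)
x0, y0 = seq[0], seq[1]  # first (input, target) pair of the prompt
```

Changing the seed gives a new A and b, so a model can only do well on later pairs by using the earlier pairs in the same prompt.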

Harder problems

3.16

Predict repeated subsequences in randomly generated tokens, and see if you can find and reverse engineer induction heads.
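
A sketch of the dataset and of what "induction" would predict on it, assuming the simplest setup: a random sequence of distinct token ids followed by an exact repeat. The vocabulary size, lengths, and function names are illustrative choices.

```python
import random

def repeated_sequence(half_len=5, vocab_size=50, seed=0):
    """Random distinct token ids followed by an exact copy of themselves."""
    rng = random.Random(seed)
    first = rng.sample(range(vocab_size), half_len)
    return first + first

def induction_targets(tokens):
    """Map each position to the next token an induction rule would predict.

    If the current token appeared earlier, predict the token that followed
    its most recent earlier occurrence.
    """
    last_seen = {}
    targets = {}
    for pos, tok in enumerate(tokens):
        if tok in last_seen:
            targets[pos] = tokens[last_seen[tok] + 1]
        last_seen[tok] = pos
    return targets

tokens = repeated_sequence()
targets = induction_targets(tokens)
```

Loss on the second half of such sequences, compared with the first half, is a simple signal of whether a trained model has learned induction.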

3.18

Build a toy model of Indirect Object Identification - train a tiny attention-only model on an algorithmic task simulating IOI - and reverse-engineer the learned solution. Compare it to the circuit found in GPT-2 Small.

Questions about language models

3.22

Train a 3L attention-only transformer to perform the Indirect Object Identification task. Can it do the task? Does it learn the same circuit found in GPT-2 Small?

Questions about language models

3.23

Redo Neel's modular addition analysis with GELU. Does it change things?

Questions about language models

3.26

In modular addition, look at what different dimensionality reduction techniques do on different weight matrices. Can you identify which weights matter most? Which neurons form clusters for each frequency? Anything from activations?

Extending Othello-GPT

3.32

Neuron Interpretability and Studying Superposition - try to understand the model's MLP neurons, and explore what techniques do and don't work. Try to build our understanding of transformer MLPs in general.

Confusions to study in Toy Models of Superposition

4.2

Replicate their absolute value model and study some of the variants of the ReLU output models.

May 4, 2023 - Kunvar (firstuserhere)

Confusions to study in Toy Models of Superposition

4.3

Explore neuron superposition by training their absolute value model on a more complex function like x -> x^2.

Confusions to study in Toy Models of Superposition

4.4

What happens to their ReLU output model when there's non-uniform sparsity? E.g., one class of less sparse features and another of very sparse features.

Confusions to study in Toy Models of Superposition

4.6

Explore neuron superposition by training their absolute value model on functions of multiple variables. Keep the inputs as uniform reals in [0, 1] and look at max(x, y)

Confusions to study in Toy Models of Superposition

4.8

Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features discrete (1, 2, 3)

Confusions to study in Toy Models of Superposition

4.9

Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features uniform [0.5, 1]

April 30, 2023; Kunvar(firstuserhere)

Studying bottleneck superposition in real language models

4.21

Induction heads copy the token they attend to into the output, which involves storing which of 50,000 tokens it is. How are these stored in a 64-dimensional space?

Studying bottleneck superposition in real language models

4.22

How does the previous token head in an induction circuit communicate the value of the previous token to the key of the induction head? Bonus: What residual stream subspace does it take up? Is there interference?

Studying bottleneck superposition in real language models

4.23

How does the IOI circuit communicate names/positions between composing heads?

Studying bottleneck superposition in real language models

4.24

Are there dedicated dimensions for positional embeddings? Do any other components write to those dimensions?

Studying neuron superposition in real models

4.29

Look at a polysemantic neuron in a 1L language model. Can you figure out how the model disambiguates what feature it is?

Studying neuron superposition in real models

4.31

Take a feature that's part of a polysemantic neuron in a 1L language model and try to identify every neuron that represents that feature. Is it sparse or diffuse?

Comparing SoLU/GELU

4.38

Can you find any better metrics for polysemanticity?

Comparing SoLU/GELU

4.39

The paper speculates LayerNorm lets the model "smuggle through" superposition in SoLU models by smearing features across many dimensions and letting LayerNorm scale it up. Can you find evidence of this?

Comparing SoLU/GELU

4.40

How similar are the neurons between SoLU/GELU models of the same layers?

Algorithmic tasks - understanding grokking

5.1

Understanding why 5 digit addition has a phase change per digit (so 6 total?!)

Algorithmic tasks - understanding grokking

5.3

Look at the PCA of logits on the full dataset, or the PCA of a stack of flattened weights. If you plot a scatter plot of the first 2 components, the different phases of training are clearly visible. What's up with this?

Algorithmic tasks - understanding grokking

5.6

What happens if we include in the loss one of the progress measures in Neel's grokking post? Can we accelerate or stop grokking?

Algorithmic tasks - understanding grokking

5.7

Adam Jermyn provides an analytical argument and some toy models for why phase transitions should be an inherent part of (some of) how models learn. Can you find evidence of this in more complex models?

Algorithmic tasks - understanding grokking

5.8

Build on and refine Adam Jermyn's arguments and toy models - think about how they deviate from a real transformer, and build more faithful models.

Algorithmic tasks - lottery tickets

5.9

For a toy model trained to form induction heads, is there a lottery-ticket style thing going on? Can you disrupt induction head formation by messing with the initialisation?

Algorithmic tasks - lottery tickets

5.11

Identify the parameters that form important circuits at the end of training on some toy task, then knock them out at the start of training instead. How much does that delay or stop generalisation?

Algorithmic tasks - lottery tickets

5.12

Analysing how pairs of heads in an induction circuit compose over time - Can you find progress measures which predict these?

Algorithmic tasks - lottery tickets

5.13

Analysing how pairs of heads in an induction circuit compose over time - Can we predict which heads will learn to compose first?

Algorithmic tasks - lottery tickets

5.14

Analysing how pairs of heads in an induction circuit compose over time - Does the composition develop as a phase transition?

Understanding fine-tuning

5.17

How is the model different on fine-tuned text? Look at examples where the model does much better after fine-tuning, and some normal text.

Understanding fine-tuning

5.18

Try activation patching between the old and fine-tuned model and see how hard recovering performance is.

Understanding fine-tuning

5.19

Look at max activating text for various neurons in the original models. How has it changed post fine-tuning?

Understanding fine-tuning

5.20

Explore further and see what's going on with fine-tuning mechanistically.

Understanding training dynamics in language models

5.22

Can you replicate the induction head phase transition results in the various checkpointed models in TransformerLens? (If code works for attn-only-2l it should work for them all)

Understanding training dynamics in language models

5.23

Look at the neurons in TransformerLens SoLU models during training. Do they tend to form as a phase transition?

Finding phase transitions

5.27

Try digging into the specific heads that act on IOI and look for phase transitions. Use direct logit attribution for the name movers.

Finding phase transitions

5.28

Study the attention patterns of each category of heads in IOI for phase transitions.

Finding phase transitions

5.29

Look for phase transitions in simple IOI-style algorithmic tasks, like few-shot learning, addition, sorting words alphabetically...

Finding phase transitions

5.30

Look for phase transitions in soft induction heads like translation.

Studying path dependence

5.34

How much do the Stanford CRFM models differ with algorithmic tasks like Indirect Object Identification?

Studying path dependence

5.36

When model scale varies (e.g, GPT-2 small vs. medium) is there anything the smaller model can do that the larger one can't do? (Look at difference in per token log prob)

Studying path dependence

5.37

Try applying the Git Re-Basin techniques to a 2L MLP trained for modular addition. Does this work? If you use Neel's grokking work to analyse the circuits involved, how does the re-basin technique map onto the circuits?

Breaking current techniques

6.2

Break direct logit attribution - start by looking at GPT-Neo Small where the logit lens (precursor to direct logit attribution) seems to work badly, but works well if you include the final layer and the unembed.

Breaking current techniques

6.4

Find edge cases where linearising LayerNorm breaks. See some work by Eric Winsor at Conjecture.

Breaking current techniques

6.5

Find edge cases where activation patching breaks. (It should break when you patch one variable but the output depends on multiple variables.)
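
A toy illustration of that failure mode, using a made-up two-variable "model" whose output is the AND of its inputs. Patching either variable alone from the clean run into the corrupted run recovers nothing, even though both variables matter; all names here are hypothetical.

```python
def toy_model(a, b):
    """Output is 1.0 only when both internal variables are on."""
    return float(a and b)

def patch_effect(clean, corrupted, patched_vars):
    """Change in output when the named variables are copied from the clean
    run into the corrupted run."""
    inputs = dict(corrupted)
    for name in patched_vars:
        inputs[name] = clean[name]
    return toy_model(**inputs) - toy_model(**corrupted)

clean = {"a": 1, "b": 1}       # toy_model gives 1.0
corrupted = {"a": 0, "b": 0}   # toy_model gives 0.0

effect_a = patch_effect(clean, corrupted, ["a"])           # patch a only
effect_b = patch_effect(clean, corrupted, ["b"])           # patch b only
effect_both = patch_effect(clean, corrupted, ["a", "b"])   # patch both
```

Single-variable patching would wrongly suggest neither variable matters here, so real edge cases may look like components that only show an effect when patched together.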

Breaking current techniques

6.8

Can you find places where one ablation (zero, mean, random) breaks but the others don't?
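
For reference, the three ablations can be sketched on the activations of a single component across a batch. This is a toy with invented numbers: the component has a large, nearly constant activation, which is one situation where zero ablation moves the model much further off-distribution than mean or random (resampling) ablation.

```python
import random

def ablate(acts, kind, rng=None):
    """Replace per-example activations of one component.

    zero: all zeros; mean: the batch mean; random: values resampled
    from other examples in the same batch.
    """
    if kind == "zero":
        return [0.0 for _ in acts]
    if kind == "mean":
        mean = sum(acts) / len(acts)
        return [mean for _ in acts]
    if kind == "random":
        rng = rng or random.Random(0)
        return [rng.choice(acts) for _ in acts]
    raise ValueError(f"unknown ablation: {kind}")

def mean_abs_change(acts, ablated):
    """Average size of the perturbation the ablation introduces."""
    return sum(abs(a - b) for a, b in zip(acts, ablated)) / len(acts)

acts = [5.0, 5.1, 4.9, 5.0]  # toy activations with a large constant offset
changes = {
    kind: mean_abs_change(acts, ablate(acts, kind))
    for kind in ("zero", "mean", "random")
}
```

Comparing the downstream loss under each ablation, rather than just the size of the perturbation, is what the problem actually asks for; this only shows why the three can disagree.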

Breaking current techniques

6.9

Find edge cases where composition scores break. (They don't work well for the IOI circuit)

Breaking current techniques

6.10

Find edge cases where eigenvalue copying scores break.

6.12

Try looking for composition on a specific input. Decompose the residual stream into the sum of outputs of previous heads, then decompose query, key, value into sums of terms from each previous head. Are any larger than the others / matter more if you ablate them / etc?
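
Because the query is a linear function of the residual stream, the decomposition can be sketched directly. This toy assumes LayerNorm is ignored (or treated as a fixed scaling) and omits the usual scaling of attention scores; the component names, the query matrix, and the vectors are all invented.

```python
def matvec(m, v):
    """Matrix-vector product, with the matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def query_side_contributions(component_outputs, w_q, key):
    """Split one attention score into a term per upstream component.

    component_outputs maps a component name to the vector it wrote into the
    residual stream at the query position. Since the query is linear in the
    residual stream, the terms sum to the full pre-softmax score.
    """
    return {
        name: dot(matvec(w_q, out), key)
        for name, out in component_outputs.items()
    }

component_outputs = {
    "embed": [1.0, 0.0],
    "head 0.1": [0.0, 2.0],
    "head 0.3": [0.5, 0.5],
}
w_q = [[1.0, 0.0], [0.0, 3.0]]  # toy query matrix
key = [1.0, 1.0]                # toy key vector at the attended position

contributions = query_side_contributions(component_outputs, w_q, key)
residual = [sum(vals) for vals in zip(*component_outputs.values())]
full_score = dot(matvec(w_q, residual), key)
```

The same split applies to keys and values, so a component with an unusually large term is a candidate for composing with the head on this particular input.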

6.14

Compare causal tracing to activation patching. Do they give the same outputs? Can you find situations where one breaks and the other doesn't? (Try IOI task or factual recall task)

ROME activation patching

6.17

In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). Can you get anywhere when patching some set of neurons? (E.g, the neurons that activate the most within the 10 layers?)

Automatically find circuits

6.22

Automate ways to find few shot learning heads. (Bonus: Add to TransformerLens!)

Automatically find circuits

6.23

Can you find an automated way to detect pointer arithmetic based induction heads vs. classic induction heads?

Automatically find circuits

6.24

Can you find an automated way to detect the heads used in the IOI Circuit? (S-inhibition, name mover, negative name mover, backup name mover)

Automatically find circuits

6.25

Can you automate detection of the heads used in factual recall to move information about the fact to the final token? (Try activation patching)

Automatically find circuits

6.26

(Infrastructure) Combine some of the head detectors from 6.18-6.25 to make a "wiki" for a range of models, with information and scores for each head for how it falls into different categories. MVP: Pandas Dataframes with a row for each head and a column for each metric.

Refine max activating dataset examples

6.30

Using 6.28: Corrupt different token embeddings in a sequence to see which matter.

Refine max activating dataset examples

6.31

Using 6.28: Compare to randomly chosen directions in neuron activation space to see how clustered/monosemantic things seem.

Refine max activating dataset examples

6.32

Using 6.28: Validate these by comparing to direct effect of neuron on the logits, or output vocab logits most boosted by that neuron.

Refine max activating dataset examples

6.33

Using 6.28: Use a model like GPT-3 to find similar text to an existing example and see if they also activate the neuron. Bonus: Use them to replace specific tokens.

Refine max activating dataset examples

6.34

Using 6.28: Look at dataset examples at different quantiles for neuron activations (25%, 50%, 75%, 90%, 95%). Does that change anything?
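
A small sketch of picking examples at fixed quantiles instead of only the top of the distribution. The function name and the toy data are assumptions; the index rule (sort, then take the example at position q times n) is one simple convention among several.

```python
def examples_at_quantiles(examples, activations,
                          quantiles=(0.25, 0.5, 0.75, 0.9, 0.95)):
    """Return, for each quantile, the (example, activation) found there."""
    order = sorted(range(len(activations)), key=lambda i: activations[i])
    picked = {}
    for q in quantiles:
        # Clamp so that q close to 1.0 still yields a valid index.
        idx = order[min(int(q * len(order)), len(order) - 1)]
        picked[q] = (examples[idx], activations[idx])
    return picked

# Toy data: 20 snippets with steadily increasing activation.
examples = [f"text {i}" for i in range(20)]
activations = [i * 0.1 for i in range(20)]
picked = examples_at_quantiles(examples, activations)
```

If the examples at the 50% or 75% quantile look unrelated to the max activating ones, the neuron is probably doing more than its top examples suggest.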

Refine max activating dataset examples

6.38

Using 6.28: In SoLU models, compare max activating results for pre-SoLU, post-SoLU, and post LayerNorm activations. ('pre', 'mid', 'post' in TransformerLens). How consistent are they? Does one seem more principled?

Interpreting models with LLMs

6.39

Can GPT-3 figure out trends in max activating examples for a neuron?

Interpreting models with LLMs

6.40

Can you use GPT-3 to generate counterfactual prompts with lined up tokens to do activation patching on novel problems? (E.g, "John gave a bottle of milk to -> Mary" vs. "Mary gave a bottle of milk to -> John")

Apply techniques from non-mechanistic interpretability

6.42

How well does feature attribution work on circuits we understand?

6.48

Resolve some of the open issues/feature requests for TransformerLens.