Open Problems in Mechanistic Interpretability

Understanding neurons

1.1

How far can you get deeply reverse engineering a neuron in a 1L model? 1L is particularly easy since each neuron's output adds directly to the logits.
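As a rough illustration of why 1L is easy: a neuron's direct contribution to each logit is its activation times the product of its output weights and the unembedding. The sketch below uses toy matrices and assumes no final LayerNorm and no biases; the function and variable names are hypothetical, not from any particular library.

```python
def neuron_logit_effect(activation, w_out_row, w_unembed):
    """Direct effect of one neuron on every logit.

    activation: scalar post-nonlinearity activation of the neuron
    w_out_row:  the neuron's output weights, length d_model
    w_unembed:  unembedding matrix as nested lists, shape [d_model][vocab]
    Assumption: no final LayerNorm and no biases (a simplification).
    """
    d_model = len(w_out_row)
    vocab = len(w_unembed[0])
    return [
        activation * sum(w_out_row[d] * w_unembed[d][v] for d in range(d_model))
        for v in range(vocab)
    ]


# Toy example with d_model = 2 and a vocabulary of 3 tokens.
w_out_row = [1.0, -1.0]
w_unembed = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0],
]
effects = neuron_logit_effect(2.0, w_out_row, w_unembed)
print(effects)  # [2.0, -2.0, 0.0]
```

Sorting the resulting list shows which tokens the neuron most boosts or suppresses when it fires.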

Understanding neurons

1.2

Find an interesting neuron you think represents a feature. Can you fully reverse engineer which direction should activate that feature, and compare it to the neuron's input direction?

Understanding neurons

1.3

Look for trigram neurons in a 1L model and try to reverse engineer them (e.g., "ice cream -> sundae").

Understanding neurons

1.4

Check out the SoLU paper for more ideas on 1L neurons to find and reverse engineer.

Understanding neurons

1.5

How far can you get deeply reverse engineering a neuron in a 2+ layer model?

Understanding neurons

1.6

Hunt through Neuroscope for the toy models and look for interesting neurons to focus on.

Understanding neurons

1.7

Can you find any polysemantic neurons in Neuroscope? Explore this.

Understanding neurons

1.8

Are there neurons whose behaviour can be matched by a regex or other code? If so, run it on a ton of text and compare the output.
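One possible way to score such a comparison is sketched below: treat the regex as a per-token predictor of whether the neuron fires, then compute precision and recall against thresholded activations. The tokens, activations, pattern, and threshold are made-up illustrative values; in practice the activations would come from running the model.

```python
import re


def regex_vs_neuron(tokens, activations, pattern, threshold):
    """Compare a regex's per-token predictions with thresholded neuron activations.

    Returns (precision, recall) of the regex as a predictor of "neuron fired".
    """
    regex = re.compile(pattern)
    tp = fp = fn = 0
    for token, act in zip(tokens, activations):
        predicted = regex.fullmatch(token) is not None
        fired = act > threshold
        if predicted and fired:
            tp += 1
        elif predicted and not fired:
            fp += 1
        elif fired:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: a neuron that seems to fire on four-digit years.
tokens = [" the", " 1999", " cat", " 2024", " 7"]
activations = [0.0, 3.1, 0.2, 2.7, 0.1]
precision, recall = regex_vs_neuron(tokens, activations, r" \d+", threshold=1.0)
```

Here the pattern is too broad (it also matches " 7"), so recall is perfect but precision is not, which is the kind of mismatch worth looking for at scale.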

How do larger models differ?

1.9

How do 3-layer and 4-layer attention-only models differ from 2L? (For instance, induction heads only appeared with 2L. Can you find something useful that only appears at 3L or higher?)

How do larger models differ?

1.10

How do 3-layer and 4-layer attention-only models differ from 2L? Look for composition scores - try to identify pairs of heads that compose a lot.
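A minimal sketch of a composition score, assuming the usual definition as the Frobenius norm of a product of two weight matrices divided by the product of their individual norms. Which matrices you multiply (for example an earlier head's OV matrix with a later head's QK or OV matrix) and in which order depends on your convention, so the names below are placeholders.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def frobenius(m):
    """Frobenius norm of a nested-list matrix."""
    return math.sqrt(sum(x * x for row in m for x in row))


def composition_score(w_early, w_late):
    """Normalised Frobenius norm of the product w_early @ w_late."""
    denom = frobenius(w_early) * frobenius(w_late)
    return frobenius(matmul(w_early, w_late)) / denom if denom else 0.0


# Toy example: the early head writes only to dimension 0.
w_early = [[1.0, 0.0], [0.0, 0.0]]
reads_dim0 = [[1.0, 0.0], [0.0, 0.0]]  # later head reads what was written
reads_dim1 = [[0.0, 0.0], [0.0, 1.0]]  # later head reads an unrelated subspace
print(composition_score(w_early, reads_dim0))  # 1.0
print(composition_score(w_early, reads_dim1))  # 0.0
```

Computing this for every (earlier head, later head) pair gives a matrix in which unusually high entries are candidates for composing pairs.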

How do larger models differ?

1.11

How do 3-layer and 4-layer attention-only models differ from 2L? Look for evidence of composition.

How do larger models differ?

1.12

How do 3-layer and 4-layer attention-only models differ from 2L? Ablate a single head and run the model on a lot of text. Look at the change in performance. Do any heads matter a lot that aren't induction heads?

How do larger models differ?

1.13

Look for tasks that an n-layer model can't do, but an n+1-layer model can, and look for a circuit that explains this. (Start by running both models on a bunch of text and look for per-token probability differences)
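The first step suggested above could look something like the sketch below, which ranks tokens by how much more log-probability the larger model assigns to the correct token. The per-token log-probabilities are hypothetical inputs; producing them requires running both models on the same text.

```python
def biggest_improvements(tokens, logprobs_small, logprobs_large, top_k=3):
    """Rank tokens by how much more log-probability the larger model assigns.

    Returns a list of (difference, position, token), largest difference first.
    """
    diffs = [
        (large - small, position, token)
        for position, (token, small, large) in enumerate(
            zip(tokens, logprobs_small, logprobs_large)
        )
    ]
    return sorted(diffs, reverse=True)[:top_k]


# Hypothetical log-probs: only the larger model predicts the repeated " cat".
tokens = ["The", " cat", " sat", " on", " the", " cat"]
small = [-5.0, -6.0, -4.0, -1.0, -0.5, -6.0]
large = [-5.0, -6.0, -4.0, -1.0, -0.5, -0.5]
print(biggest_improvements(tokens, small, large, top_k=1))  # [(5.5, 5, ' cat')]
```

Tokens at the top of this ranking are natural starting points when looking for a circuit that exists only in the deeper model.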

How do larger models differ?

1.14

How do 1L SoLU/GELU models differ from 1L attention-only?

How do larger models differ?

1.15

How do 2L SoLU models differ from 1L?

How do larger models differ?

1.16

How does 1L GELU differ from 1L SoLU?

How do larger models differ?

1.17

Analyse how a larger model "fixes the bugs" of a smaller model.

How do larger models differ?

1.18

Does a 1L MLP transformer fix the skip trigram bugs of a 1L Attn Only model? If so, how?

How do larger models differ?

1.19

Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Try looking at split-token induction, where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens, e.g., " Claire" vs. "Cl" "aire".

How do larger models differ?

1.20

Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at misfiring when the previous token appears multiple times with different following tokens

How do larger models differ?

1.21

Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at stopping induction on a token that likely shows the end of a repeated string (e.g, . or ! or ")

How do larger models differ?

1.22

Does a 2L MLP model fix these bugs (1.19-1.21) too?

1.23

Choose your own adventure: Take a bunch of text with interesting patterns and run the models over it. Look for tokens they do really well on and try to reverse engineer what's going on!

Circuits in natural language

2.1

Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?

Circuits in natural language

2.2

Continuing sequences that are common in natural language (e.g., "1 2 3 4" -> "5", "Monday\nTuesday\n" -> "Wednesday").

I did some preliminary work on this during a hackathon this July, and found components shared between sequence continuation tasks, such as head 9.1, that were found to output the “next member” of a circuit. The work was rushed and crude, but I am looking to polish and continue it in the future. A link to it can be found here:

https://alignmentjam.com/project/one-is-1-analyzing-activations-of-numerical-words-vs-digits

Pablo Hansen- April 18- 2024

Circuits in natural language

2.3

A harder example would be numbers at the start of lines, like "1. Blah blah blah \n2. Blah blah blah\n" -> "3". Feels like it must be doing something induction-y!

Circuits in natural language

2.4

3 letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> RFU

Circuits in natural language

2.5

Converting names to emails, like "Katy Johnson <" -> "katy_johnson"

Circuits in natural language

2.6

A harder version of 2.5 is constructing an email from a snippet, like Name: Jess Smith, Email: last name dot first name k @ gmail

Circuits in natural language

2.7

Interpret factual recall. Start with ROME's work with causal tracing, but how much more specific can you get? Heads? Neurons?

Paper

Circuits in natural language

2.8

Learning that words after full stops are capitalised.

Circuits in natural language

2.9

Counting objects described in text. (E.g, I picked up an apple, a pear, and an orange. I was holding three fruits.)

Circuits in natural language

2.10

Interpreting memorisation. Sometimes GPT knows phone numbers. How?

Circuits in natural language

2.11

Reverse engineer an induction head in a non-toy model.

Circuits in natural language

2.12

Choosing the right pronouns (e.g., "Lina is a great friend, isn't")

Code

Alana Xiang - 5 May 2023

Circuits in natural language

2.13

Choose your own adventure! Try finding behaviours of your own related to natural language circuits.

Circuits in code models

2.14

Closing brackets. Bonus: Tracking correct brackets - [, (, {, etc.
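To evaluate a model on this task you need a ground-truth label for which bracket should close next. A stack-based reference is one simple way to get it; the sketch below is an assumption about how you might label prompts, not part of any existing codebase, and it ignores brackets inside strings or comments.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}


def expected_closer(prefix):
    """Return the bracket that should close next, or None if nothing is open.

    Closers that do not match the most recent opener are ignored
    (a simplifying assumption for malformed input).
    """
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
    return stack[-1] if stack else None


print(expected_closer("foo([1, {2: 3}"))  # ]
print(expected_closer("()"))              # None
```

Comparing the model's top predicted closing bracket with this label, as a function of nesting depth, gives a quick picture of where the behaviour breaks down.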

Circuits in code models

2.15

Closing HTML tags

Circuits in code models

2.16

Methods depend on object type (e.g., x.append for a list, x.update for a dictionary)

Circuits in code models

2.17

Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic flavored tasks should be easiest.

Extensions to IOI paper

2.18

Understand IOI in the Stanford Mistral models. Does the same circuit arise? (You should be able to copy Redwood's code for this nearly exactly.)

Code

Extensions to IOI paper

2.19

Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?

Extensions to IOI paper

2.20

Is there a general pattern for backup-ness? (Follows 2.19)

Manan Suri - 14 July, 2023

Extensions to IOI paper

2.21

Can we reverse engineer how duplicate token heads work deeply? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates since the current token is always a copy of itself?

Extensions to IOI paper

2.22

Understand IOI in GPT-Neo. Same size but seems to do IOI via MLP composition.

Extensions to IOI paper

2.23

What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?

Extensions to IOI paper

2.24

Under what conditions do the compensation mechanisms occur, where ablating a name mover doesn't reduce performance much? Is it due to dropout?

Extensions to IOI paper

2.25

GPT-Neo wasn't trained with dropout - check 2.24 on this.

Extensions to IOI paper

2.26

Reverse engineering L4H11, a really sharp previous token head in GPT-2-small, at the parameter level.

Extensions to IOI paper

2.27

MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?

Extensions to IOI paper

2.28

Understanding what's happening in the adversarial examples, most notably the S-Inhibition Head attention pattern (hard)

Confusing things

2.29

Why do models have so many induction heads? How do they specialise, and why does the model need so many?

Confusing things

2.30

Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?

Confusing things

2.31

Can we find evidence of the residual stream as shared bandwidth hypothesis?

Confusing things

2.32

Can we find evidence of the residual stream as shared bandwidth hypothesis? In particular, the idea that the model dedicates parameters to memory management and cleaning up memory once it's used. Are there neurons with high negative cosine sim (so the output erases the input feature)? Do they correspond to cleaning up specific features?
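A minimal sketch of the cosine-similarity check, assuming each neuron's input weights and output weights are available as vectors in the residual stream basis. The weight values and the -0.5 threshold are arbitrary illustrative choices.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def erasing_neurons(w_in_per_neuron, w_out_per_neuron, threshold=-0.5):
    """Indices of neurons whose output direction strongly opposes their input direction."""
    return [
        i
        for i, (w_in, w_out) in enumerate(zip(w_in_per_neuron, w_out_per_neuron))
        if cosine_similarity(w_in, w_out) < threshold
    ]


# Toy weights: neuron 0 writes the negative of what it reads, neuron 1 does not.
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[-2.0, 0.0], [0.0, 3.0]]
print(erasing_neurons(w_in, w_out))  # [0]
```

Neurons flagged this way are only candidates; checking which features they actually remove still requires looking at activations.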

Confusing things

2.33

What happens to the memory in an induction circuit? (See 2.32)

Studying larger models

2.34

GPT-J contains translation heads. Can you interpret how they work and what they do?

Studying larger models

2.35

Try to find and reverse engineer fancier induction heads like pattern matching heads - try GPT-J or GPT-NeoX.

Studying larger models

2.36

What's up with few-shot learning? How does it work?

Studying larger models

2.37

How does addition work? (Focus on 2-digit)

Studying larger models

2.38

What's up with Tim Dettmers' emergent features in the residual stream stuff? Do they map to anything interpretable? What if we do max activating dataset examples?

Beginner problems

3.1

Sorting fixed-length lists. (format - START 4 6 2 9 MID 2 4 6 9)

Code

Beginner problems

3.2

Sorting variable-length lists. (What's the sorting algorithm? What's the longest list you can get to? How does length affect accuracy?)
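A small data generator for this task might look like the sketch below. It reuses the START ... MID ... format from 3.1; the length range, value range, and the absence of an end-of-sequence token are assumptions made for illustration.

```python
import random


def make_sorting_example(rng, min_len=2, max_len=6, max_value=9):
    """Build one token sequence: START <unsorted values> MID <sorted values>."""
    length = rng.randint(min_len, max_len)
    values = [rng.randint(0, max_value) for _ in range(length)]
    return ["START", *map(str, values), "MID", *map(str, sorted(values))]


rng = random.Random(0)  # fixed seed so the dataset is reproducible
example = make_sorting_example(rng)
print(" ".join(example))
```

Varying `max_len` between training and evaluation is one way to probe how accuracy changes with list length.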

Beginner problems

3.3

Interpret a 2L MLP (one hidden layer) trained to do modular addition. (Analogous to Neel's grokking work)
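A minimal sketch of the dataset for this kind of experiment: enumerate every pair (a, b) with its label (a + b) mod p and hold out a random subset. The modulus 113 and the 30% training fraction are illustrative values, and the function name is hypothetical.

```python
import random


def modular_addition_dataset(p, train_fraction, seed=0):
    """Return (train, test) lists of (a, b, (a + b) % p) triples."""
    pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    cut = int(train_fraction * len(pairs))
    return pairs[:cut], pairs[cut:]


train, test = modular_addition_dataset(p=113, train_fraction=0.3)
print(len(train), len(test))  # 3830 8939
```

Swapping the label for `(a - b) % p` gives the corresponding modular subtraction dataset.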

Beginner problems

3.4

Interpret a 1L MLP trained to do modular subtraction (Analogous to Neel's grokking work)

Beginner problems

3.5

Taking the minimum or maximum of two ints

Code

Beginner problems

3.6

Permuting lists

Beginner problems

3.7

Calculating sequences with Fibonacci-style recurrence (predicting next element from the previous two)
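One way to generate training sequences for this task is sketched below. The optional modulus is an assumption added here to keep values inside a fixed vocabulary; it is not part of the problem statement.

```python
def recurrence_sequence(first, second, length, modulus=None):
    """Sequence where each element is the sum of the previous two.

    If modulus is given, every new element is reduced mod modulus so the
    values stay within a fixed token vocabulary (an illustrative choice).
    """
    seq = [first, second]
    while len(seq) < length:
        nxt = seq[-1] + seq[-2]
        seq.append(nxt % modulus if modulus else nxt)
    return seq[:length]


print(recurrence_sequence(1, 1, 8))              # [1, 1, 2, 3, 5, 8, 13, 21]
print(recurrence_sequence(1, 1, 8, modulus=10))  # [1, 1, 2, 3, 5, 8, 3, 1]
```

Randomising the two starting values per sequence forces the model to learn the recurrence rather than memorise one sequence.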

Harder problems

3.8

5-digit addition/subtraction.

Harder problems

3.9

Predicting the output of a simple code function, e.g., problems like "a = 1 2 3. a[2] = 4. a -> 1 2 4"

Code

Harder problems

3.10

Graph theory problems like this. Unsure of the correct input format. Try a bunch. See here

Harder problems

3.11

Train a model on multiple algorithmic tasks we understand (like modular addition and subtraction). Compare to a model trained on each task. Does it learn the same circuits? Is there superposition?

Joshua ; jhdhill@uwaterloo.ca ; jan 31 2024

Harder problems

3.12

Train models for automata tasks and interpret them. Do your results match the theory?

Harder problems

3.13

In-Context Linear Regression - the transformer gets a sequence (x_1, y_1, x_2, y_2, ...) where y_i = Ax_i + b. A and b are different for each prompt, and need to be learned in-context. (Code here)
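A minimal sketch of how one prompt could be generated, independent of the linked code: sample a fresh A and b per prompt, then interleave inputs and outputs. The Gaussian sampling, the dimension, and the number of points are assumptions chosen for illustration.

```python
import random


def make_regression_prompt(rng, n_points=4, dim=2):
    """Sample a fresh (A, b) and return (A, b, [x_1, y_1, x_2, y_2, ...])."""
    A = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    b = [rng.gauss(0, 1) for _ in range(dim)]
    sequence = []
    for _ in range(n_points):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        # y = A x + b, computed row by row
        y = [sum(A[i][j] * x[j] for j in range(dim)) + b[i] for i in range(dim)]
        sequence += [x, y]
    return A, b, sequence


A, b, sequence = make_regression_prompt(random.Random(0))
print(len(sequence))  # 8 vectors: four (x, y) pairs
```

Because A and b change with every prompt, the only way for a model to predict the final y is to infer them from the earlier pairs in the same sequence.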

Harder problems

3.14

Problems in In-Context Linear Regression that are in-context learned. See 3.13.

Harder problems

3.15

5 digit (or binary) multiplication

Harder problems

3.16

Predict repeated subsequences in randomly generated tokens, and see if you can find and reverse engineer induction heads.
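The data for this task can be as simple as a random token sequence concatenated with itself, as in the sketch below. The vocabulary size and length are arbitrary, and the helper names are hypothetical.

```python
import random


def repeated_random_tokens(rng, vocab_size, half_length):
    """Random token ids followed by an exact copy of themselves."""
    first_half = [rng.randrange(vocab_size) for _ in range(half_length)]
    return first_half + first_half


def induction_targets(tokens, half_length):
    """(position, next_token) pairs in the second half.

    At these positions the current token has appeared before, and the next
    token equals whatever followed that earlier occurrence.
    """
    return [(pos, tokens[pos + 1]) for pos in range(half_length, len(tokens) - 1)]


rng = random.Random(0)
tokens = repeated_random_tokens(rng, vocab_size=50, half_length=5)
targets = induction_targets(tokens, half_length=5)
```

Loss on the target positions should be near random for a model without induction heads and much lower for a model with them, which makes these positions a convenient place to start ablating heads.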

Harder problems

3.17

Choose your own adventure! Find your own algorithmic problem. Leetcode easy is probably a good source.

3.18

Build a toy model of Indirect Object Identification - train a tiny attention-only model on an algorithmic task simulating IOI - and reverse-engineer the learned solution. Compare it to the circuit found in GPT-2 Small.

3.19

Is 3.18 consistent across random seeds, or can other algorithms be learned? Can a 2L model learn this? What happens if you add more MLPs or more layers?

3.20

Reverse-engineer Othello-GPT. Can you reverse-engineer the algorithms it learns, or the features the probes find?

Questions about language models

3.21

Train a 1L attention-only transformer with rotary to predict the previous token and reverse engineer how it does this.

5/7/23: Eric (repo: https://github.com/DKdekes/rotary-interp)

Questions about language models

3.22

Train a 3L attention-only transformer to perform the Indirect Object Identification task. Can it do the task? Does it learn the same circuit found in GPT-2 Small?

Questions about language models

3.23

Redo Neel's modular addition analysis with GELU. Does it change things?

Questions about language models

3.24

How does memorisation work? Try training a one hidden layer MLP to memorise random data, or training a transformer on a fixed set of random strings of tokens.

Questions about language models

3.25

Compare different dimensionality reduction techniques on modular addition or a problem you feel you understand.

Questions about language models

3.26

In modular addition, look at what different dimensionality reduction techniques do on different weight matrices. Can you identify which weights matter most? Which neurons form clusters for each frequency? Anything from activations?

Questions about language models

3.27

Is direct logit attribution always useful? Can you find examples where it's highly misleading?

Deep learning mysteries

3.28

Explore the Lottery Ticket Hypothesis

Deep learning mysteries

3.29

Explore Deep Double Descent

Extending Othello-GPT

3.30

Try one of Neel's concrete Othello-GPT projects.

Extending Othello-GPT

3.31

Looking for modular circuits - try to find the circuits used to compute the world model and to use the world model to compute the next move. Try to understand each in isolation and use this to understand how they fit together. See what you can learn about finding modular circuits in general.

Conceptual post

Extending Othello-GPT

3.32

Neuron Interpretability and Studying Superposition - try to understand the model's MLP neurons, and explore what techniques do and don't work. Try to build our understanding of transformer MLPs in general.

Extending Othello-GPT

3.33

Transformer Circuits Laboratory - Explore and test other conjectures about transformer circuits - e.g, can we figure out how the model manages memory in the residual stream?