Open Problems in Mechanistic Interpretability

The purpose of this doc is to let people quickly browse through the problems in Neel Nanda’s sequence.


There are filters for difficulty, existing work (blank/not blank), help wanted, currently working on, etc. There are also individual pages, accessible from the top/side menu of this doc, that let you browse a specific difficulty level, an individual category of problems, or a card view of this page.
The “Existing Work” column is for completed posts, papers, or other documents. The “Currently Working On” column is for drafts, brainstorms, and people who want to work on a problem but haven’t produced anything yet. The “Help Wanted” column is for people who’re working on a problem but would like additional collaborators or mentors. Please specify what help you’re looking for if you add to that column. Please add a date if you’re currently working on something so it’s clear if you expressed interest yesterday or two years ago. To add an entry, write it up as a comment and I'll approve it fairly promptly - people will still see the comment until then.
This sequence is long! That means not all relevant information is contained in this spreadsheet. The original posts have lots of great context for the sequence as a whole and for each section, including motivation and useful resources. Some problems are copied word-for-word - many are not.
If you are interested in a problem, please take a look at the problem in the original post before deciding to tackle it! Often it includes relevant context or links that didn’t make it into the spreadsheet for space reasons. I’d also recommend looking at the first part of the relevant post, before seriously tackling one of its problems.
Huge thanks to Neel Nanda for his work in creating this sequence and building the field.
Open Problems
Category
Subcategory
Difficulty
Number
Problem
Existing Work
Currently working
Help Wanted?
1
Toy Language Models
Understanding neurons
B
1.1
How far can you get deeply reverse engineering a neuron in a 1L model? 1L is particularly easy since each neuron's output adds directly to the logits.
2
Toy Language Models
Understanding neurons
B
1.2
Find an interesting neuron you think represents a feature. Can you fully reverse engineer which direction should activate that feature, and compare it to the neuron's input direction?
3
Toy Language Models
Understanding neurons
B
1.3
Look for trigram neurons in a 1L model and try to reverse engineer them (e.g., "ice cream -> sundae").
4
Toy Language Models
Understanding neurons
B
1.4
Check out the SoLU paper for more ideas on 1L neurons to find and reverse engineer.
5
Toy Language Models
Understanding neurons
C
1.5
How far can you get deeply reverse engineering a neuron in a 2+ layer model?
6
Toy Language Models
Understanding neurons
A
1.6
Hunt through Neuroscope for the toy models and look for interesting neurons to focus on.
7
Toy Language Models
Understanding neurons
A
1.7
Can you find any polysemantic neurons in Neuroscope? Explore this.
8
Toy Language Models
Understanding neurons
B
1.8
Are there neurons whose behaviour can be matched by a regex or other code? If so, run it on a ton of text and compare the output.
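One way to make the comparison in 1.8 concrete is to treat the regex as a binary classifier for "the neuron fires on this token" and score it with precision and recall. The sketch below assumes you have already extracted one activation per token from your model; the function name, the example tokens, the activations, and the firing threshold are all made up for illustration.

```python
import re

def regex_vs_neuron(tokens, activations, pattern, threshold):
    """Score how well a regex predicts where a neuron fires.

    tokens      -- list of token strings (hypothetical example data)
    activations -- one neuron activation per token, same length as tokens
    pattern     -- regex hypothesised to describe what the neuron detects
    threshold   -- activation above which we count the neuron as firing
                   (an arbitrary choice, tune it per neuron)
    """
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    compiled = re.compile(pattern)
    predicted = [compiled.fullmatch(tok) is not None for tok in tokens]
    fired = [act > threshold for act in activations]

    true_pos = sum(p and f for p, f in zip(predicted, fired))
    false_pos = sum(p and not f for p, f in zip(predicted, fired))
    false_neg = sum(f and not p for p, f in zip(predicted, fired))

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return {"precision": precision, "recall": recall}

# Made-up data: a neuron that seems to fire on short numeric tokens.
tokens = [" the", " 42", " cat", " 7", " 1999", " sat"]
activations = [0.0, 2.1, 0.1, 1.8, 0.2, 0.0]
stats = regex_vs_neuron(tokens, activations, r"\s?\d+", threshold=1.0)
print(stats)
```

Low precision suggests the regex is too broad (here it matches " 1999", where the toy neuron stays quiet), while low recall suggests the neuron also responds to something the regex misses.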
9
Toy Language Models
How do larger models differ?
B
1.9
How do 3-layer and 4-layer attention-only models differ from 2L? (For instance, induction heads only appeared with 2L. Can you find something useful that only appears at 3L or higher?)
10
Toy Language Models
How do larger models differ?
B
1.10
How do 3-layer and 4-layer attention-only models differ from 2L? Look for composition scores - try to identify pairs of heads that compose a lot.
11
Toy Language Models
How do larger models differ?
B
1.11
How do 3-layer and 4-layer attention-only models differ from 2L? Look for evidence of composition.
12
Toy Language Models
How do larger models differ?
B
1.12
How do 3-layer and 4-layer attention-only models differ from 2L? Ablate a single head and run the model on a lot of text. Look at the change in performance. Do any heads matter a lot that aren't induction heads?
13
Toy Language Models
How do larger models differ?
B
1.13
Look for tasks that an n-layer model can't do, but an n+1-layer model can, and look for a circuit that explains this. (Start by running both models on a bunch of text and look for per-token probability differences)
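The first step suggested in 1.13 can be sketched as a simple ranking of tokens by how much more log probability the larger model assigns to the correct token. This assumes you have already run both models over the same text and collected per-token log probs; the function name and the numbers below are invented for illustration.

```python
def largest_logprob_gaps(tokens, logprobs_small, logprobs_large, top_k=3):
    """Return the top_k (gap, position, token) triples, largest gap first.

    gap = log prob under the larger model minus log prob under the smaller
    model, so a big positive gap marks a token the larger model predicts
    much better.
    """
    if not (len(tokens) == len(logprobs_small) == len(logprobs_large)):
        raise ValueError("all inputs must have the same length")
    gaps = [
        (large - small, pos, tok)
        for pos, (tok, small, large) in enumerate(
            zip(tokens, logprobs_small, logprobs_large)
        )
    ]
    gaps.sort(reverse=True)
    return gaps[:top_k]

# Made-up per-token log probs of the correct token under each model.
tokens = ["The", " cat", " sat", " on", " the", " mat"]
logprobs_small = [-3.0, -5.0, -4.0, -1.0, -0.5, -6.0]
logprobs_large = [-2.9, -2.0, -3.8, -0.9, -0.5, -2.5]
top = largest_logprob_gaps(tokens, logprobs_small, logprobs_large, top_k=2)
for gap, pos, tok in top:
    print(f"pos {pos} {tok!r}: gap {gap:.2f}")
```

The tokens with the largest gaps are candidates for behaviours that only the deeper model implements, which is where to start looking for a circuit.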
14
Toy Language Models
How do larger models differ?
B
1.14
How do 1L SoLU/GELU models differ from 1L attention-only?
15
Toy Language Models
How do larger models differ?
B
1.15
How do 2L SoLU models differ from 1L?
16
Toy Language Models
How do larger models differ?
B
1.16
How does 1L GELU differ from 1L SoLU?
17
Toy Language Models
How do larger models differ?
B
1.17
Analyse how a larger model "fixes the bugs" of a smaller model.
18
Toy Language Models
How do larger models differ?
B
1.18
Does a 1L MLP transformer fix the skip trigram bugs of a 1L Attn Only model? If so, how?
19
Toy Language Models
How do larger models differ?
B
1.19
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Try looking at split-token induction, where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens. E.g " Claire" vs. "Cl" "aire"
20
Toy Language Models
How do larger models differ?
B
1.20
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at misfiring when the previous token appears multiple times with different following tokens
21
Toy Language Models
How do larger models differ?
B
1.21
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at stopping induction on a token that likely shows the end of a repeated string (e.g, . or ! or ")
22
Toy Language Models
How do larger models differ?
B
1.22
Does a 2L MLP model fix these bugs (1.19 - 1.21) too?
23
Toy Language Models
A
1.23
Choose your own adventure: Take a bunch of text with interesting patterns and run the models over it. Look for tokens they do really well on and try to reverse engineer what's going on!
24
Circuits In The Wild
Circuits in natural language
B
2.1
Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?
25
Circuits In The Wild
Circuits in natural language
B
2.2
Continuing sequences that are common in natural language (e.g., "1 2 3 4" -> "5", "Monday\nTuesday\n" -> "Wednesday")
I did some preliminary work on this during a hackathon this July and found components shared between sequence continuation tasks, such as head 9.1, that were found to output the “next member” of a circuit. The work was rushed and crude, but I am looking to polish and continue it in the future. A link to it can be found here:
Pablo Hansen - April 18, 2024
26
Circuits In The Wild
Circuits in natural language
B
2.3
A harder example would be numbers at the start of lines, like "1. Blah blah blah \n2. Blah blah blah\n"-> "3". Feels like it must be doing something induction-y!
27
Circuits In The Wild
Circuits in natural language
B
2.4
3 letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> RFU
28
Circuits In The Wild
Circuits in natural language
B
2.5
Converting names to emails, like "Katy Johnson <" -> "katy_johnson"
29
Circuits In The Wild
Circuits in natural language
C
2.6
A harder version of 2.5 is constructing an email from a snippet, like Name: Jess Smith, Email: last name dot first name k @ gmail
30
Circuits In The Wild
Circuits in natural language
C
2.7
Interpret factual recall. Start with ROME's work with causal tracing, but how much more specific can you get? Heads? Neurons?
31
Circuits In The Wild
Circuits in natural language
B
2.8
Learning that words after full stops start with capital letters.
32
Circuits In The Wild
Circuits in natural language
B
2.9
Counting objects described in text. (E.g, I picked up an apple, a pear, and an orange. I was holding three fruits.)
33
Circuits In The Wild
Circuits in natural language
C
2.10
Interpreting memorisation. Sometimes GPT knows phone numbers. How?
34
Circuits In The Wild
Circuits in natural language
B
2.11
Reverse engineer an induction head in a non-toy model.
35
Circuits In The Wild
Circuits in natural language
B
2.12
Choosing the right pronouns (E.g, "Lina is a great friend, isn't")
Alana Xiang - 5 May 2023
36
Circuits In The Wild
Circuits in natural language
A
2.13
Choose your own adventure! Try finding behaviours of your own related to natural language circuits.
37
Circuits In The Wild
Circuits in code models
B
2.14
Closing brackets. Bonus: Tracking correct brackets - [, (, {, etc.
38
Circuits In The Wild
Circuits in code models
B
2.15
Closing HTML tags
39
Circuits In The Wild
Circuits in code models
C
2.16
Methods depend on object type (e.g., x.append for a list, x.update for a dictionary)
40
Circuits In The Wild
Circuits in code models
A
2.17
Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic flavored tasks should be easiest.
41
Circuits In The Wild
Extensions to IOI paper
A
2.18
Understand IOI in the Stanford mistral models. Does the same circuit arise? (You should be able to near exactly copy Redwood's code for this)
42
Circuits In The Wild
Extensions to IOI paper
A
2.19
Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?
43
Circuits In The Wild
Extensions to IOI paper
B
2.20
Is there a general pattern for backup-ness? (Follows 2.19)
Manan Suri - 14 July, 2023
44
Circuits In The Wild
Extensions to IOI paper
A
2.21
Can we reverse engineer how duplicate token heads work deeply? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates since the current token is always a copy of itself?
45
Circuits In The Wild
Extensions to IOI paper
B
2.22
Understand IOI in GPT-Neo. Same size but seems to do IOI via MLP composition.
46
Circuits In The Wild
Extensions to IOI paper
C
2.23
What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?
47
Circuits In The Wild
Extensions to IOI paper
C
2.24
Under what conditions do the compensation mechanisms occur, where ablating a name mover doesn't reduce performance much? Is it due to dropout?
48
Circuits In The Wild
Extensions to IOI paper
B
2.25
GPT-Neo wasn't trained with dropout - check 2.24 on this.
49
Circuits In The Wild
Extensions to IOI paper
B
2.26
Reverse engineering L4H11, a really sharp previous token head in GPT-2-small, at the parameter level.
50
Circuits In The Wild
Extensions to IOI paper
C
2.27
MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?
51
Circuits In The Wild
Extensions to IOI paper
C
2.28
Understanding what's happening in the adversarial examples, most notably the S-Inhibition Head attention pattern (hard)
52
Circuits In The Wild
Confusing things
B
2.29
Why do models have so many induction heads? How do they specialise, and why does the model need so many?
53
Circuits In The Wild
Confusing things
B
2.30
Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?
54
Circuits In The Wild
Confusing things
B
2.31
Can we find evidence of the residual stream as shared bandwidth hypothesis?
55
Circuits In The Wild
Confusing things
B
2.32
Can we find evidence of the residual stream as shared bandwidth hypothesis? In particular, the idea that the model dedicates parameters to memory management and cleaning up memory once it's used. Are there neurons with high negative cosine sim (so the output erases the input feature)? Do they correspond to cleaning up specific features?
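A first-pass check for 2.32 is to compare, for each MLP neuron, the residual stream direction it reads from with the direction it writes to. The sketch below is a minimal illustration in plain Python. It assumes the weights have been laid out as one read vector and one write vector per neuron (both of size d_model); the tiny weight lists and the -0.5 cutoff are made up and are not taken from any real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def find_erasing_neurons(w_in, w_out, cutoff=-0.5):
    """Return (neuron index, cosine sim) for neurons whose write direction
    points strongly against their read direction.

    w_in[i]  -- residual stream direction neuron i reads from (assumed layout)
    w_out[i] -- residual stream direction neuron i writes to (assumed layout)
    cutoff   -- arbitrary threshold for "high negative" similarity
    """
    results = []
    for idx, (read_dir, write_dir) in enumerate(zip(w_in, w_out)):
        sim = cosine_similarity(read_dir, write_dir)
        if sim < cutoff:
            results.append((idx, sim))
    return results

# Made-up weights for three neurons in a 3-dimensional residual stream.
w_in = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
w_out = [[-2.0, 0.1, 0.0], [0.0, 0.0, 3.0], [0.4, 0.6, 0.0]]
erasing = find_erasing_neurons(w_in, w_out)
print(erasing)
```

Neurons flagged this way are only candidates: checking what feature the read direction corresponds to, and whether it is actually removed on real text, is the interesting part of the problem.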
56
Circuits In The Wild
Confusing things
B
2.33
What happens to the memory in an induction circuit? (See 2.32)
57
Circuits In The Wild
Studying larger models
C
2.34
GPT-J contains translation heads. Can you interpret how they work and what they do?
58
Circuits In The Wild
Studying larger models
C
2.35
Try to find and reverse engineer fancier induction heads like pattern matching heads - try GPT-J or GPT-NeoX.
59
Circuits In The Wild
Studying larger models
C
2.36
What's up with few-shot learning? How does it work?
60
Circuits In The Wild
Studying larger models
C
2.37
How does addition work? (Focus on 2-digit)
61
Circuits In The Wild
Studying larger models
C
2.38
What's up with Tim Dettmers' emergent features in the residual stream stuff? Do they map to anything interpretable? What if we do max activating dataset examples?
62
Interpreting Algorithmic Problems
Beginner problems
A
3.1
Sorting fixed-length lists. (format - START 4 6 2 9 MID 2 4 6 9)
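The prompt format in 3.1 is easy to generate. The sketch below produces one training example as a list of string tokens; the list length, the vocabulary size, and the function name are assumptions chosen for illustration, since the problem statement only fixes the START/MID layout.

```python
import random

def make_sorting_example(list_len=4, vocab_size=10, rng=None):
    """Build one example in the form START <unsorted> MID <sorted>.

    list_len and vocab_size are illustrative defaults, not part of the
    original problem statement.
    """
    if rng is None:
        rng = random.Random()
    values = [rng.randrange(vocab_size) for _ in range(list_len)]
    unsorted_part = [str(v) for v in values]
    sorted_part = [str(v) for v in sorted(values)]
    return ["START"] + unsorted_part + ["MID"] + sorted_part

rng = random.Random(0)  # fixed seed so the example is reproducible
example = make_sorting_example(rng=rng)
print(" ".join(example))
```

When training, the loss would normally only be taken on the tokens after MID, since the unsorted half is random and cannot be predicted.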
63
Interpreting Algorithmic Problems
Beginner problems
A
3.2
Sorting variable-length lists. (What's the sorting algorithm? What's the longest list you can get to? How does length affect accuracy?)
64
Interpreting Algorithmic Problems
Beginner problems
A
3.3
Interpret a 2L MLP (one hidden layer) trained to do modular addition. (Analogous to Neel's grokking work)
65
Interpreting Algorithmic Problems
Beginner problems
A
3.4
Interpret a 1L MLP trained to do modular subtraction (Analogous to Neel's grokking work)
66
Interpreting Algorithmic Problems
Beginner problems
A
3.5
Taking the minimum or maximum of two ints
67
Interpreting Algorithmic Problems
Beginner problems
A
3.6
Permuting lists
68
Interpreting Algorithmic Problems
Beginner problems
A
3.7
Calculating sequences with Fibonacci-style recurrence (predicting next element from the previous two)
69
Interpreting Algorithmic Problems
Harder problems
B
3.8
5-digit addition/subtraction.
70
Interpreting Algorithmic Problems
Harder problems
B
3.9
Predicting the output of a simple code function. E.g., problems like "a = 1 2 3. a[2] = 4. a -> 1 2 4"
71
Interpreting Algorithmic Problems
Harder problems
B
3.10
Graph theory problems like this. Unsure of the correct input format. Try a bunch. See here
72
Interpreting Algorithmic Problems
Harder problems
B
3.11
Train a model on multiple algorithmic tasks we understand (like modular addition and subtraction). Compare to a model trained on each task. Does it learn the same circuits? Is there superposition?
Joshua; jhdhill@uwaterloo.ca; Jan 31, 2024
73
Interpreting Algorithmic Problems
Harder problems
B
3.12
Train models for automata tasks and interpret them. Do your results match the theory?
74
Interpreting Algorithmic Problems
Harder problems
B
3.13
In-Context Linear Regression - the transformer gets a sequence (x_1, y_1, x_2, y_2, ...) where y_i = Ax_i + b. A and b are different for each prompt, and need to be learned in-context. (Code here)
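To make the setup in 3.13 concrete, here is a minimal sketch of how one prompt could be sampled. This is not the code linked from the original problem; the Gaussian sampling, the square A (output dimension equal to input dimension), and the function name are all assumptions made for the example.

```python
import random

def make_regression_prompt(n_points=5, dim=2, rng=None):
    """Sample a fresh (A, b) and return it with the sequence
    [x_1, y_1, x_2, y_2, ...] where y_i = A x_i + b.

    A is dim x dim and everything is drawn from a standard normal;
    both choices are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    A = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    b = [rng.gauss(0, 1) for _ in range(dim)]
    sequence = []
    for _ in range(n_points):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        # Matrix-vector product plus bias, written out in plain Python.
        y = [sum(A[r][c] * x[c] for c in range(dim)) + b[r] for r in range(dim)]
        sequence.extend([x, y])
    return A, b, sequence

A, b, sequence = make_regression_prompt(n_points=3, dim=2, rng=random.Random(0))
print(len(sequence), "vectors in the prompt")
```

Because A and b are resampled for every prompt, the model cannot memorise them in its weights and has to infer them from the earlier (x, y) pairs in context.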
75
Interpreting Algorithmic Problems
Harder problems
C
3.14
Problems in In-Context Linear Regression that are in-context learned. See 3.13.
76
Interpreting Algorithmic Problems
Harder problems
C
3.15
5 digit (or binary) multiplication
77
Interpreting Algorithmic Problems
Harder problems
B
3.16
Predict repeated subsequences in randomly generated tokens, and see if you can find and reverse engineer induction heads.
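The data for 3.16 can be as simple as a random block of tokens repeated back to back. The sketch below generates such a sequence and lists the positions where an induction mechanism could, in principle, predict the next token. The block length, vocabulary size, and function names are assumptions chosen for illustration.

```python
import random

def make_repeated_sequence(block_len=8, vocab_size=50, n_repeats=2, rng=None):
    """Return a random block of token ids repeated n_repeats times."""
    if rng is None:
        rng = random.Random()
    block = [rng.randrange(vocab_size) for _ in range(block_len)]
    return block * n_repeats

def induction_targets(tokens, block_len):
    """Return (position, next_token) pairs inside the repeats.

    From position block_len onwards, the current token has already been
    seen block_len positions earlier, so the token that followed it then
    is the correct prediction now.
    """
    return [(pos, tokens[pos + 1]) for pos in range(block_len, len(tokens) - 1)]

tokens = make_repeated_sequence(block_len=4, vocab_size=50, rng=random.Random(0))
targets = induction_targets(tokens, block_len=4)
print(tokens)
print(targets)
```

Comparing loss on these target positions with loss on the first block (where the tokens are unpredictable) gives a quick signal for whether the model has learned induction at all.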
78
Interpreting Algorithmic Problems
Harder problems
C
3.17
Choose your own adventure! Find your own algorithmic problem. Leetcode easy is probably a good source.
79
Interpreting Algorithmic Problems
B
3.18
Build a toy model of Indirect Object Identification - train a tiny attention-only model on an algorithmic task simulating IOI - and reverse-engineer the learned solution. Compare it to the circuit found in GPT-2 Small.
80
Interpreting Algorithmic Problems
C
3.19
Is 3.18 consistent across random seeds, or can other algorithms be learned? Can a 2L model learn this? What happens if you add more MLPs or more layers?
81
Interpreting Algorithmic Problems
C
3.20
Reverse-engineer Othello-GPT. Can you reverse-engineer the algorithms it learns, or the features the probes find?
82
Interpreting Algorithmic Problems
Questions about language models
A
3.21
Train a 1L attention-only transformer with rotary to predict the previous token and reverse engineer how it does this.
5/7/23: Eric (repo: https://github.com/DKdekes/rotary-interp)
83
Interpreting Algorithmic Problems
Questions about language models
B
3.22
Train a 3L attention-only transformer to perform the Indirect Object Identification task. Can it do the task? Does it learn the same circuit found in GPT-2 Small?
84
Interpreting Algorithmic Problems
Questions about language models
B
3.23
Redo Neel's modular addition analysis with GELU. Does it change things?
85
Interpreting Algorithmic Problems
Questions about language models
C
3.24
How does memorisation work? Try training a one hidden layer MLP to memorise random data, or training a transformer on a fixed set of random strings of tokens.
86
Interpreting Algorithmic Problems
Questions about language models
C
3.25
Compare different dimensionality reduction techniques on modular addition or a problem you feel you understand.
87
Interpreting Algorithmic Problems
Questions about language models
B
3.26
In modular addition, look at what different dimensionality reduction techniques do on different weight matrices. Can you identify which weights matter most? Which neurons form clusters for each frequency? Anything from activations?
88
Interpreting Algorithmic Problems
Questions about language models
C
3.27
Is direct logit attribution always useful? Can you find examples where it's highly misleading?
89
Interpreting Algorithmic Problems
Deep learning mysteries
D
3.28
Explore the Lottery Ticket Hypothesis
90
Interpreting Algorithmic Problems
Deep learning mysteries
D
3.29
Explore Deep Double Descent
91
Interpreting Algorithmic Problems
Extending Othello-GPT
A
3.30
Try one of Neel's concrete Othello-GPT projects.
92
Interpreting Algorithmic Problems
Extending Othello-GPT
C
3.31
Looking for modular circuits - try to find the circuits used to compute the world model and to use the world model to compute the next move. Try to understand each in isolation and use this to understand how they fit together. See what you can learn about finding modular circuits in general.
93
Interpreting Algorithmic Problems
Extending Othello-GPT
B
3.32
Neuron Interpretability and Studying Superposition - try to understand the model's MLP neurons, and explore what techniques do and don't work. Try to build our understanding of transformer MLPs in general.
94
Interpreting Algorithmic Problems
Extending Othello-GPT
C
3.33
Transformer Circuits Laboratory - Explore and test other conjectures about transformer circuits - e.g, can we figure out how the model manages memory in the residual stream?
95
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
A
4.1
Does dropout create a privileged basis? Put dropout on the hidden layer of the ReLU output model and study how this changes the results.
Post
14 April 2023: Kunvar (firstuserhere)
96
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
B
4.2
Replicate their absolute value model and study some of the variants of the ReLU output models.
May 4, 2023 - Kunvar (firstuserhere)
97
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
B
4.3
Explore neuron superposition by training their absolute value model on a more complex function like x -> x^2.
98
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
B
4.4
What happens to their ReLU output model when there's non-uniform sparsity? E.g, one class of less sparse features and another of very sparse
99
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
A
4.5
Explore neuron superposition by training their absolute value model on functions of multiple variables. Make inputs binary (0/1) and look at the AND and OR of element pairs.
100
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
B
4.6
Explore neuron superposition by training their absolute value model on functions of multiple variables. Keep the inputs as uniform reals in [0, 1] and look at max(x, y)
101
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
A
4.7
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features 1 when present (i.e., two possible values, 0 or 1)
102
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
B
4.8
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features discrete (1, 2, 3)
103
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
B
4.9
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features uniform [0.5, 1]
April 30, 2023; Kunvar(firstuserhere)
104
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
A
4.10
What happens if you replace ReLUs with GELUs in the toy models?
May 1, 2023 - Kunvar (firstuserhere)
105
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
C
4.11
Can you find a toy model where GELU acts significantly differently from ReLU?
May 1, 2023 - Kunvar (firstuserhere)
106
Exploring Polysemanticity and Superposition
Building toy models of superposition
C
4.12
Build a toy model of a classification problem with cross-entropy loss
November 10, 2023 - Lucas Hayne
107
Exploring Polysemanticity and Superposition
Building toy models of superposition
C
4.13
Build a toy model of neuron superposition that has many more hidden features than output features
108
Exploring Polysemanticity and Superposition
Building toy models of superposition
C
4.14
Build a toy model that needs multiple hidden layers of ReLUs. Can computation in superposition happen across several layers? E.g., max(|x|, |y|)
109
Exploring Polysemanticity and Superposition
Building toy models of superposition
C
4.15
Build a toy model of attention head superposition/polysemanticity. Can you find a task where the model wants to do different things with an attention head on different inputs? How does it represent things internally / deal with interference?
110
Exploring Polysemanticity and Superposition
Building toy models of superposition
D
4.16
Build a toy model where the model needs to deal with simultaneous interference, and try to understand how it does so, or whether it can.
111
Exploring Polysemanticity and Superposition
Making toy model counterexamples
C
4.17
Make toy models that are counterexamples in MI. A learned example of a network with a non-linear representation.
112
Exploring Polysemanticity and Superposition
Making toy model counterexamples
C
4.18
Make toy models that are counterexamples in MI. A network without a discrete number of features.
113
Exploring Polysemanticity and Superposition
Making toy model counterexamples
C
4.19
Make toy models that are counterexamples in MI. A non-decomposable neural network.
114
Exploring Polysemanticity and Superposition
Making toy model counterexamples
C
4.20
Make toy models that are counterexamples in MI. A task where networks can learn multiple different sets of features.
115
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
B
4.21
Induction heads copy the token they attend to into the output, which involves storing which of 50,000 tokens it is. How are these stored in a 64-dimensional space?
116
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
B
4.22
How does the previous token head in an induction circuit communicate the value of the previous token to the key of the induction head? Bonus: What residual stream subspace does it take up? Is there interference?
117
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
B
4.23
How does the IOI circuit communicate names/positions between composing heads?
118
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
B
4.24
Are there dedicated dimensions for positional embeddings? Do any other components write to those dimensions?
119
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
A
4.25
Can you find any examples of the geometric superposition configurations in the residual stream of a language model?
120
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
C
4.26
Can you find any examples of locally almost-orthogonal bases?
121
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
C
4.27
Do language models have "genre" directions that detect the type of text, and then represent features specific to each genre in the same subspace?
122
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
D
4.28
Can you find examples of a model learning to deal with simultaneous interference?
123
Exploring Polysemanticity and Superposition
Studying neuron superposition in real models
B
4.29
Look at a polysemantic neuron in a 1L language model. Can you figure out how the model disambiguates what feature it is?
124
Exploring Polysemanticity and Superposition
Studying neuron superposition in real models
C
4.30
Look at a polysemantic neuron in a 2L language model. Can you figure out how the model disambiguates what feature it is?
125
Exploring Polysemanticity and Superposition
Studying neuron superposition in real models
B
4.31
Take a feature that's part of a polysemantic neuron in a 1L language model and try to identify every neuron that represents that feature. Is it sparse or diffuse?
126
Exploring Polysemanticity and Superposition
Studying neuron superposition in real models
C
4.32
Try to fully reverse engineer a feature discovered in 4.31.
127
Exploring Polysemanticity and Superposition
Studying neuron superposition in real models
C
4.33
Can you use superposition to create an adversarial example for a neuron?
128
Exploring Polysemanticity and Superposition
Studying neuron superposition in real models
C
4.34
Can you find any examples of the asymmetric superposition motif in the MLP of a 1-2 layer language model?
129
Exploring Polysemanticity and Superposition
C
4.35
Pick a simple feature of language (e.g, is number, is base64) and train a linear probe to detect that in the MLP activations of a 1L language model.
130
Exploring Polysemanticity and Superposition
D
4.36
Look for features in Neuroscope that seem to be represented by various neurons in a 1-2 layer language model. Train probes to detect some of them. Compare probe performance vs. neuron performance.
131
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
A
4.37
How do TransformerLens SoLU / GELU models compare in Neuroscope under the SoLU polysemanticity metric? (What fraction of neurons seem monosemantic?)
132
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
B
4.38
Can you find any better metrics for polysemanticity?
133
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
B
4.39
The paper speculates LayerNorm lets the model "smuggle through" superposition in SoLU models by smearing features across many dimensions and letting LayerNorm scale it up. Can you find evidence of this?
134
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
B
4.40
How similar are the neurons between SoLU/GELU models of the same layers?
135
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
C
4.41
How do GELU and ReLU compare on polysemanticity? Replicate the SoLU analysis.
136
Exploring Polysemanticity and Superposition
Getting rid of superposition
C
4.42
If you train a 1L/2L language model with d_mlp = 100 * d_model, does superposition go away?
137
Exploring Polysemanticity and Superposition
Getting rid of superposition
C
4.43
Study the T5 XXL. It's 11B params and not supported by TransformerLens. Expect major infrastructure pain.
138
Exploring Polysemanticity and Superposition
Getting rid of superposition
D
4.44
Can you take a trained model, freeze all weights except an MLP layer, x10 that layer's width, copy each neuron 10 times, add noise, and fine-tune? Does this remove superposition / add new features?
139
Exploring Polysemanticity and Superposition
Getting rid of superposition
C
4.45
Pick an open problem at the end of Toy Models of Superposition.
140
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
B
5.1
Understanding why 5 digit addition has a phase change per digit (so 6 total?!)
141
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
C
5.2
Why do 5-digit addition phase changes happen in that order?
142
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
B
5.3
Look at the PCA of logits on the full dataset, or the PCA of a stack of flattened weights. If you plot a scatter plot of the first 2 components, the different phases of training are clearly visible. What's up with this?
143
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
C
5.4
Can we predict when grokking will happen? Bonus: Without using any future information?
144
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
C
5.5
Understanding why the model chooses specific frequencies (and why it switches mid-training sometimes!)
145
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
B
5.6
What happens if we include in the loss one of the progress measures in Neel's grokking post? Can we accelerate or stop grokking?
146
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
B
5.7
Adam Jermyn provides an analytical argument and some toy models for why phase transition should be an inherent part of (some of) how models learn. Can you find evidence of this in more complex models?
147
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
B
5.8
Build on and refine Adam Jermyn's arguments and toy models - think about how they deviate from a real transformer, and build more faithful models.
148
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
B
5.9
For a toy model trained to form induction heads, is there a lottery-ticket style thing going on? Can you disrupt induction head formation by messing with the initialisation?
149
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
C
5.10
All Neel's toy models (attn-only, gelu, solu) were trained with the same data shuffle and weight initialisation. Many induction heads aren't shared, but L2H3 in 3L and L1H6 in 2L always are. What's up with that?
150
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
B
5.11
Take the parameters that form important circuits by the end of training on some toy task, and knock them out at the start of training instead. How much does that delay/stop generalisation?
151
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
B
5.12
Analysing how pairs of heads in an induction circuit compose over time - Can you find progress measures which predict these?
152
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
B
5.13
Analysing how pairs of heads in an induction circuit compose over time - Can we predict which heads will learn to compose first?
153
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
B
5.14
Analysing how pairs of heads in an induction circuit compose over time - Does the composition develop as a phase transition?
154
Analysing Training Dynamics
Understanding fine-tuning
C
5.15
Build a toy model of fine-tuning (train on task 1, fine-tune on task 2). What is going on internally? Any interesting motifs?
155
Analysing Training Dynamics
Understanding fine-tuning
A
5.16
How does model performance change on the original training distribution when finetuning?
156
Analysing Training Dynamics
Understanding fine-tuning
B
5.17
How is the model different on fine-tuned text? Look at examples where the model does much better after fine-tuning, and some normal text.
157
Analysing Training Dynamics
Understanding fine-tuning
B
5.18
Try activation patching between the old and fine-tuned model and see how hard recovering performance is.
158
Analysing Training Dynamics
Understanding fine-tuning
B
5.19
Look at max activating text for various neurons in the original models. How has it changed post fine-tuning?
159
Analysing Training Dynamics
Understanding fine-tuning
B
5.20
Explore further and see what's going on with fine-tuning mechanistically.
160
Analysing Training Dynamics
Understanding fine-tuning
C
5.21
Can you find any phase transitions in the fine-tuning checkpoints?
161
Analysing Training Dynamics
Understanding training dynamics in language models
B
5.22
Can you replicate the induction head phase transition results in the various checkpointed models in TransformerLens? (If code works for attn-only-2l it should work for them all)
162
Analysing Training Dynamics
Understanding training dynamics in language models
B
5.23
Look at the neurons in TransformerLens SoLU models during training. Do they tend to form as a phase transition?
163
Analysing Training Dynamics
Understanding training dynamics in language models
C
5.24
Use the per-token loss analysis technique from the induction heads paper to look for more phase changes.
164
Analysing Training Dynamics
Understanding training dynamics in language models
A
5.25
Look at attention heads on various texts and see if any have recognisable attention patterns, then analyse them over training.
165
Analysing Training Dynamics
Finding phase transitions
A
5.26
Look for phase transitions in the Indirect Object Identification task. (Note: This might not have a phase change)
166
Analysing Training Dynamics
Finding phase transitions
B
5.27
Try digging into the specific heads that act on IOI and look for phase transitions. Use direct logit attribution for the name movers.
167
Analysing Training Dynamics
Finding phase transitions
B