Card View for COP (Concrete Open Problems in Mechanistic Interpretability)

Toy Language Models: 23
Circuits In The Wild: 38
Interpreting Algorithmic Problems: 33
Exploring Polysemanticity and Superposition: 45
Analysing Training Dynamics: 38
Techniques, Tooling, and Automation: 59
Image Model Interpretability: 18
Interpreting Reinforcement Learning: 22
Studying Learned Features in Language Models: 63
Understanding neurons
Difficulty
A
Number
1.6
Problem
Hunt through Neuroscope for the toy models and look for interesting neurons to focus on.
Understanding neurons
Difficulty
A
Number
1.7
Problem
Can you find any polysemantic neurons in Neuroscope? Explore this.
Difficulty
A
Number
1.23
Problem
Choose your own adventure: Take a bunch of text with interesting patterns and run the models over it. Look for tokens they do really well on and try to reverse engineer what's going on!
Understanding neurons
Difficulty
B
Number
1.1
Problem
How far can you get deeply reverse engineering a neuron in a 1L model? 1L is particularly easy since each neuron's output adds directly to the logits.
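A minimal sketch of the "adds directly to the logits" point, assuming TransformerLens and its gelu-1l toy model (the neuron index is a placeholder): project a neuron's output weights through the unembedding to see which tokens it boosts.
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # any 1L model works
neuron_idx = 0  # hypothetical neuron to inspect

# In a 1L model, each neuron's output direction (a row of W_out) is added
# straight into the residual stream that feeds the unembedding, so
# W_out[neuron] @ W_U is its direct logit effect per unit of activation.
logit_effect = model.W_out[0, neuron_idx] @ model.W_U  # [d_vocab]

top = torch.topk(logit_effect, 10)
print(model.to_str_tokens(top.indices))  # tokens this neuron most boosts
```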
Understanding neurons
Difficulty
B
Number
1.2
Problem
Find an interesting neuron you think represents a feature. Can you fully reverse engineer which direction should activate that feature, and compare to neuron input direction?
Understanding neurons
Difficulty
B
Number
1.3
Problem
Look for trigram neurons in a 1L model and try to reverse engineer them (e.g., "ice cream" -> "sundae").
Understanding neurons
Difficulty
B
Number
1.4
Problem
Check out the SoLU paper for more ideas on 1L neurons to find and reverse engineer.
Understanding neurons
Difficulty
B
Number
1.8
Problem
Are there neurons whose behaviour can be matched by a regex or other code? If so, run it on a ton of text and compare the output.
How do larger models differ?
Difficulty
B
Number
1.9
Problem
How do 3-layer and 4-layer attention-only models differ from 2L? (For instance, induction heads only appeared with 2L. Can you find something useful that only appears at 3L or higher?)
How do larger models differ?
Difficulty
B
Number
1.10
Problem
How do 3-layer and 4-layer attention-only models differ from 2L? Look for composition scores - try to identify pairs of heads that compose a lot.
How do larger models differ?
Difficulty
B
Number
1.11
Problem
How do 3-layer and 4-layer attention-only models differ from 2L? Look for evidence of composition.
How do larger models differ?
Difficulty
B
Number
1.12
Problem
How do 3-layer and 4-layer attention-only models differ from 2L? Ablate a single head and run the model on a lot of text. Look at the change in performance. Do any heads matter a lot that aren't induction heads?
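A zero-ablation sketch for this, assuming TransformerLens and its attn-only-2l toy model (swap in the 3L/4L models; the text is a placeholder): hook each head's output to zero and compare the loss.
```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("attn-only-2l")
tokens = model.to_tokens("placeholder text - use a long, varied sample in practice")

def ablate_head(layer, head):
    def hook(z, hook):
        z[:, :, head, :] = 0.0  # zero this head's output (hook_z)
        return z
    return (utils.get_act_name("z", layer), hook)

base_loss = model(tokens, return_type="loss")
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        loss = model.run_with_hooks(
            tokens, return_type="loss", fwd_hooks=[ablate_head(layer, head)]
        )
        print(layer, head, (loss - base_loss).item())
```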
How do larger models differ?
Difficulty
B
Number
1.13
Problem
Look for tasks that an n-layer model can't do, but an n+1-layer model can, and look for a circuit that explains this. (Start by running both models on a bunch of text and look for per-token probability differences)
How do larger models differ?
Difficulty
B
Number
1.14
Problem
How do 1L SoLU/GELU models differ from 1L attention-only?
How do larger models differ?
Difficulty
B
Number
1.15
Problem
How do 2L SoLU models differ from 1L?
How do larger models differ?
Difficulty
B
Number
1.16
Problem
How does 1L GELU differ from 1L SoLU?
How do larger models differ?
Difficulty
B
Number
1.17
Problem
Analyse how a larger model "fixes the bugs" of a smaller model.
How do larger models differ?
Difficulty
B
Number
1.18
Problem
Does a 1L MLP transformer fix the skip trigram bugs of a 1L Attn Only model? If so, how?
How do larger models differ?
Difficulty
B
Number
1.19
Problem
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Try looking at split-token induction, where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens. E.g " Claire" vs. "Cl" "aire"
How do larger models differ?
Difficulty
B
Number
1.20
Problem
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at misfiring when the previous token appears multiple times with different following tokens
How do larger models differ?
Difficulty
B
Number
1.21
Problem
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at stopping induction on a token that likely shows the end of a repeated string (e.g, . or ! or ")
How do larger models differ?
Difficulty
B
Number
1.22
Problem
Does a 2L MLP model fix these bugs (1.19 -1.21) too?
Understanding neurons
Difficulty
C
Number
1.5
Problem
How far can you get deeply reverse engineering a neuron in a 2+ layer model?
Circuits in natural language
Difficulty
A
Number
2.13
Problem
Choose your own adventure! Try finding behaviours of your own related to natural language circuits.
Circuits in code models
Difficulty
A
Number
2.17
Problem
Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic flavored tasks should be easiest.
Extensions to IOI paper
Difficulty
A
Number
2.18
Problem
Understand IOI in the Stanford mistral models. Does the same circuit arise? (You should be able to near exactly copy Redwood's code for this)
Existing Work
Extensions to IOI paper
Difficulty
A
Number
2.19
Problem
Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?
Extensions to IOI paper
Difficulty
A
Number
2.21
Problem
Can we reverse engineer how duplicate token heads work deeply? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates since the current token is always a copy of itself?
Circuits in natural language
Difficulty
B
Number
2.1
Problem
Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?
Circuits in natural language
Difficulty
B
Number
2.2
Problem
Continuing sequences that are common in natural language (e.g., "1 2 3 4" -> "5", "Monday\nTuesday\n" -> "Wednesday").
Existing Work
Currently working
Pablo Hansen- April 18- 2024
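For 2.2 (and the related 2.3), a quick sanity check before any circuit analysis is just to look at the model's top next-token predictions on a few prompts; a sketch assuming TransformerLens and GPT-2 Small, with placeholder prompts:
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompts = ["1 2 3 4", "Monday\nTuesday\n"]  # placeholder prompts
for prompt in prompts:
    logits = model(model.to_tokens(prompt))
    probs = logits[0, -1].softmax(dim=-1)
    top = torch.topk(probs, 5)
    print(repr(prompt), "->", model.to_str_tokens(top.indices), top.values.tolist())
```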
Circuits in natural language
Difficulty
B
Number
2.3
Problem
A harder example would be numbers at the start of lines, like "1. Blah blah blah \n2. Blah blah blah\n"-> "3". Feels like it must be doing something induction-y!
Circuits in natural language
Difficulty
B
Number
2.4
Problem
3 letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> RFU
Circuits in natural language
Difficulty
B
Number
2.5
Problem
Converting names to emails, like "Katy Johnson <" -> "katy_johnson"
Circuits in natural language
Difficulty
B
Number
2.8
Problem
Learning that words after full stops are capital letters.
Circuits in natural language
Difficulty
B
Number
2.9
Problem
Counting objects described in text. (E.g, I picked up an apple, a pear, and an orange. I was holding three fruits.)
Circuits in natural language
Difficulty
B
Number
2.11
Problem
Reverse engineer an induction head in a non-toy model.
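One common way to find candidate induction heads (a sketch, assuming TransformerLens and GPT-2 Small; the threshold is arbitrary): run the model on a random token sequence repeated twice, and score each head by its attention to the position seq_len - 1 tokens back, i.e. the token after the earlier copy of the current token.
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len = 50
rand = torch.randint(100, model.cfg.d_vocab, (4, seq_len))
tokens = torch.cat([rand, rand], dim=1)  # random sequence repeated twice

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, dest, src]
    # attention to the token seq_len - 1 positions back = induction score
    score = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1).mean(dim=(0, 2))
    for head, s in enumerate(score):
        if s > 0.4:  # arbitrary threshold
            print(f"L{layer}H{head} induction score {s:.2f}")
```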
Circuits in natural language
Difficulty
B
Number
2.12
Problem
Choosing the right pronouns (e.g., "Lina is a great friend, isn't" -> " she")
Existing Work
Currently working
Alana Xiang - 5 May 2023
Circuits in code models
Difficulty
B
Number
2.14
Problem
Closing brackets. Bonus: Tracking correct brackets - [, (, {, etc.
Circuits in code models
Difficulty
B
Number
2.15
Problem
Closing HTML tags
Extensions to IOI paper
Difficulty
B
Number
2.20
Problem
Is there a general pattern for backup-ness? (Follows 2.19)
Currently working
Manan Suri - 14 July, 2023
Extensions to IOI paper
Difficulty
B
Number
2.22
Problem
Understand IOI in GPT-Neo. Same size but seems to do IOI via MLP composition.
Extensions to IOI paper
Difficulty
B
Number
2.25
Problem
GPT-Neo wasn't trained with dropout - check 2.24 on this.
Extensions to IOI paper
Difficulty
B
Number
2.26
Problem
Reverse engineering L4H11, a really sharp previous token head in GPT-2-small, at the parameter level.
Confusing things
Difficulty
B
Number
2.29
Problem
Why do models have so many induction heads? How do they specialise, and why does the model need so many?
Confusing things
Difficulty
B
Number
2.30
Problem
Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?
Confusing things
Difficulty
B
Number
2.31
Problem
Can we find evidence of the residual stream as shared bandwidth hypothesis?
Confusing things
Difficulty
B
Number
2.32
Problem
Can we find evidence of the residual stream as shared bandwidth hypothesis? In particular, the idea that the model dedicates parameters to memory management and cleaning up memory once it's used. Are there neurons with high negative cosine sim (so the output erases the input feature) Do they correspond to cleaning up specific features?
Confusing things
Difficulty
B
Number
2.33
Problem
What happens to the memory in an induction circuit? (See 2.32)
Circuits in natural language
Difficulty
C
Number
2.6
Problem
A harder version of 2.5 is constructing an email from a snippet, like Name: Jess Smith, Email: last name dot first name k @ gmail
Circuits in natural language
Difficulty
C
Number
2.7
Problem
Interpret factual recall. Start with ROME's work with causal tracing, but how much more specific can you get? Heads? Neurons?
Existing Work
Circuits in natural language
Difficulty
C
Number
2.10
Problem
Interpreting memorisation. Sometimes GPT knows phone numbers. How?
Circuits in code models
Difficulty
C
Number
2.16
Problem
Methods depend on object type (e.g., x.append if x is a list, x.update if x is a dictionary)
Extensions to IOI paper
Difficulty
C
Number
2.23
Problem
What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?
Extensions to IOI paper
Difficulty
C
Number
2.24
Problem
Under what conditions do the compensation mechanisms occur, where ablating a name mover doesn't reduce performance much? Is it due to dropout?
Extensions to IOI paper
Difficulty
C
Number
2.27
Problem
MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?
Extensions to IOI paper
Difficulty
C
Number
2.28
Problem
Understanding what's happening in the adversarial examples, most notably the S-Inhibition head attention pattern (hard)
Studying larger models
Difficulty
C
Number
2.34
Problem
GPT-J contains translation heads. Can you interpret how they work and what they do?
Studying larger models
Difficulty
C
Number
2.35
Problem
Try to find and reverse engineer fancier induction heads like pattern matching heads - try GPT-J or GPT-NeoX.
Studying larger models
Difficulty
C
Number
2.36
Problem
What's up with few-shot learning? How does it work?
Studying larger models
Difficulty
C
Number
2.37
Problem
How does addition work? (Focus on 2-digit)
Studying larger models
Difficulty
C
Number
2.38
Problem
What's up with Tim Dettmers' emergent features in the residual stream? Do they map to anything interpretable? What if we look at max activating dataset examples?
Beginner problems
Difficulty
A
Number
3.1
Problem
Sorting fixed-length lists. (format - START 4 6 2 9 MID 2 4 6 9)
Existing Work
Beginner problems
Difficulty
A
Number
3.2
Problem
Sorting variable-length lists. (What's the sorting algorithm? What's the longest list you can get it to sort? How does length affect accuracy?)
Beginner problems
Difficulty
A
Number
3.3
Problem
Interpret a 2L MLP (one hidden layer) trained to do modular addition. (Analogous to Neel's grokking work)
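A minimal training sketch for the one-hidden-layer MLP version of modular addition (plain PyTorch; the modulus, width, train fraction, and optimiser settings are placeholder choices in the spirit of Neel's grokking setup):
```python
import torch
import torch.nn as nn

p = 113  # modulus
a = torch.arange(p).repeat_interleave(p)
b = torch.arange(p).repeat(p)
labels = (a + b) % p
x = torch.cat([nn.functional.one_hot(a, p), nn.functional.one_hot(b, p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 512), nn.ReLU(), nn.Linear(512, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

# train on a random 30% of pairs and watch generalisation on the rest
perm = torch.randperm(p * p)
split = p * p * 3 // 10
train_idx, test_idx = perm[:split], perm[split:]
for step in range(20000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            acc = (model(x[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(step, loss.item(), acc.item())
```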
Beginner problems
Difficulty
A
Number
3.4
Problem
Interpret a 1L MLP trained to do modular subtraction (Analogous to Neel's grokking work)
Beginner problems
Difficulty
A
Number
3.5
Problem
Taking the minimum or maximum of two ints
Existing Work
Beginner problems
Difficulty
A
Number
3.6
Problem
Permuting lists
Beginner problems
Difficulty
A
Number
3.7
Problem
Calculating sequences with a Fibonacci-style recurrence (predicting the next element from the previous two)
Questions about language models
Difficulty
A
Number
3.21
Problem
Train a 1L attention-only transformer with rotary to predict the previous token and reverse engineer how it does this.
Currently working
5/7/23: Eric (repo: https://github.com/DKdekes/rotary-interp)
Extending Othello-GPT
Difficulty
A
Number
3.30
Problem
Try one of Neel's concrete Othello-GPT projects.
Harder problems
Difficulty
B
Number
3.8
Problem
5-digit addition/subtraction.
Harder problems
Difficulty
B
Number
3.9
Problem
Predicting the output of a simple code function. E.g., problems like "a = 1 2 3. a[2] = 4. a -> 1 2 4"
Existing Work
Harder problems
Difficulty
B
Number
3.10
Problem
Graph theory problems like this. Unsure of the correct input format. Try a bunch. See here
Harder problems
Difficulty
B
Number
3.11
Problem
Train a model on multiple algorithmic tasks we understand (like modular addition and subtraction). Compare to a model trained on each task. Does it learn the same circuits? Is there superposition?
Currently working
Joshua (jhdhill@uwaterloo.ca) - 31 Jan 2024
Harder problems
Difficulty
B
Number
3.12
Problem
Train models for automata tasks and interpret them. Do your results match the theory?
Harder problems
Difficulty
B
Number
3.13
Problem
In-Context Linear Regression - the transformer gets a sequence (x_1, y_1, x_2, y_2, ...) where y_i = Ax_i + b. A and b are different for each prompt, and need to be learned in-context. (Code here)
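A data-generation sketch for this setup (the dimensions, scalar y, and zero-padding convention are placeholder choices, not the format from the linked code):
```python
import torch

def make_batch(batch=64, n_pairs=20, d_x=4):
    # Each prompt has its own (A, b); the model must infer them in-context.
    A = torch.randn(batch, 1, d_x)            # scalar y here: y = <A, x> + b
    b = torch.randn(batch, 1, 1)
    x = torch.randn(batch, n_pairs, d_x)
    y = (x * A).sum(-1, keepdim=True) + b     # [batch, n_pairs, 1]
    # interleave (x_1, y_1, x_2, y_2, ...); pad y to d_x so positions match
    y_pad = torch.cat([y, torch.zeros(batch, n_pairs, d_x - 1)], dim=-1)
    seq = torch.stack([x, y_pad], dim=2).reshape(batch, 2 * n_pairs, d_x)
    return seq, y

seq, y = make_batch()
print(seq.shape)  # [64, 40, 4]
```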
Harder problems
Difficulty
B
Number
3.16
Problem
Predict repeated subsequences in randomly generated tokens, and see if you can find and reverse engineer induction heads.
Difficulty
B
Number
3.18
Problem
Build a toy model of Indirect Object Identification - train a tiny attention-only model on an algorithmic task simulating IOI - and reverse-engineer the learned solution. Compare it to the circuit found in GPT-2 Small.
Questions about language models
Difficulty
B
Number
3.22
Problem
Train a 3L attention-only transformer to perform the Indirect Object Identification task. Can it do the task? Does it learn the same circuit found in GPT-2 Small?
Questions about language models
Difficulty
B
Number
3.23
Problem
Redo Neel's modular addition analysis with GELU. Does it change things?
Questions about language models
Difficulty
B
Number
3.26
Problem
In modular addition, look at what different dimensionality reduction techniques do on different weight matrices. Can you identify which weights matter most? Which neurons form clusters for each frequency? Anything from activations?
Extending Othello-GPT
Difficulty
B
Number
3.32
Problem
Neuron Interpretability and Studying Superposition - try to understand the model's MLP neurons, and explore what techniques do and don't work. Try to build our understanding of transformer MLPs in general.
Harder problems
Difficulty
C
Number
3.14
Problem
Harder problems in the style of In-Context Linear Regression, with other functions learned in-context. See 3.13.
Harder problems
Difficulty
C
Number
3.15
Problem
5 digit (or binary) multiplication
Harder problems
Difficulty
C
Number
3.17
Problem
Choose your own adventure! Find your own algorithmic problem. Leetcode easy is probably a good source.
Difficulty
C
Number
3.19
Problem
Is 3.18 consistent across random seeds, or can other algorithms be learned? Can a 2L model learn this? What happens if you add more MLPs or more layers?
Difficulty
C
Number
3.20
Problem
Reverse-engineer Othello-GPT. Can you reverse-engineer the algorithms it learns, or the features the probes find?
Questions about language models
Difficulty
C
Number
3.24
Problem
How does memorisation work? Try training a one hidden layer MLP to memorise random data, or training a transformer on a fixed set of random strings of tokens.
Questions about language models
Difficulty
C
Number
3.25
Problem
Compare different dimensionality reduction techniques on modular addition or a problem you feel you understand.
Questions about language models
Difficulty
C
Number
3.27
Problem
Is direct logit attribution always useful? Can you find examples where it's highly misleading?
Extending Othello-GPT
Difficulty
C
Number
3.31
Problem
Looking for modular circuits - try to find the circuits used to compute the world model and to use the world model to compute the next move. Try to understand each in isolation and use this to understand how they fit together. See what you can learn about finding modular circuits in general.
Existing Work
Extending Othello-GPT
Difficulty
C
Number
3.33
Problem
Transformer Circuits Laboratory - Explore and test other conjectures about transformer circuits - e.g, can we figure out how the model manages memory in the residual stream?
Deep learning mysteries
Difficulty
D
Number
3.28
Problem
Explore the Lottery Ticket Hypothesis
Deep learning mysteries
Difficulty
D
Number
3.29
Problem
Explore Deep Double Descent
Confusions to study in Toy Models of Superposition
Difficulty
A
Number
4.1
Problem
Does dropout create a privileged basis? Put dropout on the hidden layer of the ReLU output model and study how this changes the results.
Existing Work
Post
Currently working
14 April 2023: Kunvar (firstuserhere)
Confusions to study in Toy Models of Superposition
Difficulty
A
Number
4.5
Problem
Explore neuron superposition by training their absolute value model on functions of multiple variables. Make inputs binary (0/1) and look at the AND and OR of element pairs.
Confusions to study in Toy Models of Superposition
Difficulty
A
Number
4.7
Problem
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features binary, i.e., 0 or 1 (two possible values)
Confusions to study in Toy Models of Superposition
Difficulty
A
Number
4.10
Problem
What happens if you replace ReLU's with GeLU's in the toy models?
Currently working
May 1, 2023 - Kunvar (firstuserhere)
Studying bottleneck superposition in real language models
Difficulty
A
Number
4.25
Problem
Can you find any examples of the geometric superposition configurations in the residual stream of a language model?
Comparing SoLU/GELU
Difficulty
A
Number
4.37
Problem
How do TransformerLens SoLU / GeLU models compare in Neuroscope under the SoLU polysemanticity metric? (What fraction of neurons seem monosemantic)
Confusions to study in Toy Models of Superposition
Difficulty
B
Number
4.2
Problem
Replicate their absolute value model and study some of the variants of the ReLU output models.
Currently working
May 4, 2023 - Kunvar (firstuserhere)
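For 4.2 (and the other Toy Models of Superposition variants), a sketch of the basic ReLU output model with tied weights (feature count, hidden size, sparsity, and uniform importance are placeholder choices):
```python
import torch
import torch.nn as nn

n_features, d_hidden, sparsity = 20, 5, 0.9
W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.AdamW([W, b], lr=1e-3)

for step in range(10000):
    # sparse features: nonzero with prob 1 - sparsity, uniform in [0, 1]
    feats = torch.rand(1024, n_features)
    mask = (torch.rand(1024, n_features) > sparsity).float()
    x = feats * mask
    hidden = x @ W                         # compress into d_hidden dims
    x_hat = torch.relu(hidden @ W.T + b)   # reconstruct with tied weights
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# W @ W.T shows which features share directions, i.e. superposition
print((W @ W.T).detach())
```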
Confusions to study in Toy Models of Superposition
Difficulty
B
Number
4.3
Problem
Explore neuron superposition by training their absolute value model on a more complex function like x -> x^2.
Confusions to study in Toy Models of Superposition
Difficulty
B
Number
4.4
Problem
What happens to their ReLU output model when there's non-uniform sparsity? E.g, one class of less sparse features and another of very sparse
Confusions to study in Toy Models of Superposition
Difficulty
B
Number
4.6
Problem
Explore neuron superposition by training their absolute value model on functions of multiple variables. Keep the inputs as uniform reals in [0, 1] and look at max(x, y)
Confusions to study in Toy Models of Superposition
Difficulty
B
Number
4.8
Problem
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features discrete (1, 2, 3)
Confusions to study in Toy Models of Superposition
Difficulty
B
Number
4.9
Problem
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features uniform [0.5, 1]
Currently working
April 30, 2023; Kunvar(firstuserhere)
Studying bottleneck superposition in real language models
Difficulty
B
Number
4.21
Problem
Induction heads copy the token they attend to into the output, which involves storing which of ~50,000 tokens it is. How is this stored in a 64-dimensional space?
Studying bottleneck superposition in real language models
Difficulty
B
Number
4.22
Problem
How does the previous token head in an induction circuit communicate the value of the previous token to the key of the induction head? Bonus: What residual stream subspace does it take up? Is there interference?
Studying bottleneck superposition in real language models
Difficulty
B
Number
4.23
Problem
How does the IOI circuit communicate names/positions between composing heads?
Studying bottleneck superposition in real language models
Difficulty
B
Number
4.24
Problem
Are there dedicated dimensions for positional embeddings? Do any other components write to those dimensions?
Studying neuron superposition in real models
Difficulty
B
Number
4.29
Problem
Look at a polysemantic neuron in a 1L language model. Can you figure out how the model disambiguates what feature it is?
Studying neuron superposition in real models
Difficulty
B
Number
4.31
Problem
Take a feature that's part of a polysemantic neuron in a 1L language model and try to identify every neuron that represents that feature. Is it sparse or diffuse?
Comparing SoLU/GELU
Difficulty
B
Number
4.38
Problem
Can you find any better metrics for polysemanticity?
Comparing SoLU/GELU
Difficulty
B
Number
4.39
Problem
The paper speculates LayerNorm lets the model "smuggle through" superposition in SoLU models by smearing features across many dimensions and letting LayerNorm scale it up. Can you find evidence of this?
Comparing SoLU/GELU
Difficulty
B
Number
4.40
Problem
How similar are the neurons between SoLU/GELU models of the same layers?
Confusions to study in Toy Models of Superposition
Difficulty
C
Number
4.11
Problem
Can you find a toy model where GELU acts significantly differently from ReLU?
Currently working
May 1, 2023 - Kunvar (firstuserhere)
Building toy models of superposition
Difficulty
C
Number
4.12
Problem
Build a toy model of a classification problem with cross-entropy loss
Currently working
Building toy models of superposition
Difficulty
C
Number
4.13
Problem
Build a toy model of neuron superposition that has many more hidden features than output features
Building toy models of superposition
Difficulty
C
Number
4.14
Problem
Build a toy model that needs multiple hidden layers of ReLUs. Can computation in superposition happen across several layers? E.g., max(|x|, |y|)
Building toy models of superposition
Difficulty
C
Number
4.15
Problem
Build a toy model of attention head superposition/polysemanticity. Can you find a task where the model wants to do different things with an attention head on different inputs? How does it represent things internally / deal with interference?
Making toy model counterexamples
Difficulty
C
Number
4.17
Problem
Make toy models that are counterexamples in MI. A learned example of a network with a non-linear representation.
Making toy model counterexamples
Difficulty
C
Number
4.18
Problem
Make toy models that are counterexamples in MI. A network without a discrete number of features.
Making toy model counterexamples
Difficulty
C
Number
4.19
Problem
Make toy models that are counterexamples in MI. A non-decomposable neural network.
Making toy model counterexamples
Difficulty
C
Number
4.20
Problem
Make toy models that are counterexamples in MI. A task where networks can learn multiple different sets of features.
Studying bottleneck superposition in real language models
Difficulty
C
Number
4.26
Problem
Can you find any examples of locally almost-orthogonal bases?
Studying bottleneck superposition in real language models
Difficulty
C
Number
4.27
Problem
Do language models have "genre" directions that detect the type of text, and then represent features specific to each genre in the same subspace?
Studying neuron superposition in real models
Difficulty
C
Number
4.30
Problem
Look at a polysemantic neuron in a 2L language model. Can you figure out how the model disambiguates what feature it is?
Studying neuron superposition in real models
Difficulty
C
Number
4.32
Problem
Try to fully reverse engineer a feature discovered in 4.31.
Studying neuron superposition in real models
Difficulty
C
Number
4.33
Problem
Can you use superposition to create an adversarial example for a neuron?
Studying neuron superposition in real models
Difficulty
C
Number
4.34
Problem
Can you find any examples of the asymmetric superposition motif in the MLP of a 1-2 layer language model?
Difficulty
C
Number
4.35
Problem
Pick a simple feature of language (e.g, is number, is base64) and train a linear probe to detect that in the MLP activations of a 1L language model.
Comparing SoLU/GELU
Difficulty
C
Number
4.41
Problem
How do GELU and ReLU compare with respect to polysemanticity? Replicate the SoLU analysis.
Getting rid of superposition
Difficulty
C
Number
4.42
Problem
If you train a 1L/2L language model with d_mlp = 100 * d_model, does superposition go away?
Getting rid of superposition
Difficulty
C
Number
4.43
Problem
Study the T5 XXL. It's 11B params and not supported by TransformerLens. Expect major infrastructure pain.
Getting rid of superposition
Difficulty
C
Number
4.45
Problem
Pick an open problem at the end of Toy Models of Superposition.
Building toy models of superposition
Difficulty
D
Number
4.16
Problem
Build a toy model where the model needs to deal with simultaneous interference, and try to understand how it does so, or whether it can.
Studying bottleneck superposition in real language models
Difficulty
D
Number
4.28
Problem
Can you find examples of a model learning to deal with simultaneous interference?
Difficulty
D
Number
4.36
Problem
Look for features in Neuroscope that seem to be represented by various neurons in a 1-2 layer language model. Train probes to detect some of them. Compare probe performance vs. neuron performance.
Getting rid of superposition
Difficulty
D
Number
4.44
Problem
Can you take a trained model, freeze all weights except one MLP layer, increase that layer's width 10x, copy each neuron 10 times, add noise, and fine-tune? Does this remove superposition / add new features?
Understanding fine-tuning
Difficulty
A
Number
5.16
Problem
How does model performance change on the original training distribution when finetuning?
Understanding training dynamics in language models
Difficulty
A
Number
5.25
Problem
Look at attention heads on various texts and see if any have recognisable attention patterns, then analyse them over training.
Finding phase transitions
Difficulty
A
Number
5.26
Problem
Look for phase transitions in the Indirect Object Identification task. (Note: This might not have a phase change)
Studying path dependence
Difficulty
A
Number
5.33
Problem
How similar are the outputs of the Stanford CRFM models on a given text?
Studying path dependence
Difficulty
A
Number
5.35
Problem
Look for Indirect Object Identification capability in other models of approximately the same size.
Studying path dependence
Difficulty
A
Number
5.38
Problem
Can you find some problem where you understand the circuits and Git Re-Basin does work?
Algorithmic tasks - understanding grokking
Difficulty
B
Number
5.1
Problem
Understanding why 5 digit addition has a phase change per digit (so 6 total?!)
Algorithmic tasks - understanding grokking
Difficulty
B
Number
5.3
Problem
Look at the PCA of logits on the full dataset, or the PCA of a stack of flattened weights. If you plot a scatter plot of the first 2 components, the different phases of training are clearly visible. What's up with this?
Algorithmic tasks - understanding grokking
Difficulty
B
Number
5.6
Problem
What happens if we include in the loss one of the progress measures in Neel's grokking post? Can we accelerate or stop grokking?
Algorithmic tasks - understanding grokking
Difficulty
B
Number
5.7
Problem
Adam Jermyn provides an analytical argument and some toy models for why phase transition should be an inherent part of (some of) how models learn. Can you find evidence of this in more complex models?
Algorithmic tasks - understanding grokking
Difficulty
B
Number
5.8
Problem
Build on and refine Adam Jermyn's arguments and toy models - think about how they deviate from a real transformer, and build more faithful models.
Algorithmic tasks - lottery tickets
Difficulty
B
Number
5.9
Problem
For a toy model trained to form induction heads, is there a lottery-ticket style thing going on? Can you disrupt induction head formation by messing with the initialisation?
Algorithmic tasks - lottery tickets
Difficulty
B
Number
5.11
Problem
If we identify the parameters that form important circuits at the end of training on some toy task, then knock them out at the start of training, how much does that delay/stop generalisation?
Algorithmic tasks - lottery tickets
Difficulty
B
Number
5.12
Problem
Analysing how pairs of heads in an induction circuit compose over time - Can you find progress measures which predict these?
Algorithmic tasks - lottery tickets
Difficulty
B
Number
5.13
Problem
Analysing how pairs of heads in an induction circuit compose over time - Can we predict which heads will learn to compose first?
Algorithmic tasks - lottery tickets
Difficulty
B
Number
5.14
Problem
Analysing how pairs of heads in an induction circuit compose over time - Does the composition develop as a phase transition?
Understanding fine-tuning
Difficulty
B
Number
5.17
Problem
How is the model different on fine-tuned text? Look at examples where the model does much better after fine-tuning, and some normal text.
Understanding fine-tuning
Difficulty
B
Number
5.18
Problem
Try activation patching between the old and fine-tuned model and see how hard recovering performance is.
Understanding fine-tuning
Difficulty
B
Number
5.19
Problem
Look at max activating text for various neurons in the original models. How has it changed post fine-tuning?
Understanding fine-tuning
Difficulty
B
Number
5.20
Problem
Explore further and see what's going on with fine-tuning mechanistically.
Understanding training dynamics in language models
Difficulty
B
Number
5.22
Problem
Can you replicate the induction head phase transition results in the various checkpointed models in TransformerLens? (If code works for attn-only-2l it should work for them all)
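A sketch of the replication loop, assuming TransformerLens's checkpointed attn-only-2l model and its checkpoint_index argument to from_pretrained (the checkpoint indices are placeholders; check how many checkpoints the model actually has):
```python
import torch
from transformer_lens import HookedTransformer

seq_len = 50
rand = torch.randint(100, 10000, (4, seq_len))
tokens = torch.cat([rand, rand], dim=1)  # repeated random tokens

def max_induction_score(model):
    _, cache = model.run_with_cache(tokens)
    scores = []
    for layer in range(model.cfg.n_layers):
        pattern = cache["pattern", layer]  # [batch, head, dest, src]
        scores.append(pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1).mean(dim=(0, 2)))
    return torch.stack(scores).max().item()

for ckpt in [0, 5, 10, 20, 40, 80]:  # placeholder checkpoint indices
    model = HookedTransformer.from_pretrained("attn-only-2l", checkpoint_index=ckpt)
    print(ckpt, max_induction_score(model))
```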
Understanding training dynamics in language models
Difficulty
B
Number
5.23
Problem
Look at the neurons in TransformerLens SoLU models during training. Do they tend to form as a phase transition?
Finding phase transitions
Difficulty
B
Number
5.27
Problem
Try digging into the specific heads that act on IOI and look for phase transitions. Use direct logit attribution for the name movers.
Finding phase transitions
Difficulty
B
Number
5.28
Problem
Study the attention patterns of each category of heads in IOI for phase transitions.
Finding phase transitions
Difficulty
B
Number
5.29
Problem
Look for phase transitions in simple IOI-style algorithmic tasks, like few-shot learning, addition, sorting words alphabetically...
Finding phase transitions
Difficulty
B
Number
5.30
Problem
Look for phase transitions in soft induction heads like translation.
Studying path dependence
Difficulty
B
Number
5.34
Problem
How much do the Stanford CRFM models differ with algorithmic tasks like Indirect Object Identification?
Studying path dependence
Difficulty
B
Number
5.36
Problem
When model scale varies (e.g, GPT-2 small vs. medium) is there anything the smaller model can do that the larger one can't do? (Look at difference in per token log prob)
Studying path dependence
Difficulty
B
Number
5.37
Problem
Try applying the Git Re-Basin techniques to a 2L MLP trained for modular addition. Does this work? If you use Neel's grokking work to analyse the circuits involved, how does the re-basin technique map onto the circuits?
Algorithmic tasks - understanding grokking
Difficulty
C
Number
5.2
Problem
Why do 5-digit addition phase changes happen in that order?
Algorithmic tasks - understanding grokking
Difficulty
C
Number
5.4
Problem
Can we predict when grokking will happen? Bonus: Without using any future information?
Algorithmic tasks - understanding grokking
Difficulty
C
Number
5.5
Problem
Understanding why the model chooses specific frequencies (and why it switches mid-training sometimes!)
Algorithmic tasks - lottery tickets
Difficulty
C
Number
5.10
Problem
All Neel's toy models (attn-only, gelu, solu) were trained with the same data shuffle and weight initialisation. Many induction heads aren't shared, but L2H3 in 3L and L1H6 in 2L always are. What's up with that?
Understanding fine-tuning
Difficulty
C
Number
5.15
Problem
Build a toy model of fine-tuning (train on task 1, fine-tune on task 2). What is going on internally? Any interesting motifs?
Understanding fine-tuning
Difficulty
C
Number
5.21
Problem
Can you find any phase transitions in the fine-tuning checkpoints?
Understanding training dynamics in language models
Difficulty
C
Number
5.24
Problem
Use the per-token loss analysis technique from the induction heads paper to look for more phase changes.
Finding phase transitions
Difficulty
C
Number
5.31
Problem
Look for phase transitions in benchmark performance or specific questions from a benchmark.
Finding phase transitions
Difficulty
D
Number
5.32
Problem
Hypothesis: Scaling laws happen because models experience a ton of tiny phase changes which average out to a smooth curve due to the law of large numbers. Can you find evidence for or against that?
Breaking current techniques
Difficulty
A
Number
6.1
Problem
Try to find concrete edge cases where a technique breaks - start with a misleading example in a real model or training a toy model with one.
Breaking current techniques
Difficulty
A
Number
6.7
Problem
Find edge cases where ablations break. (Start w/ backup name movers in the IOI circuit, where we know zero ablations break)
ROME activation patching
Difficulty
A
Number
6.15
Problem
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). How do results change when you do single layers?
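A single-layer patching sketch (assuming TransformerLens; GPT-2 Small stands in for the larger models ROME used, the prompts are placeholders, and patching is simplified to the final token position rather than ROME's corrupted-subject positions):
```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
clean = "The Eiffel Tower is located in the city of"
corrupt = "The Colosseum is located in the city of"
answer = model.to_single_token(" Paris")

_, clean_cache = model.run_with_cache(model.to_tokens(clean))
corrupt_tokens = model.to_tokens(corrupt)

def patch_mlp_out(layer):
    name = utils.get_act_name("mlp_out", layer)
    def hook(act, hook):
        act[:, -1, :] = clean_cache[name][:, -1, :]  # patch final position only
        return act
    return (name, hook)

for layer in range(model.cfg.n_layers):
    logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[patch_mlp_out(layer)])
    print(layer, logits[0, -1, answer].item())  # logit for " Paris" after patching
```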
ROME activation patching
Difficulty
A
Number
6.16
Problem
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). Can you get anywhere when patching specific neurons?
Automatically find circuits
Difficulty
A
Number
6.18
Problem
Automate ways to find previous token heads. (Bonus: Add to TransformerLens!)
Existing Work
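For 6.18, a detector sketch (assuming TransformerLens; the threshold is arbitrary): score each head by its average attention from each position to the previous position.
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("placeholder text - use a long, varied sample in practice")

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, dest, src]
    prev_score = pattern.diagonal(offset=-1, dim1=-2, dim2=-1).mean(dim=(0, 2))
    for head, s in enumerate(prev_score):
        if s > 0.5:  # arbitrary threshold
            print(f"L{layer}H{head} previous-token score {s:.2f}")
```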
Automatically find circuits
Difficulty
A
Number
6.19
Problem
Automate ways to find duplicate token heads. (Bonus: Add to TransformerLens!)
Existing Work
Automatically find circuits
Difficulty
A
Number
6.20
Problem
Automate ways to find induction heads. (Bonus: Add to TransformerLens!)
Existing Work
Automatically find circuits
Difficulty
A
Number
6.21
Problem
Automate ways to find translation heads. (Bonus: Add to TransformerLens!)
Refine max activating dataset examples
Difficulty
A
Number
6.36
Problem
Using 6.28: Finding the minimal example to activate a neuron by truncating the text - how often does this work?
Refine max activating dataset examples
Difficulty
A
Number
6.37
Problem
Using 6.28: Can you replicate the results of the interpretability illusion for Neel's toy models by finding neurons that seem monosemantic on Python code or on C4 (web text) separately, but are polysemantic when the datasets are combined?
Breaking current techniques
Difficulty
B
Number
6.2
Problem
Break direct logit attribution - start by looking at GPT-Neo Small where the logit lens (precursor to direct logit attribution) seems to work badly, but works well if you include the final layer and the unembed.
Breaking current techniques
Difficulty
B
Number
6.4
Problem
Find edge cases where linearising LayerNorm breaks. See some work by Eric Winsor at Conjecture.
Breaking current techniques
Difficulty
B
Number
6.5
Problem
Find edge cases where activation patching breaks. (It should break when you patch one variable but there's dependence on multiples)
Breaking current techniques
Difficulty
B
Number
6.8
Problem
Can you find places where one ablation (zero, mean, random) breaks but the others don't?
Breaking current techniques
Difficulty
B
Number
6.9
Problem
Find edge cases where composition scores break. (They don't work well for the IOI circuit)
Breaking current techniques
Difficulty
B
Number
6.10
Problem
Find edge cases where eigenvalue copying scores break.
Difficulty
B
Number
6.12
Problem
Try looking for composition on a specific input. Decompose the residual stream into the sum of outputs of previous heads, then decompose query, key, value into sums of terms from each previous head. Are any larger than the others / matter more if you ablate them / etc?
Difficulty
B
Number
6.14
Problem
Compare causal tracing to activation patching. Do they give the same outputs? Can you find situations where one breaks and the other doesn't? (Try IOI task or factual recall task)
ROME activation patching
Difficulty
B
Number
6.17
Problem
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). Can you get anywhere when patching some set of neurons? (E.g, the neurons that activate the most within the 10 layers?)
Automatically find circuits
Difficulty
B
Number
6.22
Problem
Automate ways to find few shot learning heads. (Bonus: Add to TransformerLens!)
Automatically find circuits
Difficulty
B
Number
6.23
Problem
Can you find an automated way to detect pointer arithmetic based induction heads vs. classic induction heads?
Automatically find circuits
Difficulty
B
Number
6.24
Problem
Can you find an automated way to detect the heads used in the IOI Circuit? (S-inhibition, name mover, negative name mover, backup name mover)
Automatically find circuits
Difficulty
B
Number
6.25
Problem
Can you automate detection of the heads used in factual recall to move information about the fact to the final token? (Try activation patching)
Automatically find circuits
Difficulty
B
Number
6.26
Problem
(Infrastructure) Combine some of the head detectors from 6.18-6.25 to make a "wiki" for a range of models, with information and scores for each head for how it falls into different categories. MVP: Pandas Dataframes with a row for each head and a column for each metric.
Refine max activating dataset examples
Difficulty
B
Number
6.30
Problem
Using 6.28: Corrupt different token embeddings in a sequence to see which matter.
Refine max activating dataset examples
Difficulty
B
Number
6.31
Problem
Using 6.28: Compare to randomly chosen directions in neuron activation space to see how clustered/monosemantic things seem.
Refine max activating dataset examples
Difficulty
B
Number
6.32
Problem
Using 6.28: Validate these by comparing to direct effect of neuron on the logits, or output vocab logits most boosted by that neuron.
Refine max activating dataset examples
Difficulty
B
Number
6.33
Problem
Using 6.28: Use a model like GPT-3 to find similar text to an existing example and see if they also activate the neuron. Bonus: Use them to replace specific tokens.
Refine max activating dataset examples
Difficulty
B
Number
6.34
Problem
Using 6.28: Look at dataset examples at different quantiles for neuron activations (25%, 50%, 75%, 90%, 95%). Does that change anything?
Refine max activating dataset examples
Difficulty
B
Number
6.38
Problem
Using 6.28: In SoLU models, compare max activating results for pre-SoLU, post-SoLU, and post LayerNorm activations. ('pre', 'mid', 'post' in TransformerLens). How consistent are they? Does one seem more principled?
Interpreting models with LLM's
Difficulty
B
Number
6.39
Problem
Can GPT-3 figure out trends in max activating examples for a neuron?
Interpreting models with LLM's
Difficulty
B
Number
6.40
Problem
Can you use GPT-3 to generate counterfactual prompts with lined up tokens to do activation patching on novel problems? (E.g, "John gave a bottle of milk to -> Mary" vs. "Mary gave a bottle of milk to -> John")
Apply techniques from non-mechanistic interpretability
Difficulty
B
Number
6.42
Problem
How well does feature attribution work on circuits we understand?
Difficulty
B
Number
6.48
Problem
Resolve some of the open issues/feature requests for TransformerLens.
Taking the "diff" of two models
Difficulty
B
Number
6.50
Problem
Using 6.49, run the two models on a bunch of text and look at the biggest per-token log prob differences.
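A sketch of the per-token comparison (assuming TransformerLens, two models that share a tokenizer, and its loss_per_token argument; the model names and text are placeholders):
```python
import torch
from transformer_lens import HookedTransformer

model_a = HookedTransformer.from_pretrained("gpt2")
model_b = HookedTransformer.from_pretrained("gpt2-medium")
text = "placeholder text - run over a large corpus in practice"
tokens = model_a.to_tokens(text)

loss_a = model_a(tokens, return_type="loss", loss_per_token=True)[0]
loss_b = model_b(tokens, return_type="loss", loss_per_token=True)[0]
diff = loss_a - loss_b  # > 0 where model_b does better on that prediction
str_tokens = model_a.to_str_tokens(text)

top = torch.topk(diff, k=min(10, diff.shape[0]))
for idx in top.indices:
    print(str_tokens[idx + 1], diff[idx].item())  # position i predicts token i + 1
```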
Taking the "diff" of two models
Difficulty
B
Number
6.51
Problem
Using 6.49, run them on various benchmarks and compare performance.
Taking the "diff" of two models
Difficulty
B
Number
6.52
Problem
Using 6.49, try "benchmarks" like performing algorithmic tasks like IOI, acronyms, etc. as from Circuits In the Wild.
Taking the "diff" of two models
Difficulty
B
Number
6.53
Problem
Using 6.49, try qualitative exploration like just generating text from the models and look for ideas.
Taking the "diff" of two models
Difficulty
B
Number
6.54
Problem
Build tooling to take the diff of two models with the same internal structure. Includes 6.49 but also lets you compare model internals!
Taking the "diff" of two models
Difficulty
B
Number
6.55
Problem
Using 6.54, look for the largest difference in weights.
Taking the "diff" of two models
Difficulty
B
Number
6.56
Problem
Using 6.54, run them on a bunch of text and look for largest difference in activations.
Taking the "diff" of two models
Difficulty
B
Number
6.57
Problem
Using 6.54, look at the direct logit attribution of layers and heads on various texts, and look for the biggest differences.
Taking the "diff" of two models
Difficulty
B
Number
6.58
Problem
Using 6.54, do activation patching on a piece of text where one model does much better than the other - are some parts key to improved performance?
Breaking current techniques
Difficulty
C
Number
6.3
Problem
Can you fix direct logit attribution in GPT-Neo small, e.g, by finding a linear approximation to the final layer by taking gradients? (Eleuther's tuned lens in #interp-across-depth would be a good place to start)
Breaking current techniques
Difficulty
C
Number
6.6
Problem
Find edge cases where causal scrubbing breaks.
Breaking current techniques
Difficulty
C
Number
6.11
Problem
Automate ways to identify heads that compose. Start with IOI circuit and the composition scores in A Mathematical Framework.
Difficulty
C
Number
6.13
Problem
Can you automate direct path patching as used in the IOI paper?
Automatically find circuits
Difficulty
C
Number
6.27
Problem
Can you automate the detection of something in neuron interpretability? E.g, trigram neurons
Automatically find circuits
Difficulty
C
Number
6.28
Problem
Find good ways to find the equivalent of max activating dataset examples for attention heads. Validate on induction circuits, then IOI. See post for ideas.
Refine max activating dataset examples
Difficulty
C
Number
6.29
Problem
Refine the max activating dataset examples technique for neuron interpretability to find minimal or diverse examples.
Refine max activating dataset examples
Difficulty
C
Number
6.35
Problem
Using 6.28: (Infrastructure) Add any of 6.29-6.34 to Neuroscope. Email Neel (neelnanda27@gmail.com) for codebase access.
Apply techniques from non-mechanistic interpretability
Difficulty
C
Number
6.43
Problem
Can you use probing to get evidence for or against predictions in Toy Models of Superposition?
Apply techniques from non-mechanistic interpretability
Difficulty
C
Number
6.44
Problem
Pick anything interesting from Rauker et al and try to apply the techniques to circuits we understand.
Difficulty
C
Number
6.46
Problem
Take existing circuits and explore quantitative ways to characterise whether each is a true circuit (or to disprove it!). Try causal scrubbing to start.
Difficulty
C
Number
6.47
Problem
Build on Arthur Conmy's work to automatically find circuits via recursive path patching
Taking the "diff" of two models
Difficulty
C
Number
6.49
Problem
Build tooling to take the "diff" of two models, treating them as a black box mapping inputs to outputs, so it works with models with different internal structure
Difficulty
C
Number
6.59
Problem
We understand how attention is calculated for a head using the QK matrix. This doesn't work for rotary attention. Can you find a principled alternative?
Interpreting models with LLM's
Difficulty
D
Number
6.41
Problem
Choose your own adventure - can you find a way to usefully use an LLM to interpret models?
Apply techniques from non-mechanistic interpretability
Difficulty
D
Number
6.45
Problem
Wiles et al gives an automated set of techniques to analyse bugs in image classification models. Can you get any traction adapting this to language models?
Building on Circuits thread
Difficulty
B
Number
7.7
Problem
Look for equivariance in late layers of vision models, i.e., symmetries in a network with analogous families of neurons. This likely looks like hunting in Microscope.
Building on Circuits thread
Difficulty
B
Number
7.9
Problem
Look for a wide array of circuits using the weight explorer. What interesting patterns and motifs can you find?
Multimodal models (CLIP interpretability)
Difficulty
B
Number
7.10
Problem
Look at the weights connecting neurons in adjacent layers. How sparse are they? Are there any clear patterns where one neuron is constructed from previous ones?
Multimodal models (CLIP interpretability)
Difficulty
B
Number
7.13
Problem
Can you refine the technique for generating max activating text strings? Could it be applied to language models?
Difficulty
B
Number
7.15
Problem
Does activation patching work on Inception?
Diffusion models
Difficulty
B
Number
7.16
Problem
Apply feature visualisation to neurons in diffusion models and see if any seem clearly interpretable.
Diffusion models
Difficulty
B
Number
7.17
Problem
Are there style transfer neurons in diffusion models? (E.g, activating on "in the style of Thomas Kinkade")
Diffusion models
Difficulty
B
Number
7.18
Problem
Are different circuits activating when different amounts of noise are input in diffusion models?
Reverse engineering image models
Difficulty
C
Number
7.1
Problem
Using Circuits techniques, how well can we reverse engineer ResNet?
Reverse engineering image models
Difficulty
C
Number
7.2
Problem
Vision Transformers - can you smush together transformer circuits and image circuits techniques? Which ones transfer?
Reverse engineering image models
Difficulty
C
Number
7.3
Problem
Using Circuits techniques, how well can we reverse engineer ConvNeXt, a modern image model architecture merging ResNet and vision transformer ideas?
Building on Circuits thread
Difficulty
C
Number
7.4
Problem
How well can you hand-code curve detectors? Can you include color? How much performance can you recover?
Building on Circuits thread
Difficulty
C
Number
7.5
Problem
Can you hand-code any other circuits? Start with other early vision neurons
Building on Circuits thread
Difficulty
C
Number
7.8
Problem
Digging into polysemantic neuron examples and trying to understand better what's going on there.
Multimodal models (CLIP interpretability)
Difficulty
C
Number
7.11
Problem
Can you rigorously reverse engineer any circuits, like the Curve Circuits paper?
Multimodal models (CLIP interpretability)
Difficulty
C
Number
7.12
Problem
Can you apply transformer circuits techniques to understand the attention heads in the image part?
Difficulty
C
Number
7.14
Problem
Train a checkpointed run of Inception. Do curve detectors form as a phase change?
Building on Circuits thread
Difficulty
D
Number
7.6
Problem
What happens if you apply causal scrubbing to the Circuits thread's claimed curve circuits algorithm? (This will take significant conceptual effort to extend to images since it's harder to precisely control input!)
Goal misgeneralisation
Difficulty
B
Number
8.6
Problem
Using 8.5: Possible starting point, Tree Gridworld and Monster Gridworld from Shah et al.
Decision Transformers
Difficulty
B
Number
8.8
Problem
Can you apply transformer circuits to a decision transformer? What do you find?
Existing Work
Decision Transformers
Difficulty
B
Number
8.9
Problem
Try training a 1L decision transformer on a toy problem, like finding the shortest path in a graph.
Interpreting policy gradients
Difficulty
B
Number
8.16
Problem
Can you interpret a small model trained with policy gradients on a gridworld task?
Interpreting policy gradients
Difficulty
B
Number
8.17
Problem
Can you interpret a small model trained with policy gradients on an OpenAI Gym task?
Interpreting policy gradients
Difficulty
B
Number
8.18
Problem
Can you interpret a small model trained with policy gradients on an Atari game (e.g, Pong)?
Difficulty
B
Number
8.22
Problem
Choose your own adventure! There's lots of work in RL - pick something you're excited about and try to reverse engineer something!
AlphaZero
Difficulty
C
Number
8.1
Problem
Replicate some of Tom McGrath's AlphaZero work with LeelaChessZero. Use NMF on the activations and try to interpret some of the factors. See visualisations here.
Goal misgeneralisation
Difficulty
C
Number
8.5
Problem
Interpret one of the examples in the goal misgeneralisation papers (Langosco et al and Shah et al). Can you concretely figure out what's going on?
Goal misgeneralisation
Difficulty
C
Number
8.7
Problem
Using 8.5: Possible starting point - CoinRun. Interpreting RL Vision made significant progress and Langosco et al found it was an example of goal misgeneralisation - can you build on these to predict the misgeneralisation?
Difficulty
C
Number
8.10
Problem
Train and interpret a model from the In-Context Reinforcement Learning and Algorithmic Distillation paper. They trained small transformers where they input a sequence of moves for a "novel" RL task and the model outputs sensible answers for that task.
Currently working
10 April 2023: Victor Levoso and others, working on reimplementing AD to try this; there is a channel for it on this Discord: https://discord.gg/cMr5YqbU4y
Interpreting RLHF Transformers
Difficulty
C
Number
8.12
Problem
Can you find any circuits in CarperAI's RLHF model corresponding to longer term planning?
Interpreting RLHF Transformers
Difficulty
C
Number
8.13
Problem
Can you get any traction on interpreting CarperAI's RLHF model's reward model?
Difficulty
C
Number
8.15
Problem
Try training and interpreting a small model from Guez et al. They trained model-free RL agents and showed evidence they spontaneously learned planning. Can you find evidence for/against this?
Difficulty
C
Number
8.19
Problem
Can you interpret a model on a task from 8.16-8.18 using Q-Learning?
Difficulty
C
Number
8.20
Problem
Take an agent trained with RL and train another network to copy the output logits of that agent. Try to reverse engineer the clone. Can you find the resulting circuits in the original?
Difficulty
C
Number
8.21
Problem
Once you've got traction understanding a fully trained agent on a task elsewhere in this category, try to extend this understanding to study it during training. Can you get any insight into what's actually going on?
AlphaZero
Difficulty
D
Number
8.2
Problem
Try applying 8.1 to an open source AlphaZero style Go playing agent
AlphaZero
Difficulty
D
Number
8.3
Problem
Train a small AlphaZero model on a simple game like Tic-Tac-Toe, and try to apply 8.1 there. (Training will be hard! See this tutorial.)
AlphaZero
Difficulty
D
Number
8.4
Problem
Can you extend the work on LeelaZero? Can you find anything about how a feature is computed? Start by looking for features near the start or end of the network.
Interpreting RLHF Transformers
Difficulty
D
Number
8.11
Problem
Go and interpret CarperAI's RLHF model (forthcoming). What's up with that? How is it different from a vanilla language model?
Interpreting RLHF Transformers
Difficulty
D
Number
8.14
Problem
Train a toy RLHF model (1-2 layers) to do a simple task. Use GPT-3 for human data generation. Then try to interpret it. (Note: This will be hard to train, but Neel would be super excited to see the results!) Bonus: Try bigger models like GPT-2 Medium to XL.
Exploring Neuroscope
Difficulty
A
Number
9.1
Problem
Explore random neurons! Use the interactive neuroscope to test and verify your understanding.
Exploring Neuroscope
Difficulty
A
Number
9.2
Problem
Look for interesting conceptual neurons in the middle layers of larger models, like the "numbers that refer to groups of people" neuron.
Exploring Neuroscope
Difficulty
A
Number
9.3
Problem
Look for examples of detokenisation neurons
Exploring Neuroscope
Difficulty
A
Number
9.4
Problem
Look for examples of trigram neurons (consistently activate on a pair of tokens and boost the logit of plausible next tokens)
Exploring Neuroscope
Difficulty
A
Number
9.5
Problem
Look for examples of retokenization neurons
Exploring Neuroscope
Difficulty
A
Number
9.6
Problem
Look for examples of context neurons (eg base64)
Exploring Neuroscope
Difficulty
A
Number
9.7
Problem
Look for neurons that align with any of the feature ideas in 9.13-9.21
Exploring Neuroscope
Difficulty
A
Number
9.10
Problem
How much does the logit attribution of a neuron align with the dataset example patterns? Is it related?
Seeking out specific features
Difficulty
A
Number
9.13
Problem
Basic syntax (Lots of ideas in post)
Seeking out specific features
Difficulty
A
Number
9.14
Problem
Linguistic features (Try using spaCy to automate this) (Lots of ideas in post)
Seeking out specific features
Difficulty
A
Number
9.15
Problem
Proper nouns (Lots of ideas in post)
Seeking out specific features
Difficulty
A
Number
9.16
Problem
Python code features (Lots of ideas in post)
Seeking out specific features
Difficulty
A
Number
9.20
Problem
LaTeX features. Try common commands (\left, \right) and section titles (\abstract, \introduction, etc.)
Seeking out specific features
Difficulty
A
Number
9.23
Problem
Disambiguation neurons - foreign language disambiguation (e.g., "die" in Dutch vs. German vs. Afrikaans)
Seeking out specific features
Difficulty
A
Number
9.24
Problem
Disambiguation neurons - words with multiple meanings (e.g, "bat" as animal or sports equipment)
Seeking out specific features
Difficulty
A
Number
9.25
Problem
Search for memory management neurons (high negative cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
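A search sketch for 9.25/9.26 (assuming TransformerLens and GPT-2 Small): compute the cosine similarity between each neuron's input and output directions and list the most negative (or most positive) ones.
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
for layer in range(model.cfg.n_layers):
    w_in = model.W_in[layer].T    # [d_mlp, d_model]
    w_out = model.W_out[layer]    # [d_mlp, d_model]
    cos = torch.nn.functional.cosine_similarity(w_in, w_out, dim=-1)
    vals, idx = torch.topk(-cos, 5)  # most negative = memory-management candidates
    print(layer, [(i.item(), round(-v.item(), 3)) for i, v in zip(idx, vals)])
```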
Seeking out specific features
Difficulty
A
Number
9.26
Problem
Search for signal boosting neurons (high positive cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
Seeking out specific features
Difficulty
A
Number
9.28
Problem
Can you find split-token neurons? (I.e, " Claire" vs. "Cl" and "aire" - the model should learn to identify the split-token case)
Seeking out specific features
Difficulty
A
Number
9.32
Problem
Neurons which link to attention heads - duplicated token
Curiosities about neurons
Difficulty
A
Number
9.40
Problem
When you look at the max dataset examples for a specific neuron, is that neuron the most activated neuron on the text? What does it look like in general?
Curiosities about neurons
Difficulty
A
Number
9.41
Problem
Look at the distributions of neuron activations (pre and post-activation for GELU, and pre, mid, and post for SoLU). What does this look like? How heavy tailed? How well can it be modelled as a normal distribution?
Curiosities about neurons
Difficulty
A
Number
9.43
Problem
How similar are the distributions between SoLU and GELU?
Curiosities about neurons
Difficulty
A
Number
9.44
Problem
What does the distribution of the LayerNorm scale and softmax denominator in SoLU look like? Is it bimodal (indicating monosemantic features) or fairly smooth and unimodal?
Curiosities about neurons
Difficulty
A
Number
9.52
Problem
Try comparing how monosemantic the neurons in a GELU vs. SoLU model are. Can you replicate the result that SoLU does better? What are the rates for each model?
Miscellaneous
Difficulty
A
Number
9.59
Problem
Can you replicate the results of the interpretability illusion on SoLU models, which were trained on a mix of web text and Python code? (Find neurons that seem monosemantic on either but with importantly different patterns)
Exploring Neuroscope
Difficulty
B
Number
9.8
Problem
Look for examples of neurons with a naive (but incorrect!) initial story that have a much simpler explanation after further investigation
Exploring Neuroscope
Difficulty
B
Number
9.9
Problem
Look for examples of neurons with a naive (but incorrect!) initial story that have a much more complex explanation after further investigation
Exploring Neuroscope
Difficulty
B
Number
9.11
Problem
If you find neurons for 9.10 that seem very inconsistent, can you figure out what's going on?
Exploring Neuroscope
Difficulty
B
Number
9.12
Problem
For dataset examples for neurons in a 1L network, measure how much its pre-activation value comes from the output of each attention head vs. the embedding (vs. positional embedding!). If dominated by specific heads, how much do those heads attend to the tokens you expect?
Seeking out specific features
Difficulty
B
Number
9.17
Problem
From 9.16 - level of indent for a line (harder because it's categorical/numeric)
Seeking out specific features
Difficulty
B
Number
9.18
Problem
From 9.16 - level of bracket nesting (harder because it's categorical/numeric)
Seeking out specific features
Difficulty
B
Number
9.19
Problem
General code features (Lots of ideas in post)
Seeking out specific features
Difficulty
B
Number
9.21
Problem
Features in compiled LaTeX, e.g paper citations
Seeking out specific features
Difficulty
B
Number
9.22
Problem
Any of the more abstract neurons in Multimodal Neurons (e.g., Christmas, sadness, teenager, anime, Pokemon, etc.)
Seeking out specific features
Difficulty
B
Number
9.29
Problem
Can you find examples of neuron families/equivariance? (Ideas in post)
Seeking out specific features
Difficulty
B
Number
9.30
Problem
Neurons which link to attention heads - Induction should NOT trigger (e.g, current token repeated but previous token is not, different copies of current string have different next tokens)
Seeking out specific features
Difficulty
B
Number
9.31
Problem
Neurons which link to attention heads - fixing a skip trigram bug
Seeking out specific features
Difficulty
B
Number
9.33
Problem
Neurons which link to attention heads - splitting "token X is duplicated" into separate features for many common tokens X
Seeking out specific features
Difficulty
B
Number
9.34
Problem
Neurons which represent positional information (not invariant between position). Will need to input data with a random offset to isolate this.
Seeking out specific features
Difficulty
B
Number
9.35
Problem
What is the longest n-gram you can find that seems represented?
Curiosities about neurons
Difficulty
B
Number
9.42
Problem
Do neurons vary in terms of how heavy tailed their distributions are? Does it at all correspond to monosemanticity?
Curiosities about neurons
Difficulty
B
Number
9.45
Problem
Can you find any genuinely monosemantic neurons? That are mostly monosemantic across their entire activation range?
Curiosities about neurons
Difficulty
B
Number
9.46
Problem
Find a feature where GELU is used to calculate it in a way that ReLU couldn't be (e.g, approximating a quadratic)
Curiosities about neurons
Difficulty
B
Number
9.47
Problem
Can you find a feature which seems to be represented by several neurons?
Curiosities about neurons
Difficulty
B
Number
9.48
Problem
Using 9.47 - what happens if you ablate some of the neurons? Is it robust to this? Does it need them all?
Curiosities about neurons
Difficulty
B
Number
9.49
Problem
Can you find a feature that is highly diffuse across neurons? (I.e, represented by the MLP layer but doesn't activate any particular neuron a lot)
Curiosities about neurons
Difficulty
B
Number
9.50
Problem
Look at the direct logit attribution of neurons and find the max dataset examples for this. How similar are the texts to max activating dataset examples?
Curiosities about neurons
Difficulty
B
Number
9.51
Problem
Look at the max negative direct logit attribution. Are there neurons which systematically suppress the correct next token? Can you figure out what's up with these?
Curiosities about neurons
Difficulty
B
Number
9.53
Problem
Using 9.52, can you come up with a better and more robust metric? How consistent is it across reasonable metrics?
Curiosities about neurons
Difficulty
B
Number
9.54
Problem
The GELU and SoLU toy language models were trained with identical initialisation and data shuffle. Is there any correspondence between what neurons represent in each model?
Curiosities about neurons
Difficulty
B
Number
9.55
Problem
If a feature is represented in one of the GELU/SoLU models, how likely is it to be represented in the other?
Curiosities about neurons
Difficulty
B
Number
9.56
Problem
Can you find a neuron whose activation isn't significantly affected by the current token?
Miscellaneous
Difficulty
B
Number
9.57
Problem
An important ability of a network is to attend to things within the current clause or sentence. Are models doing something more sophisticated than distance here, like punctuation? If so, are there relevant neurons/features?
Miscellaneous
Difficulty
B
Number
9.60
Problem
Try doing dimensionality reduction over neuron activations across a bunch of text, and see how interpretable the resulting directions are.
Miscellaneous
Difficulty
B
Number
9.61
Problem
Pick a BERTology paper and try to replicate it on GPT-2! (See post for ideas)
Miscellaneous
Difficulty
B
Number
9.62
Problem
Make a PR to Neuroscope with some feature you wish it had!
Miscellaneous
Difficulty
B
Number
9.63
Problem
Replicate the part of Conjecture's Polytopes paper where they look at the top (e.g.) 1000 dataset examples for a neuron across a ton of text and look for patterns. (Is it the case that there are monosemantic bands in the neuron activation spectrum?)
Seeking out specific features
Difficulty
C
Number
9.27
Problem
Search for neurons that clean up superposition interference.
Seeking out specific features
Difficulty
C
Number
9.36
Problem
Try training linear probes for features from 9.13-9.35.
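A probe-training sketch (assuming TransformerLens and GPT-2 Small; the "current token is a number" feature, the layer choice, and the toy corpus are placeholders): collect residual stream activations, label each token, and fit a linear probe.
```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
texts = ["In 1987 there were 3 cats and 12 dogs.",
         "The quick brown fox jumps over the lazy dog."]  # use a real corpus

acts, labels = [], []
for text in texts:
    _, cache = model.run_with_cache(model.to_tokens(text))
    resid = cache["resid_post", 6][0].detach()  # placeholder layer choice
    str_toks = model.to_str_tokens(text)
    is_number = torch.tensor([t.strip().isdigit() for t in str_toks]).float()
    acts.append(resid)
    labels.append(is_number)

x, y = torch.cat(acts), torch.cat(labels)
probe = torch.nn.Linear(model.cfg.d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for step in range(500):
    loss = torch.nn.functional.binary_cross_entropy_with_logits(probe(x)[:, 0], y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final train loss:", loss.item())
```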
Seeking out specific features
Difficulty
C
Number
9.37
Problem
Using 9.36 - How does your ability to recover features from the residual stream compare to MLP layer outputs vs. attention layer outputs? Can you find features that can only be recovered from some of these?
Seeking out specific features
Difficulty
C
Number
9.38
Problem
Using 9.36 - Are there features that can only be recovered from certain MLP layers?
Seeking out specific features
Difficulty
C
Number
9.39
Problem
Using 9.36 - Are there features that are significantly easier to recover from early layer residual streams and not from later layers?
Miscellaneous
Difficulty
C
Number
9.58
Problem
Replicate Knowledge Neurons in Pretrained Transformers on a generative model. How much are these results consistent with what Neuroscope shows?