Understanding neurons
1.6
Hunt through Neuroscope for the toy models and look for interesting neurons to focus on.
Understanding neurons
1.7
Can you find any polysemantic neurons in Neuroscope? Explore this.
1.23
Choose your own adventure: Take a bunch of text with interesting patterns and run the models over it. Look for tokens they do really well on and try to reverse engineer what's going on!
Circuits in natural language
2.13
Choose your own adventure! Try finding behaviours of your own related to natural language circuits.
Circuits in code models
2.17
Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic flavored tasks should be easiest.
Extensions to IOI paper
2.18
Understand IOI in the Stanford Mistral models. Does the same circuit arise? (You should be able to copy Redwood's code for this almost exactly)
Extensions to IOI paper
2.19
Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?
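A minimal sketch of the kind of experiment 2.19 asks for, assuming TransformerLens: zero-ablate one head's output on an IOI-style prompt and compare the logit difference against the clean run. The prompt and the choice of L7H9 (reported as an S-inhibition head in the IOI paper) are illustrative assumptions.

```python
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When John and Mary went to the store, Mary gave a bottle of milk to"
tokens = model.to_tokens(prompt)
correct = model.to_single_token(" John")
incorrect = model.to_single_token(" Mary")

def logit_diff(logits):
    return (logits[0, -1, correct] - logits[0, -1, incorrect]).item()

print("clean logit diff:", logit_diff(model(tokens)))

# Zero-ablate one head's output; L7H9 is an illustrative S-inhibition head.
LAYER, HEAD = 7, 9

def ablate_head(z, hook):
    # z: [batch, pos, head_index, d_head]
    z[:, :, HEAD, :] = 0.0
    return z

ablated = model.run_with_hooks(
    tokens, fwd_hooks=[(get_act_name("z", LAYER), ablate_head)]
)
print("ablated logit diff:", logit_diff(ablated))
```

To look for backup-style behaviour, cache both runs and compare how the direct logit attributions of the other heads change between them.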
Extensions to IOI paper
2.21
Can we deeply reverse engineer how duplicate token heads work? In particular, the current token is always a trivial copy of itself, so how does the QK circuit detect earlier duplicates without also activating when there are none?
Interpreting Algorithmic Problems
Beginner problems
3.1
Sorting fixed-length lists. (Format: START 4 6 2 9 MID 2 4 6 9)
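A sketch of generating data in that format; the vocabulary (digits plus START/MID separators) and list length are arbitrary choices.

```python
import random

# Hypothetical vocabulary: digits 0-9 plus START and MID separator tokens.
VOCAB = {str(i): i for i in range(10)}
VOCAB["START"], VOCAB["MID"] = 10, 11
LIST_LEN = 4

def make_example():
    xs = [random.randint(0, 9) for _ in range(LIST_LEN)]
    tokens = ["START"] + [str(x) for x in xs] + ["MID"] + [str(x) for x in sorted(xs)]
    return [VOCAB[t] for t in tokens]

# "START 4 6 2 9 MID 2 4 6 9" -> [10, 4, 6, 2, 9, 11, 2, 4, 6, 9]
batch = [make_example() for _ in range(64)]
```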
Interpreting Algorithmic Problems
Beginner problems
3.2
Sorting variable-length lists. (What's the sorting algorithm? What's the longest list you can get it to sort? How does length affect accuracy?)
Interpreting Algorithmic Problems
Beginner problems
3.3
Interpret a 2L MLP (one hidden layer) trained to do modular addition. (Analogous to Neel's grokking work)
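A sketch of the training setup for 3.3, assuming a grokking-style recipe (all pairs mod p = 113, train on a random fraction, heavy weight decay); the architecture sizes and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

P = 113  # modulus used in Neel's grokking work; everything else here is an assumption

class ModAddMLP(nn.Module):
    """One-hidden-layer MLP: embed both inputs, concatenate, hidden layer, readout."""
    def __init__(self, d_embed=128, d_hidden=512):
        super().__init__()
        self.embed = nn.Embedding(P, d_embed)
        self.hidden = nn.Linear(2 * d_embed, d_hidden)
        self.readout = nn.Linear(d_hidden, P)

    def forward(self, a, b):
        x = torch.cat([self.embed(a), self.embed(b)], dim=-1)
        return self.readout(torch.relu(self.hidden(x)))

# All (a, b) pairs with label (a + b) % P; train on a random 30% split.
a, b = torch.meshgrid(torch.arange(P), torch.arange(P), indexing="ij")
a, b, labels = a.flatten(), b.flatten(), (a.flatten() + b.flatten()) % P
train = torch.randperm(P * P)[: int(0.3 * P * P)]

model = ModAddMLP()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
for step in range(20000):
    loss = nn.functional.cross_entropy(model(a[train], b[train]), labels[train])
    opt.zero_grad(); loss.backward(); opt.step()
```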
Interpreting Algorithmic Problems
Beginner problems
3.4
Interpret a 1L MLP trained to do modular subtraction. (Analogous to Neel's grokking work)
Interpreting Algorithmic Problems
Beginner problems
3.5
Taking the minimum or maximum of two ints
Interpreting Algorithmic Problems
Beginner problems
3.6
Permuting lists
Interpreting Algorithmic Problems
Beginner problems
3.7
Calculating sequences with a Fibonacci-style recurrence (predicting the next element from the previous two)
Interpreting Algorithmic Problems
Questions about language models
3.21
Train a 1L attention-only transformer with rotary to predict the previous token and reverse engineer how it does this.
5/7/23: Eric (repo: https://github.com/DKdekes/rotary-interp)
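A sketch of how the model in 3.21 could be configured in TransformerLens; all sizes are arbitrary assumptions.

```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=1,
    d_model=128,
    n_ctx=64,
    d_head=32,
    n_heads=4,
    d_vocab=1000,
    attn_only=True,                      # no MLP layers
    positional_embedding_type="rotary",  # rotary positional embeddings
    seed=0,
)
model = HookedTransformer(cfg)
# Train on sequences of uniformly random tokens with the previous token as the
# target at each position, then inspect the single layer's attention patterns.
```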
Interpreting Algorithmic Problems
Extending Othello-GPT
3.30
Try one of Neel's concrete Othello-GPT projects.
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.1
Does dropout create a privileged basis? Put dropout on the hidden layer of the ReLU output model and study how this changes the results.
Post
14 April 2023: Kunvar (firstuserhere)
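A minimal sketch of the ReLU output model from Toy Models of Superposition with dropout inserted on the hidden layer; the dimensions, sparsity, and uniform feature importance are assumptions.

```python
import torch
import torch.nn as nn

n_features, d_hidden = 20, 5  # arbitrary sizes

class ReLUOutputModel(nn.Module):
    """x -> ReLU(W^T dropout(W x) + b), i.e. the toy model with dropout on the hidden layer."""
    def __init__(self, p_dropout=0.1):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))
        self.dropout = nn.Dropout(p_dropout)

    def forward(self, x):
        h = self.dropout(x @ self.W.T)   # dropout on the hidden layer
        return torch.relu(h @ self.W + self.b)

def make_batch(batch_size=1024, sparsity=0.9):
    # Features uniform in [0, 1], each independently zeroed with probability `sparsity`.
    x = torch.rand(batch_size, n_features)
    return x * (torch.rand(batch_size, n_features) > sparsity).float()

model = ReLUOutputModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10000):
    x = make_batch()
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```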
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.5
Explore neuron superposition by training their absolute value model on functions of multiple variables. Make inputs binary (0/1) and look at the AND and OR of element pairs.
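A sketch of the data for 4.5, assuming fixed disjoint pairs of inputs with AND and OR targets; plug this into the same training loop used for the absolute value model.

```python
import torch

n_features = 10  # arbitrary; inputs are paired off as (0,1), (2,3), ...
pairs = [(2 * i, 2 * i + 1) for i in range(n_features // 2)]

def make_batch(batch_size=1024, p_on=0.2):
    x = (torch.rand(batch_size, n_features) < p_on).float()   # binary 0/1 inputs
    ands = torch.stack([x[:, i] * x[:, j] for i, j in pairs], dim=1)
    ors = torch.stack([torch.clamp(x[:, i] + x[:, j], max=1.0) for i, j in pairs], dim=1)
    return x, torch.cat([ands, ors], dim=1)  # targets: AND then OR of each pair
```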
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.7
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Start by making the feature values 1 when present (i.e., each feature takes two possible values, 0 or 1).
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.10
What happens if you replace ReLUs with GELUs in the toy models?
May 1, 2023 - Kunvar (firstuserhere)
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
4.25
Can you find any examples of the geometric superposition configurations in the residual stream of a language model?
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
4.37
How do the TransformerLens SoLU / GELU models compare in Neuroscope under the SoLU polysemanticity metric? (What fraction of neurons seem monosemantic?)
Analysing Training Dynamics
Understanding fine-tuning
5.16
How does model performance on the original training distribution change when fine-tuning?
Analysing Training Dynamics
Understanding training dynamics in language models
5.25
Look at attention heads on various texts and see if any have recognisable attention patterns, then analyse them over training.
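A sketch of tracking one attention pattern across training, assuming a checkpointed model family supported by TransformerLens (e.g., the Stanford CRFM GPT-2 runs); the checkpoint indices, head, text, and summary statistic are arbitrary.

```python
from transformer_lens import HookedTransformer

text = "The cat sat on the mat because the cat was tired."
LAYER, HEAD = 5, 1  # hypothetical head of interest

for checkpoint_index in [0, 5, 10, 20]:   # arbitrary points in training
    model = HookedTransformer.from_pretrained(
        "stanford-gpt2-small-a", checkpoint_index=checkpoint_index
    )
    _, cache = model.run_with_cache(model.to_tokens(text))
    # pattern: [batch, head_index, query_pos, key_pos]
    pattern = cache["pattern", LAYER][0, HEAD]
    # One example summary statistic: average attention to the previous token.
    print(checkpoint_index, pattern.diagonal(offset=-1).mean().item())
```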
Analysing Training Dynamics
Finding phase transitions
5.26
Look for phase transitions in the Indirect Object Identification task. (Note: This might not have a phase change)
Analysing Training Dynamics
Studying path dependence
5.33
How similar are the outputs of the different Stanford CRFM models on a given text?
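One way to quantify this, sketched below: the per-token KL divergence between the next-token distributions of two differently-seeded runs. The model names assume the Stanford CRFM GPT-2 models as named in TransformerLens.

```python
from transformer_lens import HookedTransformer

model_a = HookedTransformer.from_pretrained("stanford-gpt2-small-a")
model_b = HookedTransformer.from_pretrained("stanford-gpt2-small-b")

text = "The quick brown fox jumps over the lazy dog."
tokens = model_a.to_tokens(text)
log_p_a = model_a(tokens).log_softmax(dim=-1)
log_p_b = model_b(tokens).log_softmax(dim=-1)
kl = (log_p_a.exp() * (log_p_a - log_p_b)).sum(dim=-1)  # KL(a || b) per position
print("mean per-token KL:", kl.mean().item())
```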
Analysing Training Dynamics
Studying path dependence
5.35
Look for Indirect Object Identification capability in other models of approximately the same size.
Analysing Training Dynamics
Studying path dependence
5.38
Can you find some problem where you understand the circuits and Git Re-Basin does work?
Techniques, Tooling, and Automation
Breaking current techniques
6.1
Try to find concrete edge cases where a technique breaks: start by finding a misleading example in a real model, or by training a toy model that contains one.
Techniques, Tooling, and Automation
Breaking current techniques
6.7
Find edge cases where ablations break. (Start with the backup name movers in the IOI circuit, where we know zero ablation breaks.)
Techniques, Tooling, and Automation
ROME activation patching
6.15
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at the logit difference after patching.) How do the results change when you patch single layers?
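A rough single-layer patching sketch (a simplification of the ROME setup, not their code): cache a clean run, patch one MLP layer's output at a single position into a corrupted run, and read off the answer logit. Patching at the final prompt position sidesteps differing prompt lengths; the prompts and the model choice are illustrative.

```python
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2-xl")  # the model ROME studies

clean_prompt = "The Eiffel Tower is located in the city of"
corrupt_prompt = "The Colosseum is located in the city of"
answer = model.to_single_token(" Paris")

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
_, clean_cache = model.run_with_cache(clean_tokens)

POS = -1  # patch at the final prompt token (ROME instead patches at subject positions)

def patch_mlp_out(act, hook):
    act[:, POS, :] = clean_cache[hook.name][:, POS, :]
    return act

for layer in range(model.cfg.n_layers):
    logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(get_act_name("mlp_out", layer), patch_mlp_out)],
    )
    print(layer, logits[0, -1, answer].item())
```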
Techniques, Tooling, and Automation
ROME activation patching
6.16
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at the logit difference after patching.) Can you get anywhere when patching specific neurons?
Techniques, Tooling, and Automation
Automatically find circuits
6.18
Automate ways to find previous token heads. (Bonus: Add to TransformerLens!)
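One way to automate this, sketched below: score every head by its average attention to the immediately preceding token over some text, then threshold. The text and threshold are arbitrary assumptions; the same pattern-based scoring extends to duplicate token and induction heads (6.19, 6.20) with the appropriate key-position offset.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # arbitrary model choice
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog. " * 8)
_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head_index, query_pos, key_pos]
    # Attention from each query position to the immediately preceding position.
    prev_tok_attn = pattern.diagonal(offset=-1, dim1=-2, dim2=-1)
    scores[layer] = prev_tok_attn.mean(dim=(0, -1))

for layer, head in (scores > 0.5).nonzero().tolist():  # threshold is an assumption
    print(f"L{layer}H{head}: previous-token score {scores[layer, head]:.2f}")
```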
Techniques, Tooling, and Automation
Automatically find circuits
6.19
Automate ways to find duplicate token heads. (Bonus: Add to TransformerLens!)
Techniques, Tooling, and Automation
Automatically find circuits
6.20
Automate ways to find induction heads. (Bonus: Add to TransformerLens!)
Techniques, Tooling, and Automation
Automatically find circuits
6.21
Automate ways to find translation heads. (Bonus: Add to TransformerLens!)
Techniques, Tooling, and Automation
Refine max activating dataset examples
6.36
Using 6.28: find the minimal example that activates a neuron by truncating the text. How often does this work?
Techniques, Tooling, and Automation
Refine max activating dataset examples
6.37
Using 6.28: can you replicate the results of the interpretability illusion for Neel's toy models by finding neurons that seem monosemantic on Python code alone and on C4 (web text) alone, but are polysemantic when the two are combined?
Studying Learned Features in Language Models
Exploring Neuroscope
9.1
Explore random neurons! Use the interactive Neuroscope to test and verify your understanding.
Studying Learned Features in Language Models
Exploring Neuroscope
9.2
Look for interesting conceptual neurons in the middle layers of larger models, like the "numbers that refer to groups of people" neuron.
Studying Learned Features in Language Models
Exploring Neuroscope
9.3
Look for examples of detokenisation neurons
Studying Learned Features in Language Models
Exploring Neuroscope
9.4
Look for examples of trigram neurons (consistently activate on a pair of tokens and boost the logit of plausible next tokens)
Studying Learned Features in Language Models
Exploring Neuroscope
9.5
Look for examples of retokenization neurons
Studying Learned Features in Language Models
Exploring Neuroscope
9.6
Look for examples of context neurons (e.g., base64)
Studying Learned Features in Language Models
Exploring Neuroscope
9.7
Look for neurons that align with any of the feature ideas in 9.13-9.21
Studying Learned Features in Language Models
Exploring Neuroscope
9.10
How well does a neuron's logit attribution (its direct effect on the output logits) align with the patterns in its max activating dataset examples? Are the two related?
Studying Learned Features in Language Models
Seeking out specific features
9.13
Basic syntax (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
9.14
Linguistic features (Try using spaCy to automate this) (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
9.15
Proper nouns (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
9.16
Python code features (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
9.20
LaTeX features. Try common commands (\left, \right) and section titles (\abstract, \introduction, etc.)
Studying Learned Features in Language Models
Seeking out specific features
9.23
Disambiguation neurons - foreign language disambiguation (e.g., "die" in Dutch vs. German vs. Afrikaans)
Studying Learned Features in Language Models
Seeking out specific features
9.24
Disambiguation neurons - words with multiple meanings (e.g., "bat" as an animal or sports equipment)
Studying Learned Features in Language Models
Seeking out specific features
9.25
Search for memory management neurons (high negative cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
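A sketch of the search: for every MLP neuron, compute the cosine similarity between its input direction (a column of W_in) and its output direction (a row of W_out), and inspect the most negative ones; the most positive ones are the candidates for 9.26.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # arbitrary model choice

for layer in range(model.cfg.n_layers):
    w_in = model.W_in[layer]    # [d_model, d_mlp]; column i is neuron i's input direction
    w_out = model.W_out[layer]  # [d_mlp, d_model]; row i is neuron i's output direction
    cos = torch.nn.functional.cosine_similarity(w_in.T, w_out, dim=-1)  # [d_mlp]
    vals, idx = torch.topk(cos, k=5, largest=False)  # most negative similarities
    print(layer, list(zip(idx.tolist(), [round(v, 3) for v in vals.tolist()])))
```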
Studying Learned Features in Language Models
Seeking out specific features
9.26
Search for signal boosting neurons (high positive cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
Studying Learned Features in Language Models
Seeking out specific features
9.28
Can you find split-token neurons? (I.e., " Claire" vs. "Cl" + "aire" - the model should learn to identify the split-token case)
Studying Learned Features in Language Models
Seeking out specific features
9.32
Neurons which link to attention heads: duplicate token heads
Studying Learned Features in Language Models
Curiosities about neurons
9.40
When you look at the max dataset examples for a specific neuron, is that neuron the most activated neuron on the text? What does it look like in general?
Studying Learned Features in Language Models
Curiosities about neurons
9.41
Look at the distributions of neuron activations (pre and post-activation for GELU, and pre, mid, and post for SoLU). What does this look like? How heavy tailed? How well can it be modelled as a normal distribution?
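A sketch of collecting the relevant activations with TransformerLens; the model, layer, text, and summary statistics are arbitrary choices, and for SoLU models there is an additional "mid" hook between the SoLU and its LayerNorm.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog. " * 20)
_, cache = model.run_with_cache(tokens)

layer = 0
pre = cache["pre", layer].flatten()    # neuron activations before the nonlinearity
post = cache["post", layer].flatten()  # neuron activations after the nonlinearity

for name, acts in [("pre", pre), ("post", post)]:
    kurtosis = ((acts - acts.mean()) ** 4).mean() / acts.var() ** 2
    print(name, "mean", acts.mean().item(), "std", acts.std().item(),
          "kurtosis", kurtosis.item(), "(3 for a Gaussian)")
```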
Studying Learned Features in Language Models
Curiosities about neurons
9.43
How similar are the distributions between SoLU and GELU?
Studying Learned Features in Language Models
Curiosities about neurons
9.44
What does the distribution of the LayerNorm scale and softmax denominator in SoLU look like? Is it bimodal (indicating monosemantic features) or fairly smooth and unimodal?
Studying Learned Features in Language Models
Curiosities about neurons
9.52
Try comparing how monosemantic the neurons in a GELU vs. SoLU model are. Can you replicate the result that SoLU does better? What are the rates for each model?
Studying Learned Features in Language Models
Miscellaneous
9.59
Can you replicate the results of the interpretability illusion on SoLU models, which were trained on a mix of web text and Python code? (Find neurons that seem monosemantic on each distribution alone, but with importantly different patterns.)