Understanding neurons
1.1
How far can you get deeply reverse engineering a neuron in a 1L model? 1L models are particularly easy, since each neuron's output adds directly to the logits.
Understanding neurons
1.2
Find an interesting neuron you think represents a feature. Can you fully reverse engineer which direction should activate that feature, and compare to neuron input direction?
Understanding neurons
1.3
Look for trigram neurons in a 1L model and try to reverse engineer them (e.g. "ice cream" -> "sundae").
Understanding neurons
1.4
Check out the SoLU paper for more ideas on 1L neurons to find and reverse engineer.
Understanding neurons
1.8
Are there neurons whose behaviour can be matched by a regex or other code? If so, run it on a ton of text and compare the output.
How do larger models differ?
1.9
How do 3-layer and 4-layer attention-only models differ from 2L? (For instance, induction heads only appeared with 2L. Can you find something useful that only appears at 3L or higher?)
How do larger models differ?
1.10
How do 3-layer and 4-layer attention-only models differ from 2L? Look at composition scores - try to identify pairs of heads that compose a lot.
How do larger models differ?
1.11
How do 3-layer and 4-layer attention-only models differ from 2L? Look for evidence of composition.
How do larger models differ?
1.12
How do 3-layer and 4-layer attention-only models differ from 2L? Ablate a single head and run the model on a lot of text. Look at the change in performance. Do any heads matter a lot that aren't induction heads?
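As a starting point, here is a minimal sketch of zero-ablating a single head in TransformerLens and comparing the loss; the model name, text, and choice of head are arbitrary placeholders.

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("attn-only-3l")  # placeholder model
tokens = model.to_tokens("Paste a long chunk of natural text here.")
LAYER, HEAD = 1, 4  # arbitrary head to ablate

def zero_ablate_head(z, hook):
    # z has shape [batch, pos, head_index, d_head]; zero out one head's output.
    z[:, :, HEAD, :] = 0.0
    return z

clean_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("z", LAYER), zero_ablate_head)],
)
print(f"clean loss {clean_loss.item():.4f}, ablated loss {ablated_loss.item():.4f}")
```

Induction heads should show a large loss increase on repeated text; the interesting finds are heads that matter a lot on normal text without being induction heads.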
How do larger models differ?
1.13
Look for tasks that an n-layer model can't do, but an n+1-layer model can, and look for a circuit that explains this. (Start by running both models on a bunch of text and looking for per-token probability differences.)
How do larger models differ?
1.14
How do 1L SoLU/GELU models differ from 1L attention-only?
How do larger models differ?
1.15
How do 2L SoLU models differ from 1L?
How do larger models differ?
1.16
How does 1L GELU differ from 1L SoLU?
How do larger models differ?
1.17
Analyse how a larger model "fixes the bugs" of a smaller model.
How do larger models differ?
1.18
Does a 1L MLP transformer fix the skip trigram bugs of a 1L Attn Only model? If so, how?
How do larger models differ?
1.19
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Try looking at split-token induction, where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens. E.g " Claire" vs. "Cl" "aire"
How do larger models differ?
1.20
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at misfiring when the previous token appears multiple times with different following tokens
How do larger models differ?
1.21
Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at stopping induction on a token that likely shows the end of a repeated string (e.g, . or ! or ")
How do larger models differ?
1.22
Does a 2L MLP model fix these bugs (1.19-1.21) too?
Circuits in natural language
2.1
Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?
Circuits in natural language
2.2
Continuing sequences that are common in natural language (E.g, "1 2 3 4" -> "5", "Monday\nTuesday\n" -> "Wednesday").
I did some preliminary work on this during a hackathon this July, and found components shared between sequence continuation tasks, such as head 9.1, which was found to output the "next member" of the sequence. The work was rushed and crude, but I am looking to polish and continue it in the future. A link to it can be found here: Pablo Hansen - April 18, 2024
Circuits in natural language
2.3
A harder example would be numbers at the start of lines, like "1. Blah blah blah \n2. Blah blah blah\n"-> "3". Feels like it must be doing something induction-y!
Circuits in natural language
2.4
3 letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> RFU
Circuits in natural language
2.5
Converting names to emails, like "Katy Johnson <" -> "katy_johnson"
Circuits in natural language
2.8
Learning that words after full stops start with capital letters.
Circuits in natural language
2.9
Counting objects described in text. (E.g, I picked up an apple, a pear, and an orange. I was holding three fruits.)
Circuits in natural language
2.11
Reverse engineer an induction head in a non-toy model.
Circuits in natural language
2.12
Choosing the right pronouns (E.g, "Lina is a great friend, isn't")
Alana Xiang - 5 May 2023
Circuits in code models
2.14
Closing brackets. Bonus: Tracking correct brackets - [, (, {, etc.
Circuits in code models
2.15
Closing HTML tags
Extensions to IOI paper
2.20
Is there a general pattern for backup-ness? (Follows 2.19)
Manan Suri - 14 July, 2023
Extensions to IOI paper
2.22
Understand IOI in GPT-Neo: same size as GPT-2 Small, but it seems to do IOI via MLP composition.
Extensions to IOI paper
2.25
GPT-Neo wasn't trained with dropout - check 2.24 on this.
Extensions to IOI paper
2.26
Reverse engineering L4H11, a really sharp previous token head in GPT-2-small, at the parameter level.
Confusing things
2.29
Why do models have so many induction heads? How do they specialise, and why does the model need so many?
Confusing things
2.30
Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?
Confusing things
2.31
Can we find evidence of the residual stream as shared bandwidth hypothesis?
Confusing things
2.32
Can we find evidence of the residual stream as shared bandwidth hypothesis? In particular, the idea that the model dedicates parameters to memory management and cleaning up memory once it's used. Are there neurons with high negative cosine sim between their input and output weights (so the output erases the input feature)? Do they correspond to cleaning up specific features?
Confusing things
2.33
What happens to the memory in an induction circuit? (See 2.32)
Interpreting Algorithmic Problems
Harder problems
3.8
5-digit addition/subtraction.
Interpreting Algorithmic Problems
Harder problems
3.9
Predicting the output of a simple code function. E.g, problems like "a = 1 2 3. a[2] = 4. a -> 1 2 4"
Interpreting Algorithmic Problems
Harder problems
3.10
Graph theory problems like this. Unsure of the correct input format. Try a bunch. See here
Interpreting Algorithmic Problems
Harder problems
3.11
Train a model on multiple algorithmic tasks we understand (like modular addition and subtraction). Compare to a model trained on each task. Does it learn the same circuits? Is there superposition?
Joshua; jhdhill@uwaterloo.ca; Jan 31, 2024
Interpreting Algorithmic Problems
Harder problems
3.12
Train models for automata tasks and interpret them. Do your results match the theory?
Interpreting Algorithmic Problems
Harder problems
3.13
In-Context Linear Regression - the transformer gets a sequence (x_1, y_1, x_2, y_2, ...) where y_i = Ax_i + b. A and b are different for each prompt, and need to be learned in-context. (Code here)
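A minimal sketch of a data generator for this setup (the shapes and hyperparameters below are arbitrary, not taken from the linked code):

```python
import torch

def icl_regression_batch(batch=64, n_points=16, d=4, noise=0.0):
    """Prompts (x_1, y_1, ..., x_n, y_n) with y_i = A x_i + b, where A and b
    are sampled fresh for each prompt and must be inferred in-context."""
    A = torch.randn(batch, d, d)
    b = torch.randn(batch, d)
    x = torch.randn(batch, n_points, d)
    y = torch.einsum("bij,bpj->bpi", A, x) + b[:, None, :]
    if noise > 0:
        y = y + noise * torch.randn_like(y)
    # Interleave x and y along the sequence dimension: x_1, y_1, x_2, y_2, ...
    return torch.stack([x, y], dim=2).reshape(batch, 2 * n_points, d)
```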
Interpreting Algorithmic Problems
Harder problems
3.16
Predict repeated subsequences in randomly generated tokens, and see if you can find and reverse engineer induction heads.
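A minimal sketch of a data generator for this task (sequence lengths and vocab size are arbitrary):

```python
import torch

def repeated_subsequence_batch(batch=32, seq_len=64, sub_len=8, vocab=500, seed=0):
    """Random tokens with one subsequence from the first half repeated later on."""
    g = torch.Generator().manual_seed(seed)
    tokens = torch.randint(0, vocab, (batch, seq_len), generator=g)
    for b in range(batch):
        start = torch.randint(0, seq_len // 2 - sub_len, (1,), generator=g).item()
        repeat_at = torch.randint(seq_len // 2, seq_len - sub_len, (1,), generator=g).item()
        tokens[b, repeat_at:repeat_at + sub_len] = tokens[b, start:start + sub_len]
    return tokens
```

An induction head trained on this data should show a clear attention stripe from each repeated token back to the token following its earlier occurrence.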
Interpreting Algorithmic Problems
3.18
Build a toy model of Indirect Object Identification - train a tiny attention-only model on an algorithmic task simulating IOI - and reverse-engineer the learned solution. Compare it to the circuit found in GPT-2 Small.
Interpreting Algorithmic Problems
Questions about language models
3.22
Train a 3L attention-only transformer to perform the Indirect Object Identification task. Can it do the task? Does it learn the same circuit found in GPT-2 Small?
Interpreting Algorithmic Problems
Questions about language models
3.23
Redo Neel's modular addition analysis with GELU. Does it change things?
Interpreting Algorithmic Problems
Questions about language models
3.26
In modular addition, look at what different dimensionality reduction techniques do on different weight matrices. Can you identify which weights matter most? Which neurons form clusters for each frequency? Anything from activations?
Interpreting Algorithmic Problems
Extending Othello-GPT
3.32
Neuron Interpretability and Studying Superposition - try to understand the model's MLP neurons, and explore what techniques do and don't work. Try to build our understanding of transformer MLPs in general.
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.2
Replicate their absolute value model and study some of the variants of the ReLU output models.
May 4, 2023 - Kunvar (firstuserhere)
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.3
Explore neuron superposition by training their absolute value model on a more complex function like x -> x^2.
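A minimal sketch of one way to set this up, assuming the absolute value model's architecture (sparse inputs, a narrow ReLU hidden layer, per-feature targets) with the target swapped from |x| to x^2; all hyperparameters are arbitrary:

```python
import torch
import torch.nn as nn

n_features, n_hidden, sparsity = 100, 40, 0.99

class ToyNeuronModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(n_features, n_hidden)
        self.dec = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return torch.relu(self.dec(torch.relu(self.enc(x))))

model = ToyNeuronModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10_000):
    # Sparse features: each is nonzero with probability (1 - sparsity), uniform in [-1, 1].
    mask = (torch.rand(1024, n_features) > sparsity).float()
    x = (2 * torch.rand(1024, n_features) - 1) * mask
    loss = ((model(x) - x ** 2) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.5f}")
```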
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.4
What happens to their ReLU output model when there's non-uniform sparsity? E.g, one class of less sparse features and another of very sparse features.
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.6
Explore neuron superposition by training their absolute value model on functions of multiple variables. Keep the inputs as uniform reals in [0, 1] and look at max(x, y)
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.8
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features discrete (1, 2, 3)
Exploring Polysemanticity and Superposition
Confusions to study in Toy Models of Superposition
4.9
Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features uniform [0.5, 1]
April 30, 2023; Kunvar (firstuserhere)
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
4.21
Induction heads copy the token they attend to into the output, which involves storing which of ~50,000 tokens it is. How is this stored in a 64-dimensional space?
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
4.22
How does the previous token head in an induction circuit communicate the value of the previous token to the key of the induction head? Bonus: What residual stream subspace does it take up? Is there interference?
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
4.23
How does the IOI circuit communicate names/positions between composing heads?
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
4.24
Are there dedicated dimensions for positional embeddings? Do any other components write to those dimensions?
Exploring Polysemanticity and Superposition
Studying neuron superposition in real models
4.29
Look at a polysemantic neuron in a 1L language model. Can you figure out how the model disambiguates what feature it is?
Exploring Polysemanticity and Superposition
Studying neuron superposition in real models
4.31
Take a feature that's part of a polysemantic neuron in a 1L language model and try to identify every neuron that represents that feature. Is it sparse or diffuse?
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
4.38
Can you find any better metrics for polysemanticity?
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
4.39
The paper speculates LayerNorm lets the model "smuggle through" superposition in SoLU models by smearing features across many dimensions and letting LayerNorm scale it up. Can you find evidence of this?
Exploring Polysemanticity and Superposition
Comparing SoLU/GELU
4.40
How similar are the neurons between SoLU/GELU models of the same layers?
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
5.1
Understanding why 5 digit addition has a phase change per digit (so 6 total?!)
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
5.3
Look at the PCA of logits on the full dataset, or the PCA of a stack of flattened weights. If you plot a scatter plot of the first 2 components, the different phases of training are clearly visible. What's up with this?
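A minimal sketch of the flattened-weights version, assuming you saved a flattened copy of the parameters at each checkpoint during training (the `snapshots` list below is placeholder random data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder: one flattened parameter vector per checkpoint, saved during training.
snapshots = [np.random.randn(10_000) for _ in range(200)]

X = np.stack(snapshots)                     # [n_checkpoints, n_params]
pcs = PCA(n_components=2).fit_transform(X)  # project onto the first two components

plt.scatter(pcs[:, 0], pcs[:, 1], c=np.arange(len(snapshots)), cmap="viridis")
plt.colorbar(label="checkpoint index")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```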
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
5.6
What happens if we include in the loss one of the progress measures in Neel's grokking post? Can we accelerate or stop grokking?
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
5.7
Adam Jermyn provides an analytical argument and some toy models for why phase transitions should be an inherent part of (some of) how models learn. Can you find evidence of this in more complex models?
Analysing Training Dynamics
Algorithmic tasks - understanding grokking
5.8
Build on and refine Adam Jermyn's arguments and toy models - think about how they deviate from a real transformer, and build more faithful models.
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
5.9
For a toy model trained to form induction heads, is there a lottery-ticket style thing going on? Can you disrupt induction head formation by messing with the initialisation?
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
5.11
If we identify the parameters that form important circuits at the end of training on some toy task, then knock them out at the start of training, how much does that delay/stop generalisation?
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
5.12
Analysing how pairs of heads in an induction circuit compose over time - Can you find progress measures which predict these?
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
5.13
Analysing how pairs of heads in an induction circuit compose over time - Can we predict which heads will learn to compose first?
Analysing Training Dynamics
Algorithmic tasks - lottery tickets
5.14
Analysing how pairs of heads in an induction circuit compose over time - Does the composition develop as a phase transition?
Analysing Training Dynamics
Understanding fine-tuning
5.17
How is the model different on fine-tuned text? Look at examples where the model does much better after fine-tuning, and some normal text.
Analysing Training Dynamics
Understanding fine-tuning
5.18
Try activation patching between the old and fine-tuned model and see how hard recovering performance is.
Analysing Training Dynamics
Understanding fine-tuning
5.19
Look at max activating text for various neurons in the original models. How has it changed post fine-tuning?
Analysing Training Dynamics
Understanding fine-tuning
5.20
Explore further and see what's going on with fine-tuning mechanistically.
Analysing Training Dynamics
Understanding training dynamics in language models
5.22
Can you replicate the induction head phase transition results in the various checkpointed models in TransformerLens? (If code works for attn-only-2l it should work for them all)
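A minimal sketch of one way to measure this, assuming the checkpointed attn-only-2l model and TransformerLens' checkpoint_index argument; the induction score here is the average attention from each token in a repeated random sequence back to the token after its previous occurrence, and the checkpoint indices are arbitrary:

```python
import torch
from transformer_lens import HookedTransformer

def induction_score(model, batch=4, seq_len=50, seed=0):
    torch.manual_seed(seed)
    rep = torch.randint(100, model.cfg.d_vocab, (batch, seq_len))
    tokens = torch.cat([rep, rep], dim=1)  # repeated random tokens
    _, cache = model.run_with_cache(tokens)
    scores = []
    for layer in range(model.cfg.n_layers):
        pattern = cache["pattern", layer]  # [batch, head, dest, src]
        # Induction heads attend from position i back to position i - (seq_len - 1).
        stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
        scores.append(stripe.mean(dim=(0, -1)))  # average over batch and positions
    return torch.stack(scores)  # [layer, head]

for idx in [0, 10, 20, 40]:
    model = HookedTransformer.from_pretrained("attn-only-2l", checkpoint_index=idx)
    print(idx, induction_score(model).max().item())
```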
Analysing Training Dynamics
Understanding training dynamics in language models
5.23
Look at the neurons in TransformerLens SoLU models during training. Do they tend to form as a phase transition?
Analysing Training Dynamics
Finding phase transitions
5.27
Try digging into the specific heads that act on IOI and look for phase transitions. Use direct logit attribution for the name movers.
Analysing Training Dynamics
Finding phase transitions
5.28
Study the attention patterns of each category of heads in IOI for phase transitions.
Analysing Training Dynamics
Finding phase transitions
5.29
Look for phase transitions in simple IOI-style algorithmic tasks, like few-shot learning, addition, sorting words alphabetically...
Analysing Training Dynamics
Finding phase transitions
5.30
Look for phase transitions in soft induction heads like translation.
Analysing Training Dynamics
Studying path dependence
5.34
How much do the Stanford CRFM models differ with algorithmic tasks like Indirect Object Identification?
Analysing Training Dynamics
Studying path dependence
5.36
When model scale varies (e.g, GPT-2 small vs. medium) is there anything the smaller model can do that the larger one can't do? (Look at difference in per token log prob)
Analysing Training Dynamics
Studying path dependence
5.37
Try applying the Git Re-Basin techniques to a 2L MLP trained for modular addition. Does this work? If you use Neel's grokking work to analyse the circuits involved, how does the re-basin technique map onto the circuits?
Techniques, Tooling, and Automation
Breaking current techniques
6.2
Break direct logit attribution - start by looking at GPT-Neo Small where the logit lens (precursor to direct logit attribution) seems to work badly, but works well if you include the final layer and the unembed.
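For reference, a minimal sketch of the logit-lens-style calculation in TransformerLens, shown on GPT-2 Small (swapping in GPT-Neo is where the interesting breakage should appear); the prompt is arbitrary:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")
logits, cache = model.run_with_cache(tokens)

# Residual stream after each layer, with the final LayerNorm applied so it can
# be unembedded directly.
resid_stack, labels = cache.accumulated_resid(
    layer=-1, incl_mid=False, apply_ln=True, return_labels=True
)
layer_logits = resid_stack @ model.W_U + model.b_U  # [component, batch, pos, d_vocab]

for label, ll in zip(labels, layer_logits):
    top_token = ll[0, -1].argmax().item()
    print(label, repr(model.to_single_str_token(top_token)))
```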
Techniques, Tooling, and Automation
Breaking current techniques
6.4
Find edge cases where linearising LayerNorm breaks. See some work by Eric Winsor at Conjecture.
Techniques, Tooling, and Automation
Breaking current techniques
6.5
Find edge cases where activation patching breaks. (It should break when you patch one variable but the behaviour depends on multiple variables.)
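For reference, a minimal sketch of single-site activation patching on an IOI-style prompt pair; the prompts, the choice of residual stream site, and patching only the final position are all arbitrary choices:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
answer = model.to_single_token(" Mary")

_, clean_cache = model.run_with_cache(clean)

def patch_final_pos(resid, hook):
    # Overwrite the corrupted run's residual stream at the final position
    # with the cached clean activation.
    resid[:, -1, :] = clean_cache[hook.name][:, -1, :]
    return resid

for layer in range(model.cfg.n_layers):
    patched_logits = model.run_with_hooks(
        corrupt,
        fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_final_pos)],
    )
    print(layer, patched_logits[0, -1, answer].item())
```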
Techniques, Tooling, and Automation
Breaking current techniques
6.8
Can you find places where one ablation (zero, mean, random) breaks but the others don't?
Techniques, Tooling, and Automation
Breaking current techniques
6.9
Find edge cases where composition scores break. (They don't work well for the IOI circuit)
Techniques, Tooling, and Automation
Breaking current techniques
6.10
Find edge cases where eigenvalue copying scores break.
Techniques, Tooling, and Automation
6.12
Try looking for composition on a specific input. Decompose the residual stream into the sum of outputs of previous heads, then decompose query, key, value into sums of terms from each previous head. Are any larger than the others / matter more if you ablate them / etc?
Techniques, Tooling, and Automation
6.14
Compare causal tracing to activation patching. Do they give the same outputs? Can you find situations where one breaks and the other doesn't? (Try IOI task or factual recall task)
Techniques, Tooling, and Automation
ROME activation patching
6.17
In the ROME paper, they do activation patching by patching over the outputs of 10 adjacent MLP or attention layers. (Look at logit difference after patching). Can you get anywhere when patching some set of neurons? (E.g, the neurons that activate the most within the 10 layers?)
Techniques, Tooling, and Automation
Automatically find circuits
6.22
Automate ways to find few shot learning heads. (Bonus: Add to TransformerLens!)
Techniques, Tooling, and Automation
Automatically find circuits
6.23
Can you find an automated way to detect pointer arithmetic based induction heads vs. classic induction heads?
Techniques, Tooling, and Automation
Automatically find circuits
6.24
Can you find an automated way to detect the heads used in the IOI Circuit? (S-inhibition, name mover, negative name mover, backup name mover)
Techniques, Tooling, and Automation
Automatically find circuits
6.25
Can you automate detection of the heads used in factual recall to move information about the fact to the final token? (Try activation patching)
Techniques, Tooling, and Automation
Automatically find circuits
6.26
(Infrastructure) Combine some of the head detectors from 6.18-6.25 to make a "wiki" for a range of models, with information and scores for each head for how it falls into different categories. MVP: Pandas Dataframes with a row for each head and a column for each metric.
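A minimal sketch of the Pandas MVP, with a single crude metric (average attention to the previous token on one sample text) standing in for the full set of detectors from 6.18-6.25:

```python
import pandas as pd
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog. " * 10)
_, cache = model.run_with_cache(tokens)

rows = []
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, dest, src]
    # Average attention paid to the immediately preceding token, per head.
    prev_score = pattern.diagonal(offset=-1, dim1=-2, dim2=-1).mean(dim=(0, -1))
    for head in range(model.cfg.n_heads):
        rows.append({"layer": layer, "head": head, "prev_token_score": prev_score[head].item()})

head_wiki = pd.DataFrame(rows)
print(head_wiki.sort_values("prev_token_score", ascending=False).head(10))
```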
Techniques, Tooling, and Automation
Refine max activating dataset examples
6.30
Using 6.28: Corrupt different token embeddings in a sequence to see which matter.
Techniques, Tooling, and Automation
Refine max activating dataset examples
6.31
Using 6.28: Compare to randomly chosen directions in neuron activation space to see how clustered/monosemantic things seem.
Techniques, Tooling, and Automation
Refine max activating dataset examples
6.32
Using 6.28: Validate these by comparing to the direct effect of the neuron on the logits, or the output vocab logits most boosted by that neuron.
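A minimal sketch of the second check, assuming a 1L SoLU model; LAYER and NEURON are arbitrary placeholders and the final LayerNorm's scaling is ignored:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("solu-1l")
LAYER, NEURON = 0, 123  # arbitrary neuron

# W_out[layer] is [d_mlp, d_model]; project the neuron's output direction through
# the unembedding to see which vocab logits it boosts most when it fires.
logit_effect = model.W_out[LAYER, NEURON] @ model.W_U  # [d_vocab]
top = torch.topk(logit_effect, 10)
print([model.to_single_str_token(i) for i in top.indices.tolist()])
```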
Techniques, Tooling, and Automation
Refine max activating dataset examples
6.33
Using 6.28: Use a model like GPT-3 to find similar text to an existing example and see if they also activate the neuron. Bonus: Use them to replace specific tokens.
Techniques, Tooling, and Automation
Refine max activating dataset examples
6.34
Using 6.28: Look at dataset examples at different quantiles for neuron activations (25%, 50%, 75%, 90%, 95%). Does that change anything?
Techniques, Tooling, and Automation
Refine max activating dataset examples
6.38
Using 6.28: In SoLU models, compare max activating results for pre-SoLU, post-SoLU, and post LayerNorm activations. ('pre', 'mid', 'post' in TransformerLens). How consistent are they? Does one seem more principled?
Techniques, Tooling, and Automation
Interpreting models with LLM's
6.39
Can GPT-3 figure out trends in max activating examples for a neuron?
Techniques, Tooling, and Automation
Interpreting models with LLM's
6.40
Can you use GPT-3 to generate counterfactual prompts with lined up tokens to do activation patching on novel problems? (E.g, "John gave a bottle of milk to -> Mary" vs. "Mary gave a bottle of milk to -> John")
Techniques, Tooling, and Automation
Apply techniques from non-mechanistic interpretability
6.42
How well does feature attribution work on circuits we understand?
Techniques, Tooling, and Automation
6.48
Resolve some of the open issues/feature requests for TransformerLens.
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.50
Using 6.49, run the two models on a bunch of text and look at the biggest per-token log prob difference.
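A minimal sketch of this comparison, assuming two TransformerLens toy models that share a tokenizer (e.g. the 2L SoLU and GELU models); the text is a placeholder:

```python
from transformer_lens import HookedTransformer

model_a = HookedTransformer.from_pretrained("solu-2l")
model_b = HookedTransformer.from_pretrained("gelu-2l")

text = "Paste a chunk of text from the training distribution here."
tokens = model_a.to_tokens(text)

def per_token_log_probs(model, tokens):
    log_probs = model(tokens).log_softmax(dim=-1)  # [batch, pos, d_vocab]
    # Log prob each position assigns to the actual next token.
    return log_probs[0, :-1].gather(-1, tokens[0, 1:, None]).squeeze(-1)

diff = per_token_log_probs(model_a, tokens) - per_token_log_probs(model_b, tokens)
for tok, d in zip(model_a.to_str_tokens(text)[1:], diff.tolist()):
    print(f"{tok!r}: {d:+.3f}")
```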
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.51
Using 6.49, run them on various benchmarks and compare performance.
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.52
Using 6.49, try "benchmarks" like performing algorithmic tasks like IOI, acronyms, etc. as from Circuits In the Wild.
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.53
Using 6.49, try qualitative exploration, like just generating text from the models and looking for ideas.
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.54
Build tooling to take the diff of two models with the same internal structure. Includes 6.49 but also lets you compare model internals!
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.55
Using 6.54, look for the largest difference in weights.
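A minimal sketch for the weight comparison, assuming two models with identical architecture loaded into TransformerLens (the second from_pretrained call is a placeholder for however you load the variant model):

```python
from transformer_lens import HookedTransformer

model_a = HookedTransformer.from_pretrained("gpt2")
model_b = HookedTransformer.from_pretrained("gpt2")  # placeholder for the second model

diffs = []
for (name, p_a), (_, p_b) in zip(model_a.named_parameters(), model_b.named_parameters()):
    # Relative Frobenius-norm difference per parameter tensor.
    rel = ((p_a - p_b).norm() / (p_a.norm() + 1e-8)).item()
    diffs.append((rel, name))

for rel, name in sorted(diffs, reverse=True)[:10]:
    print(f"{name}: relative diff {rel:.4f}")
```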
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.56
Using 6.54, run them on a bunch of text and look for largest difference in activations.
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.57
Using 6.54, look at the direct logit attribution of layers and heads on various texts, and look for the biggest differences.
Techniques, Tooling, and Automation
Taking the "diff" of two models
6.58
Using 6.54, do activation patching on a piece of text where one model does much better than the other - are some parts key to improved performance?
Image Model Interpretability
Building on Circuits thread
7.7
Look for equivariance in late layers of vision models, i.e. symmetries in a network with analogous families of neurons. This likely looks like hunting through Microscope.
Image Model Interpretability
Building on Circuits thread
7.9
Look for a wide array of circuits using the weight explorer. What interesting patterns and motifs can you find?
Image Model Interpretability
Multimodal models (CLIP interpretability)
7.10
Look at the weights connecting neurons in adjacent layers. How sparse are they? Are there any clear patterns where one neuron is constructed from previous ones?
Image Model Interpretability
Multimodal models (CLIP interpretability)
7.13
Can you refine the technique for generating max activating text strings? Could it be applied to language models?
Image Model Interpretability
7.15
Does activation patching work on Inception?
Image Model Interpretability
Diffusion models
7.16
Apply feature visualisation to neurons in diffusion models and see if any seem clearly interpretable.
Image Model Interpretability
Diffusion models
7.17
Are there style transfer neurons in diffusion models? (E.g, activating on "in the style of Thomas Kinkade")
Image Model Interpretability
Diffusion models
7.18
Are different circuits activating when different amounts of noise are input in diffusion models?
Interpreting Reinforcement Learning
Goal misgeneralisation
8.6
Using 8.5: Possible starting points are Tree Gridworld and Monster Gridworld from Shah et al.
Interpreting Reinforcement Learning
Decision Transformers
8.8
Can you apply transformer circuits to a decision transformer? What do you find?
Interpreting Reinforcement Learning
Decision Transformers
8.9
Try training a 1L decision transformer on a toy problem, like finding the shortest path in a graph.
Interpreting Reinforcement Learning
Interpreting policy gradients
8.16
Can you interpret a small model trained with policy gradients on a gridworld task?
Interpreting Reinforcement Learning
Interpreting policy gradients
8.17
Can you interpret a small model trained with policy gradients on an OpenAI Gym task?
Interpreting Reinforcement Learning
Interpreting policy gradients
8.18
Can you interpret a small model trained with policy gradients on an Atari game (e.g, Pong)?
Interpreting Reinforcement Learning
8.22
Choose your own adventure! There's lots of work in RL - pick something you're excited about and try to reverse engineer something!
Studying Learned Features in Language Models
Exploring Neuroscope
9.8
Look for examples of neurons with a naive (but incorrect!) initial story that have a much simpler explanation after further investigation
Studying Learned Features in Language Models
Exploring Neuroscope
9.9
Look for examples of neurons with a naive (but incorrect!) initial story that have a much more complex explanation after further investigation
Studying Learned Features in Language Models
Exploring Neuroscope
9.11
If you find neurons for 9.10 that seem very inconsistent, can you figure out what's going on?
Studying Learned Features in Language Models
Exploring Neuroscope
9.12
For dataset examples for neurons in a 1L network, measure how much of the neuron's pre-activation value comes from the output of each attention head vs. the embedding (vs. the positional embedding!). If it's dominated by specific heads, how much do those heads attend to the tokens you expect?
Studying Learned Features in Language Models
Seeking out specific features
9.17
From 9.16 - level of indent for a line (harder because it's categorical/numeric)
Studying Learned Features in Language Models
Seeking out specific features
9.18
From 9.16 - level of bracket nesting (harder because it's categorical/numeric)
Studying Learned Features in Language Models
Seeking out specific features
9.19
General code features (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
9.21
Features in compiled LaTeX, e.g. paper citations
Studying Learned Features in Language Models
Seeking out specific features
9.22
Any of the more abstract neurons in Multimodal Neurons (e.g. Christmas, sadness, teenager, anime, Pokemon, etc.)
Studying Learned Features in Language Models
Seeking out specific features
9.29
Can you find examples of neuron families/equivariance? (Ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
9.30
Neurons which link to attention heads - Induction should NOT trigger (e.g, current token repeated but previous token is not, different copies of current string have different next tokens)
Studying Learned Features in Language Models
Seeking out specific features
9.31
Neurons which link to attention heads - fixing a skip trigram bug
Studying Learned Features in Language Models
Seeking out specific features
9.33
Neurons which link to attention heads - splitting a "token X is duplicated" feature into separate features for many common tokens X
Studying Learned Features in Language Models
Seeking out specific features
9.34
Neurons which represent positional information (i.e. not invariant across positions). You will need to input data with a random offset to isolate this.
Studying Learned Features in Language Models
Seeking out specific features
9.35
What is the longest n-gram you can find that seems represented?
Studying Learned Features in Language Models
Curiosities about neurons
9.42
Do neurons vary in terms of how heavy-tailed their activation distributions are? Does this correspond at all to monosemanticity?
Studying Learned Features in Language Models
Curiosities about neurons
9.45
Can you find any genuinely monosemantic neurons, i.e. ones that are mostly monosemantic across their entire activation range?
Studying Learned Features in Language Models
Curiosities about neurons
9.46
Find a feature where GELU is used to calculate it in a way that ReLU couldn't be (e.g, approximating a quadratic)
Studying Learned Features in Language Models
Curiosities about neurons
9.47
Can you find a feature which seems to be represented by several neurons?
Studying Learned Features in Language Models
Curiosities about neurons
9.48
Using 9.47 - what happens if you ablate some of the neurons? Is the feature robust to this? Does it need them all?
Studying Learned Features in Language Models
Curiosities about neurons
9.49
Can you find a feature that is highly diffuse across neurons? (I.e, represented by the MLP layer but doesn't activate any particular neuron a lot)
Studying Learned Features in Language Models
Curiosities about neurons
9.50
Look at the direct logit attribution of neurons and find the max dataset examples for this. How similar are the texts to max activating dataset examples?
Studying Learned Features in Language Models
Curiosities about neurons
9.51
Look at the max negative direct logit attribution. Are there neurons which systematically suppress the correct next token? Can you figure out what's up with these?
Studying Learned Features in Language Models
Curiosities about neurons
9.53
Using 9.52, can you come up with a better and more robust metric? How consistent is it across reasonable metrics?
Studying Learned Features in Language Models
Curiosities about neurons
9.54
The GELU and SoLU toy language models were trained with identical initialisation and data shuffle. Is there any correspondence between what neurons represent in each model?
Studying Learned Features in Language Models
Curiosities about neurons
9.55
If a feature is represented in one of the GELU/SoLU models, how likely is it to be represented in the other?
Studying Learned Features in Language Models
Curiosities about neurons
9.56
Can you find a neuron whose activation isn't significantly affected by the current token?
Studying Learned Features in Language Models
Miscellaneous
9.57
An important ability of a network is to attend to things within the current clause or sentence. Are models doing something more sophisticated than distance here, like punctuation? If so, are there relevant neurons/features?
Studying Learned Features in Language Models
Miscellaneous
9.60
Try doing dimensionality reduction over neuron activations across a bunch of text, and see how interpretable the resulting directions are.
Studying Learned Features in Language Models
Miscellaneous
9.61
Pick a BERTology paper and try to replicate it on GPT-2! (See post for ideas)
Studying Learned Features in Language Models
Miscellaneous
9.62
Make a PR to Neuroscope with some feature you wish it had!
Studying Learned Features in Language Models
Miscellaneous
9.63
Replicate the part of Conjecture's Polytopes paper where they look at the top (e.g.) 1000 dataset examples for a neuron across a ton of text and look for patterns. (Is it the case that there are monosemantic bands in the neuron activation spectrum?)