Learned features in language models

Please see Neel's post for a more detailed description of the problems.
Exploring Neuroscope
9.1
Explore random neurons! Use the interactive Neuroscope to test and verify your understanding.
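A minimal sketch of the verification loop, assuming TransformerLens and one of the toy models Neuroscope covers (the model name and neuron index below are placeholders):

```python
# Sketch: run your own text through a model and inspect one neuron's
# activations, to check a hypothesis formed from Neuroscope's examples.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder model
LAYER, NEURON = 0, 123                                # placeholder neuron

prompt = "The quick brown fox jumps over the lazy dog"
_, cache = model.run_with_cache(model.to_tokens(prompt))

# hook_post holds each neuron's post-activation value at every position
acts = cache["post", LAYER][0, :, NEURON]
for tok, act in zip(model.to_str_tokens(prompt), acts):
    print(f"{act.item():+.3f}  {tok!r}")
```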
Exploring Neuroscope
9.2
Look for interesting conceptual neurons in the middle layers of larger models, like the "numbers that refer to groups of people" neuron.
Exploring Neuroscope
9.3
Look for examples of detokenisation neurons
Exploring Neuroscope
9.4
Look for examples of trigram neurons (consistently activate on a pair of tokens and boost the logit of plausible next tokens)
Exploring Neuroscope
9.5
Look for examples of retokenization neurons
Exploring Neuroscope
9.6
Look for examples of context neurons (e.g. base64)
Exploring Neuroscope
9.7
Look for neurons that align with any of the feature ideas in 9.13-9.21
Exploring Neuroscope
9.10
How much does the logit attribution of a neuron align with the patterns in its dataset examples? Are the two related?
Seeking out specific features
9.13
Basic syntax (Lots of ideas in post)
Seeking out specific features
9.14
Linguistic features (Try using spaCy to automate this) (Lots of ideas in post)
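A minimal sketch of the spaCy side, assuming the en_core_web_sm pipeline; aligning spaCy's words with the model's tokens is the fiddly part and is omitted here:

```python
# Sketch: label each word's part of speech with spaCy, as a candidate
# binary feature to correlate against neuron activations.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The old cat sat quietly on the worn mat.")

is_adjective = [tok.pos_ == "ADJ" for tok in doc]  # one candidate feature
for tok, label in zip(doc, is_adjective):
    print(f"{tok.text:>10}  {tok.pos_:<6}  {label}")
```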
Seeking out specific features
9.15
Proper nouns (Lots of ideas in post)
Seeking out specific features
9.16
Python code features (Lots of ideas in post)
Seeking out specific features
9.20
LaTeX features. Try common commands (\left, \right) and section titles (\abstract, \introduction, etc.)
Seeking out specific features
9.23
Disambiguation neurons - foreign language disambiguation (e.g., "die" in Dutch vs. German vs. Afrikaans)
Seeking out specific features
9.24
Disambiguation neurons - words with multiple meanings (e.g., "bat" as animal or sports equipment)
Seeking out specific features
9.25
Search for memory management neurons (high negative cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
Seeking out specific features
9.26
Search for signal boosting neurons (high positive cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
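A minimal sketch covering both 9.25 and 9.26, assuming TransformerLens weight conventions (per layer, W_in is [d_model, d_mlp] and W_out is [d_mlp, d_model]); the model name is a placeholder:

```python
# Sketch: cosine similarity between each neuron's input direction
# (a column of W_in) and its output direction (a row of W_out).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder model
layer = 0
w_in = model.W_in[layer]    # [d_model, d_mlp]
w_out = model.W_out[layer]  # [d_mlp, d_model]

cos = torch.nn.functional.cosine_similarity(w_in.T, w_out, dim=-1)
print("memory management candidates:", cos.topk(10, largest=False).indices.tolist())
print("signal boosting candidates:  ", cos.topk(10).indices.tolist())
```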
Seeking out specific features
9.28
Can you find split-token neurons? (i.e., " Claire" vs. "Cl" and "aire" - the model should learn to identify the split-token case)
Seeking out specific features
9.32
Neurons which link to attention heads - duplicated token
Curiosities about neurons
9.40
When you look at the max dataset examples for a specific neuron, is that neuron the most activated neuron on the text? What does it look like in general?
Curiosities about neurons
9.41
Look at the distributions of neuron activations (pre and post-activation for GELU, and pre, mid, and post for SoLU). What does this look like? How heavy tailed? How well can it be modelled as a normal distribution?
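A minimal sketch for the GELU case, assuming TransformerLens hook names (hook_pre / hook_post; SoLU models additionally expose a mid hook); the model and texts are placeholders:

```python
# Sketch: pool pre- and post-activation values across text and measure
# heavy-tailedness via excess kurtosis (0 for a Gaussian).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder model
tokens = model.to_tokens(["The cat sat on the mat.", "import numpy as np"])
_, cache = model.run_with_cache(tokens)

pre = cache["pre", 0].flatten()    # before the nonlinearity
post = cache["post", 0].flatten()  # after the nonlinearity

def excess_kurtosis(x: torch.Tensor) -> float:
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean().item() - 3.0

print("pre: ", excess_kurtosis(pre))
print("post:", excess_kurtosis(post))
```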
Curiosities about neurons
9.43
How similar are the distributions between SoLU and GELU?
Curiosities about neurons
9.44
What does the distribution of the LayerNorm scale and softmax denominator in SoLU look like? Is it bimodal (indicating monosemantic features) or fairly smooth and unimodal?
Curiosities about neurons
9.52
Try comparing how monosemantic the neurons in a GELU vs. SoLU model are. Can you replicate the result that SoLU does better? What fraction of neurons seem monosemantic in each model?
Miscellaneous
9.59
Can you replicate the results of the interpretability illusion on SoLU models, which were trained on a mix of web text and Python code? (Find neurons that seem monosemantic on each dataset alone, but for importantly different patterns)
Exploring Neuroscope
9.8
Look for examples of neurons with a naive (but incorrect!) initial story that have a much simpler explanation after further investigation
Exploring Neuroscope
9.9
Look for examples of neurons with a naive (but incorrect!) initial story that have a much more complex explanation after further investigation
Exploring Neuroscope
9.11
If you find neurons for 9.10 that seem very inconsistent, can you figure out what's going on?
Exploring Neuroscope
9.12
For dataset examples for neurons in a 1L network, measure how much its pre-activation value comes from the output of each attention head vs. the embedding (vs. positional embedding!). If dominated by specific heads, how much do those heads attend to the tokens you expect?
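A minimal sketch of the decomposition, assuming TransformerLens and ignoring LayerNorm centering and bias terms (so the numbers are approximate); the model and neuron are placeholders:

```python
# Sketch (1L model): split a neuron's pre-activation at one position into
# contributions from the embedding, positional embedding, and each head.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder 1L model
model.set_use_attn_result(True)  # cache per-head outputs
NEURON, POS = 123, -1            # placeholder neuron, last position

_, cache = model.run_with_cache(model.to_tokens("The cat sat on the mat"))

parts = {"embed": cache["embed"], "pos_embed": cache["pos_embed"]}
for h in range(model.cfg.n_heads):
    parts[f"head_{h}"] = cache["result", 0][:, :, h, :]

scale = cache["blocks.0.ln2.hook_scale"]  # LayerNorm scale before the MLP
w_in_dir = model.W_in[0, :, NEURON]       # the neuron's input direction

for name, comp in parts.items():
    contrib = (comp / scale)[0, POS] @ w_in_dir
    print(f"{name:>10}: {contrib.item():+.3f}")
```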
Seeking out specific features
9.17
From 9.16 - level of indent for a line (harder because it's categorical/numeric)
Seeking out specific features
9.18
From 9.16 - level of bracket nesting (harder because it's categorical/numeric)
Seeking out specific features
9.19
General code features (Lots of ideas in post)
Seeking out specific features
9.21
Features in compiled LaTeX, e.g. paper citations
Seeking out specific features
9.22
Any of the more abstract neurons in Multimodal Neurons (e.g. Christmas, sadness, teenager, anime, Pokemon, etc.)
Seeking out specific features
9.29
Can you find examples of neuron families/equivariance? (Ideas in post)
Seeking out specific features
9.30
Neurons which link to attention heads - induction should NOT trigger (e.g., the current token is repeated but the previous token is not, or different copies of the current string have different next tokens)
Seeking out specific features
9.31
Neurons which link to attention heads - fixing a skip trigram bug
Seeking out specific features
9.33
Neurons which link to attention heads - splitting into "token X is duplicated" features for many common tokens
Seeking out specific features
9.34
Neurons which represent positional information (i.e., not invariant across positions). You will need to input data with a random offset to isolate this.
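A minimal sketch of the random-offset trick, with a filler prefix standing in for "random offset" (the model and neuron are placeholders):

```python
# Sketch: slide the same text to different absolute positions and check
# whether the neuron's activation on its final token changes.
import random
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder model
LAYER, NEURON = 0, 123                                # placeholder neuron

text = " the cat sat on the mat"
for _ in range(5):
    offset = random.randint(0, 50)
    prefix = " and" * offset  # " and" is a single GPT-2-style token
    _, cache = model.run_with_cache(model.to_tokens(prefix + text))
    act = cache["post", LAYER][0, -1, NEURON]
    print(f"offset {offset:2d}: {act.item():+.3f}")
```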
Seeking out specific features
9.35
What is the longest n-gram you can find that seems represented?
Curiosities about neurons
9.42
Do neurons vary in terms of how heavy tailed their distributions are? Does it at all correspond to monosemanticity?
Curiosities about neurons
9.45
Can you find any genuinely monosemantic neurons, i.e. ones that remain monosemantic across their entire activation range?
Curiosities about neurons
9.46
Find a feature where GELU is used to calculate it in a way that ReLU couldn't be (e.g., approximating a quadratic)
Curiosities about neurons
9.47
Can you find a feature which seems to be represented by several neurons?
Curiosities about neurons
9.48
Using 9.47 - what happens if you ablate some of the neurons? Is it robust to this? Does it need them all?
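A minimal sketch of the ablation, assuming TransformerLens hooks; the neuron indices are placeholders for whatever family you found in 9.47:

```python
# Sketch: zero-ablate a set of neurons and compare the loss to baseline.
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder model
LAYER, NEURONS = 0, [123, 456, 789]                   # placeholder neurons

tokens = model.to_tokens("The cat sat on the mat")
clean_loss = model(tokens, return_type="loss")

def zero_neurons(acts, hook):
    acts[:, :, NEURONS] = 0.0  # kill these neurons at every position
    return acts

ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("post", LAYER), zero_neurons)],
)
print(f"clean {clean_loss.item():.4f} -> ablated {ablated_loss.item():.4f}")
```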
Curiosities about neurons
9.49
Can you find a feature that is highly diffuse across neurons? (i.e., represented by the MLP layer but without activating any particular neuron a lot)
Curiosities about neurons
9.50
Look at the direct logit attribution of neurons and find the max dataset examples for this. How similar are the texts to max activating dataset examples?
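A minimal sketch of per-token direct logit attribution for one neuron, assuming TransformerLens conventions and ignoring the final LayerNorm's centering (the model and neuron are placeholders):

```python
# Sketch: a neuron's direct logit attribution to the actual next token.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder model
LAYER, NEURON = 0, 123                                # placeholder neuron

tokens = model.to_tokens("The cat sat on the mat")
_, cache = model.run_with_cache(tokens)

acts = cache["post", LAYER][0]                   # [pos, d_mlp]
scale = cache["ln_final.hook_scale"][0, :, 0]    # final LN scale, [pos]
logit_dirs = model.W_out[LAYER, NEURON] @ model.W_U  # [d_vocab]

next_toks = tokens[0, 1:]
dla = acts[:-1, NEURON] / scale[:-1] * logit_dirs[next_toks]
for tok, v in zip(model.to_str_tokens(tokens)[:-1], dla):
    print(f"{v.item():+.3f}  {tok!r}")
```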
Curiosities about neurons
9.51
Look at the max negative direct logit attribution. Are there neurons which systematically suppress the correct next token? Can you figure out what's up with these?
Curiosities about neurons
9.53
Following 9.52, can you come up with a better and more robust metric? How consistent are the results across reasonable metrics?
Curiosities about neurons
9.54
The GELU and SoLU toy language models were trained with identical initialisation and data shuffle. Is there any correspondence between what neurons represent in each model?
Curiosities about neurons
9.55
If a feature is represented in one of the GELU/SoLU models, how likely is it to be represented in the other?
Curiosities about neurons
9.56
Can you find a neuron whose activation isn't significantly affected by the current token?
Miscellaneous
9.57
An important ability of a network is to attend to things within the current clause or sentence. Are models doing something more sophisticated than raw token distance here, e.g. tracking punctuation? If so, are there relevant neurons/features?
Miscellaneous
9.60
Try doing dimensionality reduction over neuron activations across a bunch of text, and see how interpretable the resulting directions are.
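A minimal sketch with PCA (any dimensionality reduction would do); the model and texts are placeholders, and in practice you would want far more text:

```python
# Sketch: PCA over neuron activations pooled across positions; inspect the
# max-scoring tokens on each component to judge interpretability.
from sklearn.decomposition import PCA
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder model
tokens = model.to_tokens(["The cat sat on the mat.", "def f(x): return x"])
_, cache = model.run_with_cache(tokens)

acts = cache["post", 0].reshape(-1, model.cfg.d_mlp)  # [batch*pos, d_mlp]
pca = PCA(n_components=10).fit(acts.detach().cpu().numpy())
print("explained variance:", pca.explained_variance_ratio_.round(3))
```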
Miscellaneous
9.61
Pick a BERTology paper and try to replicate it on GPT-2! (See post for ideas)
Miscellaneous
9.62
Make a PR to Neuroscope with some feature you wish it had!
Miscellaneous
9.63
Replicate the part of Conjecture's Polytopes paper where they look at the top (e.g.) 1000 dataset examples for a neuron across a ton of text and look for patterns. (Is it the case that there are monosemantic bands in the neuron activation spectrum?)
Seeking out specific features
9.27
Search for neurons that clean up superposition interference.
Seeking out specific features
9.36
Try training linear probes for features from 9.13-9.35.
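A minimal sketch of a probe, using a deliberately easy feature ("does this token start with a space?") as a stand-in for the features above; the model and prompt are placeholders, and real experiments need much more data plus a held-out set:

```python
# Sketch: logistic-regression probe on the residual stream for a simple
# binary token feature.
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder model
tokens = model.to_tokens("The cat sat on the mat and looked around")
_, cache = model.run_with_cache(tokens)

resid = cache["resid_post", 0][0].detach().cpu().numpy()  # [pos, d_model]
labels = np.array(
    [t.startswith(" ") for t in model.to_str_tokens(tokens)], dtype=int
)

probe = LogisticRegression(max_iter=1000).fit(resid, labels)
print("train accuracy:", probe.score(resid, labels))  # use held-out data!
```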
Seeking out specific features
9.37
Using 9.36 - How does your ability to recover features from the residual stream compare to recovering them from MLP layer outputs vs. attention layer outputs? Can you find features that can only be recovered from some of these?
Seeking out specific features
9.38
Using 9.36 - Are there features that can only be recovered from certain MLP layers?
Seeking out specific features
9.39
Using 9.36 - Are there features that are significantly easier to recover from early layer residual streams and not from later layers?
Miscellaneous
9.58
Replicate "Knowledge Neurons in Pretrained Transformers" on a generative model. How consistent are these results with what Neuroscope shows?