Learned features in language models

Please see Neel’s post for a more detailed description of the problems.
Studying Learned Features in Language Models
Exploring Neuroscope
A
9.1
Explore random neurons! Use the interactive neuroscope to test and verify your understanding.
Studying Learned Features in Language Models
Exploring Neuroscope
A
9.2
Look for interesting conceptual neurons in the middle layers of larger models, like the "numbers that refer to groups of people" neuron.
Studying Learned Features in Language Models
Exploring Neuroscope
A
9.3
Look for examples of detokenisation neurons
Studying Learned Features in Language Models
Exploring Neuroscope
A
9.4
Look for examples of trigram neurons (consistently activate on a pair of tokens and boost the logit of plausible next tokens)
Studying Learned Features in Language Models
Exploring Neuroscope
A
9.5
Look for examples of retokenization neurons
Studying Learned Features in Language Models
Exploring Neuroscope
A
9.6
Look for examples of context neurons (e.g. base64)
Studying Learned Features in Language Models
Exploring Neuroscope
A
9.7
Look for neurons that align with any of the feature ideas in 9.13-9.21
Studying Learned Features in Language Models
Exploring Neuroscope
A
9.10
How much does the logit attribution of a neuron align with the dataset example patterns? Is it related?
Studying Learned Features in Language Models
Seeking out specific features
A
9.13
Basic syntax (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
A
9.14
Linguistic features (Try using spaCy to automate this) (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
A
9.15
Proper nouns (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
A
9.16
Python code features (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
A
9.20
LaTeX features. Try common commands (\left, \right) and section titles (\abstract, \introduction, etc.)
Studying Learned Features in Language Models
Seeking out specific features
A
9.23
Disambiguation neurons - Foreign language disambiguation (e.g., "die" in Dutch vs. German vs. Afrikaans)
Studying Learned Features in Language Models
Seeking out specific features
A
9.24
Disambiguation neurons - words with multiple meanings (e.g., "bat" as animal or sports equipment)
Studying Learned Features in Language Models
Seeking out specific features
A
9.25
Search for memory management neurons (high negative cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
Studying Learned Features in Language Models
Seeking out specific features
A
9.26
Search for signal boosting neurons (high positive cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
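For 9.25 and 9.26, one way to shortlist candidates is to compute the cosine similarity between each neuron's input and output weight vectors and sort by it. The following is a minimal sketch in plain Python, assuming `w_in` and `w_out` each hold one `d_model`-sized vector per neuron. The function names and toy weights are hypothetical; real weights would come from the model you are studying.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_neurons_by_in_out_similarity(w_in, w_out):
    """Return (neuron_index, similarity) pairs sorted from most negative to most positive."""
    sims = [(i, cosine_similarity(a, b)) for i, (a, b) in enumerate(zip(w_in, w_out))]
    return sorted(sims, key=lambda pair: pair[1])

# Toy weights for three neurons in a 3-dimensional residual stream (illustrative only).
w_in = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
w_out = [[-2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]

ranked = rank_neurons_by_in_out_similarity(w_in, w_out)
print("memory management candidate:", ranked[0])   # most negative similarity
print("signal boosting candidate:", ranked[-1])    # most positive similarity
```

The two ends of the ranking give the candidates for each problem; the dataset examples for those neurons still need to be inspected by hand.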
Studying Learned Features in Language Models
Seeking out specific features
A
9.28
Can you find split-token neurons? (i.e., " Claire" vs. "Cl" and "aire" - the model should learn to identify the split-token case)
Studying Learned Features in Language Models
Seeking out specific features
A
9.32
Neurons which link to attention heads - duplicated token
Studying Learned Features in Language Models
Curiosities about neurons
A
9.40
When you look at the max dataset examples for a specific neuron, is that neuron the most activated neuron on the text? What does it look like in general?
Studying Learned Features in Language Models
Curiosities about neurons
A
9.41
Look at the distributions of neuron activations (pre and post-activation for GELU, and pre, mid, and post for SoLU). What does this look like? How heavy tailed? How well can it be modelled as a normal distribution?
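One simple way to quantify "how heavy tailed" for 9.41 is excess kurtosis, which is about 0 for a normal distribution and grows with tail weight. The sketch below uses synthetic Gaussian pre-activations as a stand-in for real neuron activations (an assumption made purely for illustration) and applies the exact GELU. Kurtosis is only one of several reasonable choices here.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def excess_kurtosis(values):
    """Population excess kurtosis: roughly 0 for Gaussian data, large and positive for heavy tails."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0

# Synthetic stand-in for one neuron's pre-activations across many tokens.
rng = random.Random(0)
pre = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
post = [gelu(x) for x in pre]

print("pre-activation excess kurtosis:", excess_kurtosis(pre))
print("post-GELU excess kurtosis:", excess_kurtosis(post))
```

Running the same function per neuron on real activations gives a single number per neuron that can be compared across neurons and across activation functions.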
Studying Learned Features in Language Models
Curiosities about neurons
A
9.43
How similar are the distributions between SoLU and GELU?
Studying Learned Features in Language Models
Curiosities about neurons
A
9.44
What does the distribution of the LayerNorm scale and softmax denominator in SoLU look like? Is it bimodal (indicating monosemantic features) or fairly smooth and unimodal?
Studying Learned Features in Language Models
Curiosities about neurons
A
9.52
Try comparing how monosemantic the neurons in a GELU vs. SoLU model are. Can you replicate the result that SoLU does better? What are the rates for each model?
Studying Learned Features in Language Models
Miscellaneous
A
9.59
Can you replicate the results of the interpretability illusion on SoLU models, which were trained on a mix of web text and Python code? (Find neurons that seem monosemantic on either but with importantly different patterns)
Studying Learned Features in Language Models
Exploring Neuroscope
B
9.8
Look for examples of neurons with a naive (but incorrect!) initial story that have a much simpler explanation after further investigation
Studying Learned Features in Language Models
Exploring Neuroscope
B
9.9
Look for examples of neurons with a naive (but incorrect!) initial story that have a much more complex explanation after further investigation
Studying Learned Features in Language Models
Exploring Neuroscope
B
9.11
If you find neurons for 9.10 that seem very inconsistent, can you figure out what's going on?
Studying Learned Features in Language Models
Exploring Neuroscope
B
9.12
For dataset examples for neurons in a 1L network, measure how much its pre-activation value comes from the output of each attention head vs. the embedding (vs. positional embedding!). If dominated by specific heads, how much do those heads attend to the tokens you expect?
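The measurement in 9.12 relies on the neuron's pre-activation being linear in the MLP input, so it splits into one term per residual-stream component. A minimal sketch with toy vectors follows. It assumes any LayerNorm before the MLP is ignored or has had its scale frozen, and all names and numbers are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def decompose_pre_activation(w_in, bias, components):
    """Map each named residual-stream component to its contribution to the pre-activation."""
    contributions = {name: dot(w_in, vec) for name, vec in components.items()}
    contributions["bias"] = bias
    return contributions

# Toy input weights for one neuron and toy residual-stream components at one position.
w_in = [0.5, -1.0, 2.0]
components = {
    "token_embed": [1.0, 0.0, 0.0],
    "pos_embed": [0.0, 0.5, 0.0],
    "head_0": [0.0, 0.0, 1.0],
    "head_1": [2.0, 1.0, 0.0],
}

contribs = decompose_pre_activation(w_in, 0.1, components)
total = sum(contribs.values())
for name, value in contribs.items():
    print(f"{name}: {value:+.2f}")
print(f"total pre-activation: {total:+.2f}")
```

If one head dominates the total on the max-activating examples, its attention pattern on those examples is the next thing to inspect.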
Studying Learned Features in Language Models
Seeking out specific features
B
9.17
From 9.16 - level of indent for a line (harder because it's categorical/numeric)
Studying Learned Features in Language Models
Seeking out specific features
B
9.18
From 9.16 - level of bracket nesting (harder because it's categorical/numeric)
Studying Learned Features in Language Models
Seeking out specific features
B
9.19
General code features (Lots of ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
B
9.21
Features in compiled LaTeX, e.g. paper citations
Studying Learned Features in Language Models
Seeking out specific features
B
9.22
Any of the more abstract neurons in Multimodal Neurons (e.g. Christmas, sadness, teenager, anime, Pokemon, etc.)
Studying Learned Features in Language Models
Seeking out specific features
B
9.29
Can you find examples of neuron families/equivariance? (Ideas in post)
Studying Learned Features in Language Models
Seeking out specific features
B
9.30
Neurons which link to attention heads - Induction should NOT trigger (e.g., current token repeated but previous token is not, different copies of current string have different next tokens)
Studying Learned Features in Language Models
Seeking out specific features
B
9.31
Neurons which link to attention heads - fixing a skip trigram bug
Studying Learned Features in Language Models
Seeking out specific features
B
9.33
Neurons which link to attention heads - splitting into "token X is duplicated" for many common tokens
Studying Learned Features in Language Models
Seeking out specific features
B
9.34
Neurons which represent positional information (not invariant between position). Will need to input data with a random offset to isolate this.
Studying Learned Features in Language Models
Seeking out specific features
B
9.35
What is the longest n-gram you can find that seems represented?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.42
Do neurons vary in terms of how heavy tailed their distributions are? Does it at all correspond to monosemanticity?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.45
Can you find any genuinely monosemantic neurons, i.e. ones that are mostly monosemantic across their entire activation range?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.46
Find a feature where GELU is used to calculate it in a way that ReLU couldn't be (e.g., approximating a quadratic)
Studying Learned Features in Language Models
Curiosities about neurons
B
9.47
Can you find a feature which seems to be represented by several neurons?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.48
Using 9.47 - what happens if you ablate some of the neurons? Is it robust to this? Does it need them all?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.49
Can you find a feature that is highly diffuse across neurons? (i.e., represented by the MLP layer but doesn't activate any particular neuron a lot)
Studying Learned Features in Language Models
Curiosities about neurons
B
9.50
Look at the direct logit attribution of neurons and find the max dataset examples for this. How similar are the texts to max activating dataset examples?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.51
Look at the max negative direct logit attribution. Are there neurons which systematically suppress the correct next token? Can you figure out what's up with these?
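For these direct logit attribution problems, a neuron's direct contribution to a token's logit can be approximated as its activation times the dot product of its output weights with that token's unembedding vector. The sketch below uses toy numbers and ignores the final LayerNorm, which is a simplifying assumption. All names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def neuron_logit_attribution(activation, w_out, unembed_column):
    """Direct effect of one neuron on one token's logit (final LayerNorm ignored)."""
    return activation * dot(w_out, unembed_column)

# Toy values: three neurons writing into a 2-dimensional residual stream.
activations = [2.0, 0.5, 1.5]
w_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
unembed_correct = [1.0, 2.0]  # unembedding vector of the correct next token

dla = [neuron_logit_attribution(a, w, unembed_correct) for a, w in zip(activations, w_out)]
suppressors = [i for i, value in enumerate(dla) if value < 0]

print("direct logit attribution per neuron:", dla)
print("neurons suppressing the correct token:", suppressors)
```

Collecting these values over a large text corpus and sorting gives the max (or max negative) attribution examples, which can then be compared with the max activating examples.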
Studying Learned Features in Language Models
Curiosities about neurons
B
9.53
Using 9.52, can you come up with a better and more robust metric? How consistent is it across reasonable metrics?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.54
The GELU and SoLU toy language models were trained with identical initialisation and data shuffle. Is there any correspondence between what neurons represent in each model?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.55
If a feature is represented in one of the GELU/SoLU models, how likely is it to be represented in the other?
Studying Learned Features in Language Models
Curiosities about neurons
B
9.56
Can you find a neuron whose activation isn't significantly affected by the current token?
Studying Learned Features in Language Models
Miscellaneous
B
9.57
An important ability of a network is to attend to things within the current clause or sentence. Are models doing something more sophisticated than distance here, like punctuation? If so, are there relevant neurons/features?
Studying Learned Features in Language Models
Miscellaneous
B
9.60
Try doing dimensionality reduction over neuron activations across a bunch of text, and see how interpretable the resulting directions are.
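As a starting point for this, the first principal component of a (tokens x neurons) activation matrix can be found with power iteration on the covariance matrix. This is a minimal standard-library sketch on toy data. In practice a library PCA or SVD routine would be the usual choice, and the activation matrix here is invented for illustration.

```python
import math
import random

def top_principal_component(rows, n_iter=200, seed=0):
    """Approximate the leading eigenvector of the covariance matrix by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Random start, so it is very unlikely to be orthogonal to the top component.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Toy activations: neurons 0 and 1 fire together, neuron 2 is silent.
activations = [[t, t, 0.0] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
pc = top_principal_component(activations)
print("first principal direction (sign is arbitrary):", pc)
```

The resulting direction weights neurons 0 and 1 equally and ignores neuron 2; on real data, each direction would be interpreted by looking at the text that projects most strongly onto it.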
Studying Learned Features in Language Models
Miscellaneous
B
9.61
Pick a BERTology paper and try to replicate it on GPT-2! (See post for ideas)
Studying Learned Features in Language Models
Miscellaneous
B
9.62
Make a PR to Neuroscope with some feature you wish it had!
Studying Learned Features in Language Models
Miscellaneous
B
9.63
Replicate the part of Conjecture's Polytopes paper where they look at the top (e.g. 1,000) dataset examples for a neuron across a ton of text and look for patterns. (Is it the case that there are monosemantic bands in the neuron activation spectrum?)
Studying Learned Features in Language Models
Seeking out specific features
C
9.27
Search for neurons that clean up superposition interference.
Studying Learned Features in Language Models
Seeking out specific features
C
9.36
Try training linear probes for features from 9.13-9.35.
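A linear probe here is just a logistic regression trained to predict a binary feature label from an activation vector. The following is a minimal gradient-descent sketch in plain Python on a tiny, linearly separable toy dataset. The data, hyperparameters, and function names are hypothetical, and a real probe would be trained on cached model activations with a held-out evaluation set.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, lr=0.5, steps=200):
    """Fit logistic-regression weights and bias with full-batch gradient descent."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose thresholded probe output matches the label."""
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy "activations" (2-dimensional) with a binary feature label.
features = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
labels = [1, 1, 0, 0]

w, b = train_linear_probe(features, labels)
print("probe weights:", w, "bias:", b)
print("training accuracy:", probe_accuracy(w, b, features, labels))
```

Training the same probe on activations from different layers or components gives directly comparable accuracies.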
Studying Learned Features in Language Models
Seeking out specific features
C
9.37
Using 9.36 - How does your ability to recover features from the residual stream compare to MLP layer outputs vs. attention layer outputs? Can you find features that can only be recovered from some of these?
Studying Learned Features in Language Models
Seeking out specific features
C
9.38
Using 9.36 - Are there features that can only be recovered from certain MLP layers?
Studying Learned Features in Language Models
Seeking out specific features
C
9.39
Using 9.36 - Are there features that are significantly easier to recover from early layer residual streams and not from later layers?
Studying Learned Features in Language Models
Miscellaneous
C
9.58
Replicate Knowledge Neurons in Pretrained Transformers on a generative model. How much are these results consistent with what Neuroscope shows?