Polysemanticity and Superposition

Please see Neel’s post on Polysemanticity for a more detailed description of the problems.
All of the problems below fall under the category "Exploring Polysemanticity and Superposition".

| # | Difficulty | Category | Problem | Currently working |
|---|---|---|---|---|
| 4.5 | A | Confusions to study in Toy Models of Superposition | Explore neuron superposition by training their absolute value model on functions of multiple variables. Make the inputs binary (0/1) and look at the AND and OR of element pairs. | |
| 4.7 | A | Confusions to study in Toy Models of Superposition | Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the feature values all 1 (i.e., each feature takes only two possible values, 0 or 1). | |
| 4.10 | A | Confusions to study in Toy Models of Superposition | What happens if you replace ReLUs with GELUs in the toy models? | May 1, 2023 - Kunvar (firstuserhere) |
| 4.25 | A | Studying bottleneck superposition in real language models | Can you find any examples of the geometric superposition configurations in the residual stream of a language model? | |
| 4.37 | A | Comparing SoLU/GELU | How do TransformerLens SoLU/GELU models compare in Neuroscope under the SoLU polysemanticity metric? (What fraction of neurons seem monosemantic?) | |
| 4.2 | B | Confusions to study in Toy Models of Superposition | Replicate their absolute value model and study some of the variants of the ReLU output models (a minimal sketch of the ReLU output setup appears below the table). | May 4, 2023 - Kunvar (firstuserhere) |
| 4.3 | B | Confusions to study in Toy Models of Superposition | Explore neuron superposition by training their absolute value model on a more complex function like x -> x^2. | |
| 4.4 | B | Confusions to study in Toy Models of Superposition | What happens to their ReLU output model when there's non-uniform sparsity? E.g., one class of less sparse features and another of very sparse features. | |
| 4.6 | B | Confusions to study in Toy Models of Superposition | Explore neuron superposition by training their absolute value model on functions of multiple variables. Keep the inputs as uniform reals in [0, 1] and look at max(x, y). | |
| 4.8 | B | Confusions to study in Toy Models of Superposition | Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features discrete (1, 2, 3). | |
| 4.9 | B | Confusions to study in Toy Models of Superposition | Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Make the features uniform in [0.5, 1]. | April 30, 2023 - Kunvar (firstuserhere) |
| 4.21 | B | Studying bottleneck superposition in real language models | Induction heads copy the token they attend to into the output, which involves storing which of 50,000 tokens it is. How is this stored in a 64-dimensional space? | |
| 4.22 | B | Studying bottleneck superposition in real language models | How does the previous token head in an induction circuit communicate the value of the previous token to the key of the induction head? Bonus: What residual stream subspace does it take up? Is there interference? | |
| 4.23 | B | Studying bottleneck superposition in real language models | How does the IOI circuit communicate names/positions between composing heads? | |
| 4.24 | B | Studying bottleneck superposition in real language models | Are there dedicated dimensions for positional embeddings? Do any other components write to those dimensions? (See the embedding-norm sketch below the table.) | |
| 4.29 | B | Studying neuron superposition in real models | Look at a polysemantic neuron in a 1L language model. Can you figure out how the model disambiguates which feature it is? | |
| 4.31 | B | Studying neuron superposition in real models | Take a feature that's part of a polysemantic neuron in a 1L language model and try to identify every neuron that represents that feature. Is it sparse or diffuse? | |
| 4.38 | B | Comparing SoLU/GELU | Can you find any better metrics for polysemanticity? | |
| 4.39 | B | Comparing SoLU/GELU | The paper speculates that LayerNorm lets the model "smuggle through" superposition in SoLU models by smearing features across many dimensions and letting LayerNorm scale them up. Can you find evidence of this? | |
| 4.40 | B | Comparing SoLU/GELU | How similar are the neurons between SoLU/GELU models of the same layers? | |
| 4.11 | C | Confusions to study in Toy Models of Superposition | Can you find a toy model where GELU acts significantly differently from ReLU? | May 1, 2023 - Kunvar (firstuserhere) |
| 4.12 | C | Building toy models of superposition | Build a toy model of a classification problem with cross-entropy loss. | |
| 4.13 | C | Building toy models of superposition | Build a toy model of neuron superposition that has many more hidden features than output features. | |
| 4.14 | C | Building toy models of superposition | Build a toy model that needs multiple hidden layers of ReLUs. Can computation in superposition happen across several layers? E.g., max(|x|, |y|). | |
| 4.15 | C | Building toy models of superposition | Build a toy model of attention head superposition/polysemanticity. Can you find a task where the model wants to do different things with an attention head on different inputs? How does it represent things internally / deal with interference? | |
| 4.17 | C | Making toy model counterexamples | Make toy models that are counterexamples in MI: a learned example of a network with a non-linear representation. | |
| 4.18 | C | Making toy model counterexamples | Make toy models that are counterexamples in MI: a network without a discrete number of features. | |
| 4.19 | C | Making toy model counterexamples | Make toy models that are counterexamples in MI: a non-decomposable neural network. | |
| 4.20 | C | Making toy model counterexamples | Make toy models that are counterexamples in MI: a task where networks can learn multiple different sets of features. | |
| 4.26 | C | Studying bottleneck superposition in real language models | Can you find any examples of locally almost-orthogonal bases? | |
| 4.27 | C | Studying bottleneck superposition in real language models | Do language models have "genre" directions that detect the type of text, and then represent features specific to each genre in the same subspace? | |
| 4.30 | C | Studying neuron superposition in real models | Look at a polysemantic neuron in a 2L language model. Can you figure out how the model disambiguates which feature it is? | |
| 4.32 | C | Studying neuron superposition in real models | Try to fully reverse engineer a feature discovered in 4.31. | |
| 4.33 | C | Studying neuron superposition in real models | Can you use superposition to create an adversarial example for a neuron? | |
| 4.34 | C | Studying neuron superposition in real models | Can you find any examples of the asymmetric superposition motif in the MLP of a 1-2 layer language model? | |
| 4.35 | C | | Pick a simple feature of language (e.g., is-number, is-base64) and train a linear probe to detect it in the MLP activations of a 1L language model (see the probing sketch below the table). | |
| 4.41 | C | Comparing SoLU/GELU | How do GELU and ReLU models compare in terms of polysemanticity? Replicate the SoLU analysis. | |
| 4.42 | C | Getting rid of superposition | If you train a 1L/2L language model with d_mlp = 100 * d_model, does superposition go away? | |
| 4.43 | C | Getting rid of superposition | Study T5 XXL. It's 11B parameters and not supported by TransformerLens, so expect major infrastructure pain. | |
| 4.45 | C | Getting rid of superposition | Pick an open problem at the end of Toy Models of Superposition. | |
| 4.16 | D | Building toy models of superposition | Build a toy model where the model needs to deal with simultaneous interference, and try to understand how it does so, or whether it can. | |
| 4.28 | D | Studying bottleneck superposition in real language models | Can you find examples of a model learning to deal with simultaneous interference? | |
| 4.36 | D | | Look for features in Neuroscope that seem to be represented by various neurons in a 1-2 layer language model. Train probes to detect some of them, and compare probe performance vs. neuron performance. | |
| 4.44 | D | Getting rid of superposition | Can you take a trained model, freeze all weights except one MLP layer, increase that layer's width 10x, copy each neuron 10 times, add noise, and fine-tune? Does this remove superposition / add new features? | |
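Many of the "Confusions to study in Toy Models of Superposition" problems (4.2-4.10) start from the paper's ReLU output model. Below is a minimal PyTorch sketch of that setup as a starting point; the hyperparameters (number of features, hidden size, sparsity, importance curve) are illustrative choices rather than the paper's exact values.

```python
# Minimal sketch of a "ReLU output" toy model of superposition.
# n_features sparse features are squeezed through a d_hidden bottleneck and
# reconstructed with a ReLU; all hyperparameters here are illustrative.
import torch

n_features, d_hidden, batch_size = 20, 5, 1024
feature_prob = 0.05                                # probability a feature is present
importance = 0.9 ** torch.arange(n_features)       # decaying per-feature importance

W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Sparse inputs: uniform [0, 1] when present, 0 otherwise.
    vals = torch.rand(batch_size, n_features)
    mask = (torch.rand(batch_size, n_features) < feature_prob).float()
    x = vals * mask
    x_hat = torch.relu(x @ W.T @ W + b)            # reconstruction through the bottleneck
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Problems 4.7-4.9 only change how `vals` is sampled (all ones, discrete {1, 2, 3}, or uniform in [0.5, 1]), and the neuron superposition problems (4.2, 4.3, 4.5, 4.6) swap the reconstruction target for a function of the inputs such as |x| or max(x, y).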
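For the probing problems (4.35 and 4.36), here is a hedged sketch of the basic loop, assuming TransformerLens's pretrained gelu-1l model and a toy "is a number" feature as a stand-in; a real experiment would collect activations over many prompts and hold out a test set rather than fitting one sentence.

```python
# Sketch: probe a 1L model's MLP activations for a simple per-token feature.
# Assumes the TransformerLens "gelu-1l" checkpoint; prompt and feature are toy stand-ins.
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")
text = "In 1987 there were 365 days, 12 months and 52 weeks, said Alice to Bob."
tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

acts = cache["blocks.0.mlp.hook_post"][0].detach().cpu()   # [seq_len, d_mlp] neuron activations
labels = [s.strip().isdigit() for s in model.to_str_tokens(text)]  # crude "is number" label per token

# In practice, gather activations over many prompts before fitting the probe.
probe = LogisticRegression(max_iter=1000).fit(acts.numpy(), labels)
print("train accuracy:", probe.score(acts.numpy(), labels))
```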
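For problem 4.24, one crude first-pass check, sketched below under the assumption of a TransformerLens model (gpt2-small as a stand-in): compare how strongly each residual stream dimension is used by the positional embedding versus the token embedding, then ask which other components read from or write to the dimensions that stand out.

```python
# Rough check for "dedicated" positional dimensions in the residual stream.
# Compares per-dimension norms of the positional and token embedding matrices.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
pos_norm = model.W_pos.norm(dim=0)          # [d_model]: how much each dim carries positional info
tok_norm = model.W_E.norm(dim=0)            # [d_model]: how much each dim carries token info

ratio = pos_norm / (tok_norm + 1e-6)
top = torch.topk(ratio, k=10)
print("dims most dominated by position:", top.indices.tolist())
print("pos/token norm ratios:", [round(v, 2) for v in top.values.tolist()])
```

This only looks at the embeddings themselves; checking whether attention heads or MLPs also write to those dimensions would require inspecting their output weights or activations over a corpus.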