Open Problems in Mechanistic Interpretability

Toy language models

Please see Neel’s sequence on Toy Language Models for a more detailed description of the problems.

Toy language model problems

Category

Difficulty

Existing Work

Currently working

Help Wanted?

Understanding neurons

1.6

Hunt through Neuroscope for the toy models and look for interesting neurons to focus on.

Understanding neurons

1.7

Can you find any polysemantic neurons in Neuroscope? Explore this.

1.23

Choose your own adventure: Take a bunch of text with interesting patterns and run the models over it. Look for tokens they do really well on and try to reverse engineer what's going on!

Understanding neurons

1.1

How far can you get deeply reverse engineering a neuron in a 1L model? 1L is particularly easy since each neuron's output adds directly to the logits.
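Because a 1L neuron writes straight into the residual stream that is then unembedded, a first pass at problem 1.1 is to project the neuron's output weights through the unembedding and read off which tokens it boosts. The snippet below is a minimal sketch under assumptions: it ignores the final layer norm, uses plain Python lists in place of real model weights, and the names `neuron_logit_effect`, `top_boosted_tokens`, `w_out_row`, and `W_U` are illustrative rather than taken from any library.

```python
def neuron_logit_effect(w_out_row, W_U):
    """Logit change per unit of neuron activation.

    w_out_row: the neuron's output weights, length d_model (assumed layout).
    W_U: unembedding matrix as d_model rows of vocab-size columns (assumed layout).
    """
    # Dot the neuron's output direction with each token's unembedding column.
    return [sum(w * u for w, u in zip(w_out_row, column)) for column in zip(*W_U)]


def top_boosted_tokens(w_out_row, W_U, vocab, k=3):
    """Return the k tokens whose logits the neuron increases most."""
    effect = neuron_logit_effect(w_out_row, W_U)
    order = sorted(range(len(effect)), key=lambda i: effect[i], reverse=True)
    return [(vocab[i], effect[i]) for i in order[:k]]
```

With real weights you would pass one row of the MLP output matrix and the model's unembedding matrix; the ranking is only as trustworthy as the layer-norm approximation.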

Understanding neurons

1.2

Find an interesting neuron you think represents a feature. Can you fully reverse engineer which direction should activate that feature, and compare it to the neuron's input direction?

Understanding neurons

1.3

Look for trigram neurons in a 1L model and try to reverse engineer them (e.g., "ice cream -> sundae").

Understanding neurons

1.4

Check out the SoLU paper for more ideas on 1L neurons to find and reverse engineer.

Understanding neurons

1.8

Are there neurons whose behaviour can be matched by a regex or other code? If so, run it on a ton of text and compare the output.
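One way to make the comparison in problem 1.8 concrete is to treat the regex as a classifier for "the neuron fires on this token" and score it with precision and recall. This is a rough sketch under assumptions: the per-token activations are taken to be already extracted, the regex is matched against each whole token string, and the function name, `threshold`, and the returned keys are hypothetical choices.

```python
import re


def regex_vs_neuron(tokens, activations, pattern, threshold=0.5):
    """Compare a regex's predictions with where a neuron actually fires.

    tokens: token strings; activations: one float per token (assumed given).
    A token counts as "firing" when its activation exceeds `threshold`.
    """
    rx = re.compile(pattern)
    predicted = [bool(rx.fullmatch(tok)) for tok in tokens]
    fired = [act > threshold for act in activations]

    tp = sum(p and f for p, f in zip(predicted, fired))
    fp = sum(p and not f for p, f in zip(predicted, fired))
    fn = sum(f and not p for p, f in zip(predicted, fired))

    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # The mismatches are usually the most informative thing to read.
        "false_positives": [t for t, p, f in zip(tokens, predicted, fired) if p and not f],
        "false_negatives": [t for t, p, f in zip(tokens, predicted, fired) if f and not p],
    }
```

Reading the false positives and false negatives by hand tends to show whether the regex is missing a case or the neuron is doing something the regex cannot express.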

How do larger models differ?

1.9

How do 3-layer and 4-layer attention-only models differ from 2L? (For instance, induction heads only appeared with 2L. Can you find something useful that only appears at 3L or higher?)

How do larger models differ?

1.10

How do 3-layer and 4-layer attention-only models differ from 2L? Look for composition scores - try to identify pairs of heads that compose a lot.
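As a starting point for problem 1.10, a composition score can be sketched as the Frobenius norm of a product of two head matrices, divided by the product of their individual norms. This follows the general form used in the transformer circuits literature as I understand it; which matrices go in (and with which transposes) depends on whether you are measuring Q-, K-, or V-composition and on your weight layout, so treat the argument names here as assumptions. Plain nested lists stand in for real weights.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))


def composition_score(later_matrix, earlier_ov):
    """||later_matrix @ earlier_ov||_F / (||later_matrix||_F * ||earlier_ov||_F).

    later_matrix: e.g. a later head's QK or OV matrix (assumed orientation).
    earlier_ov: an earlier head's OV matrix (assumed orientation).
    """
    denom = frobenius_norm(later_matrix) * frobenius_norm(earlier_ov)
    if denom == 0:
        return 0.0
    return frobenius_norm(matmul(later_matrix, earlier_ov)) / denom
```

Scores are only meaningful relative to a baseline, such as the score for random matrices of the same shape, so it is worth computing that alongside any head pair you flag.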

How do larger models differ?

1.11

How do 3-layer and 4-layer attention-only models differ from 2L? Look for evidence of composition.

How do larger models differ?

1.12

How do 3-layer and 4-layer attention-only models differ from 2L? Ablate a single head and run the model on a lot of text. Look at the change in performance. Do any heads matter a lot that aren't induction heads?
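The bookkeeping for the ablation sweep in problem 1.12 might look like the sketch below. It assumes you already have some callable, here called `loss_fn`, that returns the model's average loss over your text with a given head ablated (and with no ablation when passed `None`); that callable, the `(layer, head)` tuples, and the `skip` argument for known induction heads are all hypothetical.

```python
def rank_heads_by_ablation(loss_fn, heads, skip=()):
    """Rank heads by how much ablating each one increases the loss.

    loss_fn: callable taking a head identifier (or None for no ablation)
             and returning average loss; assumed to be supplied by you.
    heads: iterable of head identifiers, e.g. (layer, head) tuples.
    skip: heads to leave out, e.g. ones already known to be induction heads.
    """
    baseline = loss_fn(None)
    deltas = {h: loss_fn(h) - baseline for h in heads if h not in skip}
    # Largest loss increase first: these heads matter most on this text.
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```

Whether you zero-ablate or mean-ablate the head inside `loss_fn` can change the ranking noticeably, so it is worth stating which one you used.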

How do larger models differ?

1.13

Look for tasks that an n-layer model can't do, but an n+1-layer model can, and look for a circuit that explains this. (Start by running both models on a bunch of text and look for per-token probability differences)
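The suggested first step, comparing per-token probabilities between the two models, can be sketched as follows. It assumes you have already collected, for the same tokenised text, each model's log probability of the correct token at every position; the function name and argument names are illustrative.

```python
def biggest_gains(tokens, logprobs_small, logprobs_large, k=5):
    """Find the k positions where the larger model most outperforms the smaller.

    tokens: the correct token at each position.
    logprobs_small / logprobs_large: each model's log probability of that
    token (assumed precomputed on identical text and tokenisation).
    Returns (position, token, gain) tuples, largest gain first.
    """
    if not (len(tokens) == len(logprobs_small) == len(logprobs_large)):
        raise ValueError("all inputs must have the same length")
    gains = [
        (large - small, pos, tok)
        for pos, (tok, small, large) in enumerate(zip(tokens, logprobs_small, logprobs_large))
    ]
    gains.sort(key=lambda g: g[0], reverse=True)
    return [(pos, tok, gain) for gain, pos, tok in gains[:k]]
```

The positions this surfaces are candidates only; the context before each one still needs reading to guess what capability the extra layer adds.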

How do larger models differ?

1.14

How do 1L SoLU/GELU models differ from 1L attention-only?

How do larger models differ?

1.15

How do 2L SoLU models differ from 1L?

How do larger models differ?

1.16

How does 1L GELU differ from 1L SoLU?

How do larger models differ?

1.17

Analyse how a larger model "fixes the bugs" of a smaller model.

How do larger models differ?

1.18

Does a 1L MLP transformer fix the skip trigram bugs of a 1L Attn Only model? If so, how?

How do larger models differ?

1.19

Does a 3L attn-only model fix bugs in induction heads in a 2L attn-only model? Try looking at split-token induction, where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens, e.g., " Claire" vs. "Cl" "aire".

How do larger models differ?

1.20

Does a 3L attn only model fix bugs in induction heads in a 2L attn-only model? Look at misfiring when the previous token appears multiple times with different following tokens
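To find test cases for this kind of misfiring, one option is to scan tokenised text for positions where the current token has already occurred with more than one distinct follower, since a naive induction head has no single obvious token to copy there. A minimal sketch, with a hypothetical function name and working on any list of token strings or ids:

```python
from collections import defaultdict


def ambiguous_induction_positions(tokens):
    """Positions where the current token previously had several different followers.

    Returns (position, token, sorted earlier followers) for each such position.
    """
    followers = defaultdict(set)  # token -> set of tokens seen right after it
    hits = []
    for pos, tok in enumerate(tokens):
        # Only followers from strictly earlier occurrences are recorded so far.
        if len(followers[tok]) > 1:
            hits.append((pos, tok, sorted(followers[tok])))
        if pos + 1 < len(tokens):
            followers[tok].add(tokens[pos + 1])
    return hits
```

Comparing the 2L and 3L models' predictions at exactly these positions gives a focused dataset for the question above.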

How do larger models differ?

1.21

Does a 3L attn-only model fix bugs in induction heads in a 2L attn-only model? Look at stopping induction on a token that likely shows the end of a repeated string (e.g., . or ! or ").

How do larger models differ?

1.22

Does a 2L MLP model fix these bugs (1.19-1.21) too?

Understanding neurons

1.5

How far can you get deeply reverse engineering a neuron in a 2+ layer model?
