Open Problems in Mechanistic Interpretability

Analysing Training Dynamics

Please see Neel’s post on Analyzing training dynamics for a more detailed description of the problems.

Training dynamics

Category

Difficulty

Existing Work

Currently working

Help Wanted?

Understanding fine-tuning

5.16

How does model performance on the original training distribution change during fine-tuning?

Understanding training dynamics in language models

5.25

Look at attention heads on various texts and see if any have recognisable attention patterns, then analyse them over training.

Finding phase transitions

5.26

Look for phase transitions in the Indirect Object Identification task. (Note: This might not have a phase change)

Studying path dependence

5.33

How similar are the outputs of the Stanford CRFM models on a given text?

Studying path dependence

5.35

Look for Indirect Object Identification capability in other models of approximately the same size.

Studying path dependence

5.38

Can you find some problem where you understand the circuits and Git Re-Basin does work?

Algorithmic tasks - understanding grokking

5.1

Understanding why 5-digit addition has a phase change per digit (so 6 total?!)

Algorithmic tasks - understanding grokking

5.3

Look at the PCA of logits on the full dataset, or the PCA of a stack of flattened weights. If you plot a scatter plot of the first 2 components, the different phases of training are clearly visible. What's up with this?

Algorithmic tasks - understanding grokking

5.6

What happens if we include in the loss one of the progress measures in Neel's grokking post? Can we accelerate or stop grokking?

Algorithmic tasks - understanding grokking

5.7

Adam Jermyn provides an analytical argument and some toy models for why phase transitions should be an inherent part of (some of) how models learn. Can you find evidence of this in more complex models?

Algorithmic tasks - understanding grokking

5.8

Build on and refine Adam Jermyn's arguments and toy models - think about how they deviate from a real transformer, and build more faithful models.

Algorithmic tasks - lottery tickets

5.9

For a toy model trained to form induction heads, is there a lottery-ticket style thing going on? Can you disrupt induction head formation by messing with the initialisation?

Algorithmic tasks - lottery tickets

5.11

If we identify the parameters that form important circuits at the end of training on some toy task, then knock those parameters out at the start of training, how much does that delay or stop generalisation?

Algorithmic tasks - lottery tickets

5.12

Analysing how pairs of heads in an induction circuit compose over time - Can you find progress measures which predict these?

Algorithmic tasks - lottery tickets

5.13

Analysing how pairs of heads in an induction circuit compose over time - Can we predict which heads will learn to compose first?

Algorithmic tasks - lottery tickets

5.14

Analysing how pairs of heads in an induction circuit compose over time - Does the composition develop as a phase transition?

Understanding fine-tuning

5.17

How is the model different on fine-tuned text? Look at examples where the model does much better after fine-tuning, and some normal text.

Understanding fine-tuning

5.18

Try activation patching between the old and fine-tuned model and see how hard recovering performance is.

Understanding fine-tuning

5.19

Look at max activating text for various neurons in the original models. How has it changed post fine-tuning?

Understanding fine-tuning

5.20

Explore further and see what's going on with fine-tuning mechanistically.

Understanding training dynamics in language models

5.22

Can you replicate the induction head phase transition results in the various checkpointed models in TransformerLens? (If code works for attn-only-2l it should work for them all)

Understanding training dynamics in language models

5.23

Look at the neurons in TransformerLens SoLU models during training. Do they tend to form as a phase transition?

Finding phase transitions

5.27

Try digging into the specific heads that act on IOI and look for phase transitions. Use direct logit attribution for the name movers.

Finding phase transitions

5.28

Study the attention patterns of each category of heads in IOI for phase transitions.

Finding phase transitions

5.29

Look for phase transitions in simple IOI-style algorithmic tasks, like few-shot learning, addition, sorting words alphabetically...

Finding phase transitions

5.30

Look for phase transitions in soft induction heads like translation.

Studying path dependence

5.34

How much do the Stanford CRFM models differ on algorithmic tasks like Indirect Object Identification?

Studying path dependence

5.36

When model scale varies (e.g., GPT-2 small vs. medium), is there anything the smaller model can do that the larger one can't do? (Look at the difference in per-token log prob.)

Studying path dependence

5.37

Try applying the Git Re-Basin techniques to a 2L MLP trained for modular addition. Does this work? If you use Neel's grokking work to analyse the circuits involved, how does the re-basin technique map onto the circuits?

Algorithmic tasks - understanding grokking

5.2

Why do 5-digit addition phase changes happen in that order?

Algorithmic tasks - understanding grokking

5.4

Can we predict when grokking will happen? Bonus: Without using any future information?

Algorithmic tasks - understanding grokking

5.5

Understanding why the model chooses specific frequencies (and why it switches mid-training sometimes!)

Algorithmic tasks - lottery tickets

5.10

All Neel's toy models (attn-only, gelu, solu) were trained with the same data shuffle and weight initialisation. Many induction heads aren't shared, but L2H3 in 3L and L1H6 in 2L always are. What's up with that?

Understanding fine-tuning

5.15

Build a toy model of fine-tuning (train on task 1, fine-tune on task 2). What is going on internally? Any interesting motifs?

Understanding fine-tuning

5.21

Can you find any phase transitions in the fine-tuning checkpoints?

Understanding training dynamics in language models

5.24

Use the per-token loss analysis technique from the induction heads paper to look for more phase changes.

Finding phase transitions

5.31

Look for phase transitions in benchmark performance or specific questions from a benchmark.

Finding phase transitions

5.32

Hypothesis: Scaling laws happen because models experience a ton of tiny phase changes which average out to a smooth curve due to the law of large numbers. Can you find evidence for or against that?
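
The averaging intuition behind the hypothesis in 5.32 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not evidence about real models: each hypothetical "component" has a loss of 1.0 before its change point and 0.0 after, and change points are drawn uniformly at random over training. The function names, the step-shaped loss, and the uniform distribution are all choices made for illustration and do not come from the original post.

```python
import random


def averaged_loss_curve(n_components, n_steps, seed=0):
    """Average of many sharp step-function "losses" with random change points.

    Assumption (not from the source): each hypothetical component has loss 1.0
    before its change point and 0.0 after, and change points are drawn
    uniformly over training.
    """
    rng = random.Random(seed)
    change_points = [rng.uniform(0, n_steps) for _ in range(n_components)]
    curve = []
    for t in range(n_steps + 1):
        # A component still contributes loss if its change point is in the future.
        unlearned = sum(1 for c in change_points if t < c)
        curve.append(unlearned / n_components)
    return curve


def largest_drop(curve):
    """Largest loss decrease between consecutive steps."""
    return max(before - after for before, after in zip(curve, curve[1:]))


single = averaged_loss_curve(n_components=1, n_steps=100)
many = averaged_loss_curve(n_components=1000, n_steps=100)
print(largest_drop(single))  # one component: the whole loss vanishes in one step
print(largest_drop(many))    # many components: each step removes a small fraction
```

With uniformly distributed change points the averaged curve is roughly linear, not a power law, so this shows only that many sharp transitions can sum to a smooth curve. Testing the hypothesis on real models would need per-capability measurements across checkpoints and a realistic distribution of transition times.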

