Interpreting Algorithmic Problems

Please see Neel's post for a more detailed description of the problems.
| Sub-category | Difficulty | # | Problem |
| --- | --- | --- | --- |
| Beginner problems | A | 3.1 | Sorting fixed-length lists. (Format: START 4 6 2 9 MID 2 4 6 9.) See the data-generation sketch below. |
| Beginner problems | A | 3.2 | Sorting variable-length lists. (What's the sorting algorithm? What's the longest list you can get it to sort? How does length affect accuracy?) |
| Beginner problems | A | 3.3 | Interpret a 2L MLP (one hidden layer) trained to do modular addition. (Analogous to Neel's grokking work.) See the training sketch below. |
| Beginner problems | A | 3.4 | Interpret a 1L MLP trained to do modular subtraction. (Analogous to Neel's grokking work.) |
| Beginner problems | A | 3.5 | Taking the minimum or maximum of two ints. |
| Beginner problems | A | 3.6 | Permuting lists. |
| Beginner problems | A | 3.7 | Calculating sequences with a Fibonacci-style recurrence (predicting the next element from the previous two). |
| Extending Othello-GPT | A | 3.30 | Try one of Neel's concrete Othello-GPT projects. |
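For problem 3.1, most of the setup is data generation. Below is a minimal sketch of a generator for the START 4 6 2 9 MID 2 4 6 9 format; the digit vocabulary, list length, and token names are illustrative assumptions.

```python
import torch

LIST_LEN = 4
VOCAB = [str(d) for d in range(10)] + ["START", "MID"]
TOK = {t: i for i, t in enumerate(VOCAB)}

def make_batch(batch_size: int) -> torch.Tensor:
    """Token sequences of the form: START 4 6 2 9 MID 2 4 6 9."""
    seqs = []
    for _ in range(batch_size):
        xs = torch.randint(0, 10, (LIST_LEN,)).tolist()
        toks = ["START"] + [str(x) for x in xs] + ["MID"] + [str(x) for x in sorted(xs)]
        seqs.append([TOK[t] for t in toks])
    return torch.tensor(seqs)  # shape (batch_size, 2 * LIST_LEN + 2)

# When training, only score the positions after MID: the first half of the
# sequence is uniform random and carries irreducible loss.
```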
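For problem 3.3 (and 3.4 with subtraction swapped in), here is a minimal sketch of the one-hidden-layer setup, assuming one-hot inputs for the pair (a, b) and full-batch training. The hyperparameters are illustrative, not taken from Neel's grokking work.

```python
import torch
import torch.nn.functional as F

p, d_hidden = 113, 512
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # all (a, b) pairs
x = torch.cat([F.one_hot(pairs[:, 0], p), F.one_hot(pairs[:, 1], p)], dim=1).float()
y = pairs.sum(dim=1) % p

perm = torch.randperm(p * p)  # hold out most pairs so grokking is visible
n_train = int(0.3 * p * p)
train, test = perm[:n_train], perm[n_train:]

model = torch.nn.Sequential(
    torch.nn.Linear(2 * p, d_hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(d_hidden, p),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(30_000):  # grokking can take tens of thousands of steps
    loss = F.cross_entropy(model(x[train]), y[train])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            acc = (model(x[test]).argmax(dim=1) == y[test]).float().mean()
        print(step, loss.item(), acc.item())
```

Weight decay plus a small training fraction is what typically produces the delayed generalisation worth interpreting.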
| Sub-category | Difficulty | # | Problem |
| --- | --- | --- | --- |
| Harder problems | B | 3.8 | 5-digit addition/subtraction. |
| Harder problems | B | 3.9 | Predicting the output of a simple code function, e.g. problems like "a = 1 2 3. a[2] = 4. a -> 1 2 4". |
| Harder problems | B | 3.10 | Graph theory problems like this. Unsure of the correct input format; try a bunch. See here. |
| Harder problems | B | 3.12 | Train models for automata tasks and interpret them. Do your results match the theory? |
| Harder problems | B | 3.13 | In-Context Linear Regression: the transformer gets a sequence (x_1, y_1, x_2, y_2, ...) where y_i = Ax_i + b. A and b are different for each prompt and need to be learned in-context. (Code here.) See the data sketch below. |
| Harder problems | B | 3.16 | Predict repeated subsequences in randomly generated tokens, and see if you can find and reverse-engineer induction heads. See the data sketch below. |
| Harder problems | B | 3.18 | Build a toy model of Indirect Object Identification: train a tiny attention-only model on an algorithmic task simulating IOI, and reverse-engineer the learned solution. Compare it to the circuit found in GPT-2 Small. See the task sketch below. |
| Questions about language models | B | 3.22 | Train a 3L attention-only transformer to perform the Indirect Object Identification task. Can it do the task? Does it learn the same circuit found in GPT-2 Small? |
| Questions about language models | B | 3.23 | Redo Neel's modular addition analysis with GELU. Does it change things? |
| Questions about language models | B | 3.26 | In modular addition, look at what different dimensionality reduction techniques do on different weight matrices. Can you identify which weights matter most? Which neurons form clusters for each frequency? Anything from activations? |
| Extending Othello-GPT | B | 3.32 | Neuron interpretability and studying superposition: try to understand the model's MLP neurons, and explore which techniques do and don't work. Try to build our understanding of transformer MLPs in general. |
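For problem 3.13, here is a sketch of the per-prompt data generation, assuming x and y share a dimension so they can be interleaved into a single sequence of vectors; the dimension and example count are illustrative.

```python
import torch

def make_prompt(d: int = 4, n_examples: int = 16) -> torch.Tensor:
    """One prompt for in-context linear regression: x_1, y_1, x_2, y_2, ..."""
    A = torch.randn(d, d) / d ** 0.5  # fresh A and b for every prompt
    b = torch.randn(d)
    xs = torch.randn(n_examples, d)
    ys = xs @ A.T + b
    # Interleave into a single sequence of vectors; train the model to
    # predict each y_i from the prefix ending at x_i.
    return torch.stack([xs, ys], dim=1).reshape(2 * n_examples, d)
```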
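For problem 3.16, one simple way to generate data that rewards induction heads is to repeat the first half of a random sequence. The half-and-half scheme below is an assumption; repeating shorter subsequences at random offsets would also work.

```python
import torch

def make_batch(batch_size: int, seq_len: int = 64, d_vocab: int = 50) -> torch.Tensor:
    toks = torch.randint(0, d_vocab, (batch_size, seq_len))
    half = seq_len // 2
    toks[:, half:] = toks[:, :half]  # second half repeats the first
    # Loss on the first half is irreducible (uniform random tokens); loss on
    # the second half can be driven to zero by an induction-style
    # "A B ... A -> B" match-and-copy algorithm.
    return toks
```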
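For problem 3.18, one possible algorithmic stand-in for IOI: the model sees two distinct "names", one of which is then repeated, and must predict the name that appears only once. This [a, b, a] format is an assumption for illustration, not the task from the IOI paper.

```python
import torch

def make_batch(batch_size: int, n_names: int = 20):
    a = torch.randint(0, n_names, (batch_size,))   # the repeated name
    shift = torch.randint(1, n_names, (batch_size,))
    b = (a + shift) % n_names                      # a distinct second name
    swap = torch.rand(batch_size) < 0.5            # randomise the name order
    first = torch.where(swap, b, a)
    second = torch.where(swap, a, b)
    toks = torch.stack([first, second, a], dim=1)  # e.g. [a, b, a]
    return toks, b                                 # target: the non-repeated name
```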
| Sub-category | Difficulty | # | Problem |
| --- | --- | --- | --- |
| Harder problems | C | 3.14 | Problems in the style of In-Context Linear Regression that are learned in-context. See 3.13. |
| Harder problems | C | 3.15 | 5-digit (or binary) multiplication. |
| Harder problems | C | 3.17 | Choose your own adventure! Find your own algorithmic problem. LeetCode easy is probably a good source. |
| Harder problems | C | 3.19 | Is 3.18 consistent across random seeds, or can other algorithms be learned? Can a 2L model learn this? What happens if you add more MLPs or more layers? |
| Harder problems | C | 3.20 | Reverse-engineer Othello-GPT. Can you reverse-engineer the algorithms it learns, or the features the probes find? |
| Questions about language models | C | 3.24 | How does memorisation work? Try training a one-hidden-layer MLP to memorise random data, or training a transformer on a fixed set of random strings of tokens. See the sketch below. |
| Questions about language models | C | 3.25 | Compare different dimensionality reduction techniques on modular addition, or on a problem you feel you understand. |
| Questions about language models | C | 3.27 | Is direct logit attribution always useful? Can you find examples where it's highly misleading? |
| Extending Othello-GPT | C | 3.31 | Looking for modular circuits: try to find the circuits used to compute the world model, and the circuits that use the world model to compute the next move. Try to understand each in isolation, and use this to understand how they fit together. See what you can learn about finding modular circuits in general. |
| Extending Othello-GPT | C | 3.33 | Transformer Circuits Laboratory: explore and test other conjectures about transformer circuits, e.g. can we figure out how the model manages memory in the residual stream? |
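For problem 3.24, here is a minimal memorisation setup: random inputs with random labels, so there is no structure to generalise from and the model can only memorise. Sizes are illustrative.

```python
import torch
import torch.nn.functional as F

n_points, d_in, n_classes, d_hidden = 1000, 32, 10, 256
x = torch.randn(n_points, d_in)               # fixed random inputs
y = torch.randint(0, n_classes, (n_points,))  # random labels: pure memorisation

model = torch.nn.Sequential(
    torch.nn.Linear(d_in, d_hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(d_hidden, n_classes),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5_000):
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
# Once train accuracy is ~100%, ask how the memorisation is implemented:
# one data point per neuron, or distributed / in superposition?
```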
| Sub-category | Difficulty | # | Problem |
| --- | --- | --- | --- |
| Deep learning mysteries | D | 3.28 | Explore the Lottery Ticket Hypothesis. See the pruning sketch below. |
| Deep learning mysteries | D | 3.29 | Explore Deep Double Descent. |
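For problem 3.28, here is a sketch of iterative magnitude pruning, the procedure from the Lottery Ticket Hypothesis paper: train, prune the smallest-magnitude weights, rewind the survivors to their initial values, and repeat. The `train` function is an assumed helper that applies the masks during optimisation.

```python
import copy
import torch

def find_ticket(model: torch.nn.Module, train, rounds: int = 5, frac: float = 0.2):
    """Iterative magnitude pruning; `train(model, masks)` is assumed to
    zero masked weights during optimisation."""
    init_state = copy.deepcopy(model.state_dict())  # save the initialisation
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)
        with torch.no_grad():
            for name, param in model.named_parameters():
                alive = param[masks[name].bool()].abs()
                threshold = alive.quantile(frac)  # prune lowest `frac` of survivors
                masks[name] *= (param.abs() > threshold).float()
        model.load_state_dict(init_state)  # rewind surviving weights to init
    return masks  # the candidate "winning ticket"
```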