Understanding neurons

1.5

How far can you get deeply reverse engineering a neuron in a 2+ layer model?

Circuits in natural language

2.6

A harder version of 2.5 is constructing an email from a snippet, like Name: Jess Smith, Email: last name dot first name k @ gmail

Circuits in natural language

2.7

Interpret factual recall. Start with ROME's work with causal tracing, but how much more specific can you get? Heads? Neurons?

Circuits in natural language

2.10

Interpreting memorisation. Sometimes GPT knows phone numbers. How?

Circuits in code models

2.16

Methods depend on object type (e.g., x.append for a list, x.update for a dictionary)
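
One way to set this problem up is with contrast prompts that differ only in the type of `x`, so that the correct method name is the only thing that changes. The sketch below is a minimal, hypothetical prompt builder; the templates and the `build_prompts` name are illustrative assumptions, not taken from any existing dataset.

```python
# Hypothetical contrast prompts: the only difference between the two
# prompts is the literal that fixes the type of `x`.
EXPECTED_METHOD = {
    "list": ("[]", "append"),
    "dict": ("{}", "update"),
}

def build_prompts(var_name="x"):
    """Return {type_name: (prompt, expected_next_token)}."""
    prompts = {}
    for type_name, (literal, method) in EXPECTED_METHOD.items():
        # The prompt ends right after the dot, so the next token
        # the model should predict is the method name.
        prompt = f"{var_name} = {literal}\n{var_name}."
        prompts[type_name] = (prompt, method)
    return prompts

prompts = build_prompts()
```

Comparing the logit for `append` against the logit for `update` on each prompt would give a simple metric to patch against, assuming both method names are single tokens in the model's vocabulary.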

Extensions to IOI paper

2.23

What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?

Extensions to IOI paper

2.24

Under what conditions do the compensation mechanisms occur, where ablating a name mover doesn't reduce performance much? Is it due to dropout?

Extensions to IOI paper

2.27

MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?

Extensions to IOI paper

2.28

Understanding what's happening in the adversarial examples, most notably the S-Inhibition Head attention pattern (hard)

Studying larger models

2.34

GPT-J contains translation heads. Can you interpret how they work and what they do?

Studying larger models

2.35

Try to find and reverse engineer fancier induction heads like pattern matching heads - try GPT-J or GPT-NeoX.

Studying larger models

2.36

What's up with few-shot learning? How does it work?

Studying larger models

2.37

How does addition work? (Focus on 2-digit)

Studying larger models

2.38

What's up with Tim Dettmers' emergent features in the residual stream stuff? Do they map to anything interpretable? What if we do max activating dataset examples?

Interpreting Algorithmic Problems

Harder problems

3.14

Problems in In-Context Linear Regression that are in-context learned. See 3.13.

Interpreting Algorithmic Problems

Harder problems

3.15

5 digit (or binary) multiplication

Interpreting Algorithmic Problems

Harder problems

3.17

Choose your own adventure! Find your own algorithmic problem. Leetcode easy is probably a good source.

Interpreting Algorithmic Problems

3.19

Is 3.18 consistent across random seeds, or can other algorithms be learned? Can a 2L model learn this? What happens if you add more MLPs or more layers?

Interpreting Algorithmic Problems

3.20

Reverse-engineer Othello-GPT. Can you reverse-engineer the algorithms it learns, or the features the probes find?

Interpreting Algorithmic Problems

Questions about language models

3.24

How does memorisation work? Try training a one hidden layer MLP to memorise random data, or training a transformer on a fixed set of random strings of tokens.

Interpreting Algorithmic Problems

Questions about language models

3.25

Compare different dimensionality reduction techniques on modular addition or a problem you feel you understand.

Interpreting Algorithmic Problems

Questions about language models

3.27

Is direct logit attribution always useful? Can you find examples where it's highly misleading?
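
One known limitation is that direct logit attribution only measures the direct path from a component to the logits. The hand-built toy below is a sketch, not a trained model and with no layer norm: component A writes to a direction the unembedding ignores, and component B reads that direction and writes to the logit direction. All names and numbers are illustrative assumptions.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Direction in the 2-d residual stream that the logit difference reads.
W_DIFF = [0.0, 1.0]

def run(ablate_a=False):
    """Return the outputs that components A and B add to the residual stream."""
    # A writes only to dimension 0, which W_DIFF ignores.
    a_out = [0.0, 0.0] if ablate_a else [1.0, 0.0]
    # B reads dimension 0 of the residual stream (A's output)
    # and writes to dimension 1, which W_DIFF reads.
    b_out = [0.0, 2.0 * a_out[0]]
    return a_out, b_out

a_out, b_out = run()
# Direct logit attribution: project each component's output onto W_DIFF.
dla = {"A": dot(a_out, W_DIFF), "B": dot(b_out, W_DIFF)}
clean_logit_diff = dla["A"] + dla["B"]

# Total effect of A: ablate it and re-run everything downstream.
a_abl, b_abl = run(ablate_a=True)
ablated_logit_diff = dot(a_abl, W_DIFF) + dot(b_abl, W_DIFF)
total_effect_of_a = clean_logit_diff - ablated_logit_diff
```

Here A gets zero direct attribution even though ablating it removes the whole logit difference, because its effect is entirely mediated by B.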

Interpreting Algorithmic Problems

Extending Othello-GPT

3.31

Looking for modular circuits - try to find the circuits used to compute the world model and to use the world model to compute the next move. Try to understand each in isolation and use this to understand how they fit together. See what you can learn about finding modular circuits in general.

Interpreting Algorithmic Problems

Extending Othello-GPT

3.33

Transformer Circuits Laboratory - Explore and test other conjectures about transformer circuits - e.g., can we figure out how the model manages memory in the residual stream?

Exploring Polysemanticity and Superposition

Confusions to study in Toy Models of Superposition

4.11

Can you find a toy model where GELU acts significantly differently from ReLU?

May 1, 2023 - Kunvar (firstuserhere)

Exploring Polysemanticity and Superposition

Building toy models of superposition

4.12

Build a toy model of a classification problem with cross-entropy loss

November 10, 2023 - Lucas Hayne

Exploring Polysemanticity and Superposition

Building toy models of superposition

4.13

Build a toy model of neuron superposition that has many more hidden features than output features

Exploring Polysemanticity and Superposition

Building toy models of superposition

4.14

Build a toy model that needs multiple hidden layers of ReLUs. Can computation in superposition happen across several layers? E.g., max(|x|, |y|).
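
As a reference point for the max(|x|, |y|) example, the target function can be written exactly with two ReLU layers and linear maps between them. The sketch below is a hand-coded solution with no superposition, useful only as ground truth to compare a trained toy model against; the function names are illustrative.

```python
def relu(z):
    return max(z, 0.0)

def max_abs(x, y):
    """Compute max(|x|, |y|) using two hidden ReLU layers."""
    # Layer 1: four ReLU neurons. |x| = relu(x) + relu(-x), same for y.
    h1 = [relu(x), relu(-x), relu(y), relu(-y)]
    a = h1[0] + h1[1]  # |x|, a linear readout of layer 1
    b = h1[2] + h1[3]  # |y|
    # Layer 2: three ReLU neurons. For a, b >= 0 we have
    # max(a, b) = (a + b)/2 + |a - b|/2
    #           = (relu(a + b) + relu(a - b) + relu(b - a)) / 2.
    h2 = [relu(a + b), relu(a - b), relu(b - a)]
    return 0.5 * sum(h2)
```

For example, `max_abs(-3.0, 2.0)` evaluates to `3.0`.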

Exploring Polysemanticity and Superposition

Building toy models of superposition

4.15

Build a toy model of attention head superposition/polysemanticity. Can you find a task where the model wants to do different things with an attention head on different inputs? How does it represent things internally / deal with interference?

Exploring Polysemanticity and Superposition

Making toy model counterexamples

4.17

Make toy models that are counterexamples in MI. A learned example of a network with a non-linear representation.

Exploring Polysemanticity and Superposition

Making toy model counterexamples

4.18

Make toy models that are counterexamples in MI. A network without a discrete number of features.

Exploring Polysemanticity and Superposition

Making toy model counterexamples

4.19

Make toy models that are counterexamples in MI. A non-decomposable neural network.

Exploring Polysemanticity and Superposition

Making toy model counterexamples

4.20

Make toy models that are counterexamples in MI. A task where networks can learn multiple different sets of features.

Exploring Polysemanticity and Superposition

Studying bottleneck superposition in real language models

4.26

Can you find any examples of locally almost-orthogonal bases?

Exploring Polysemanticity and Superposition

Studying bottleneck superposition in real language models

4.27

Do language models have "genre" directions that detect the type of text, and then represent features specific to each genre in the same subspace?

Exploring Polysemanticity and Superposition

Studying neuron superposition in real models

4.30

Look at a polysemantic neuron in a 2L language model. Can you figure out how the model disambiguates what feature it is?

Exploring Polysemanticity and Superposition

Studying neuron superposition in real models

4.32

Try to fully reverse engineer a feature discovered in 4.31.

Exploring Polysemanticity and Superposition

Studying neuron superposition in real models

4.33

Can you use superposition to create an adversarial example for a neuron?

Exploring Polysemanticity and Superposition

Studying neuron superposition in real models

4.34

Can you find any examples of the asymmetric superposition motif in the MLP of a 1-2 layer language model?

Exploring Polysemanticity and Superposition

4.35

Pick a simple feature of language (e.g., is number, is base64) and train a linear probe to detect that in the MLP activations of a 1L language model.

Exploring Polysemanticity and Superposition

Comparing SoLU/GELU

4.41

How do GELU and ReLU compare with respect to polysemanticity? Replicate the SoLU analysis.

Exploring Polysemanticity and Superposition

Getting rid of superposition

4.42

If you train a 1L/2L language model with d_mlp = 100 * d_model, does superposition go away?

Exploring Polysemanticity and Superposition

Getting rid of superposition

4.43

Study the T5 XXL. It's 11B params and not supported by TransformerLens. Expect major infrastructure pain.

Exploring Polysemanticity and Superposition

Getting rid of superposition

4.45

Pick an open problem at the end of Toy Models of Superposition.

Analysing Training Dynamics

Algorithmic tasks - understanding grokking

5.2

Why do 5-digit addition phase changes happen in that order?

Analysing Training Dynamics

Algorithmic tasks - understanding grokking

5.4

Can we predict when grokking will happen? Bonus: Without using any future information?

Analysing Training Dynamics

Algorithmic tasks - understanding grokking

5.5

Understanding why the model chooses specific frequencies (and why it switches mid-training sometimes!)

Analysing Training Dynamics

Algorithmic tasks - lottery tickets

5.10

All Neel's toy models (attn-only, gelu, solu) were trained with the same data shuffle and weight initialisation. Many induction heads aren't shared, but L2H3 in 3L and L1H6 in 2L always are. What's up with that?

Analysing Training Dynamics

Understanding fine-tuning

5.15

Build a toy model of fine-tuning (train on task 1, fine-tune on task 2). What is going on internally? Any interesting motifs?

Analysing Training Dynamics

Understanding fine-tuning

5.21

Can you find any phase transitions in the fine-tuning checkpoints?

Analysing Training Dynamics

Understanding training dynamics in language models

5.24

Use the per-token loss analysis technique from the induction heads paper to look for more phase changes.

Analysing Training Dynamics

Finding phase transitions

5.31

Look for phase transitions in benchmark performance or specific questions from a benchmark.

Techniques, Tooling, and Automation

Breaking current techniques

6.3

Can you fix direct logit attribution in GPT-Neo small, e.g., by finding a linear approximation to the final layer by taking gradients? (Eleuther's tuned lens in #interp-across-depth would be a good place to start)

Techniques, Tooling, and Automation

Breaking current techniques

6.6

Find edge cases where causal scrubbing breaks.

Techniques, Tooling, and Automation

Breaking current techniques

6.11

Automate ways to identify heads that compose. Start with IOI circuit and the composition scores in A Mathematical Framework.
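
The composition scores in A Mathematical Framework are ratios of Frobenius norms: the norm of the product of two heads' matrices divided by the product of their norms. The sketch below implements that ratio on plain nested lists. Which matrices are passed in (for example a later head's QK matrix or its transpose, and an earlier head's OV matrix) depends on whether Q-, K- or V-composition is being measured and is left to the caller. The column-vector convention (`w_later` multiplied on the left) is an assumption of this sketch.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frobenius(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))

def composition_score(w_later, w_earlier):
    """||w_later @ w_earlier||_F / (||w_later||_F * ||w_earlier||_F)."""
    denom = frobenius(w_later) * frobenius(w_earlier)
    if denom == 0.0:
        return 0.0
    return frobenius(matmul(w_later, w_earlier)) / denom

# Toy 2-d residual stream: the earlier head writes only to dimension 0.
writes_dim0 = [[1.0, 0.0], [0.0, 0.0]]
reads_dim0 = [[1.0, 0.0], [0.0, 0.0]]
reads_dim1 = [[0.0, 0.0], [0.0, 1.0]]

score_aligned = composition_score(reads_dim0, writes_dim0)
score_orthogonal = composition_score(reads_dim1, writes_dim0)
```

Automating the search would then amount to computing this score for every (earlier head, later head) pair and flagging pairs that score well above the baseline for random matrices of the same shape.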

Techniques, Tooling, and Automation

6.13

Can you automate direct path patching as used in the IOI paper?

Techniques, Tooling, and Automation

Automatically find circuits

6.27

Can you automate the detection of something in neuron interpretability? E.g., trigram neurons.

Techniques, Tooling, and Automation

Automatically find circuits

6.28

Find good ways to find the equivalent of max activating dataset examples for attention heads. Validate on induction circuits, then IOI. See post for ideas.

Techniques, Tooling, and Automation

Refine max activating dataset examples

6.29

Refine the max activating dataset examples technique for neuron interpretability to find minimal or diverse examples.
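
As a starting point, the baseline technique and one possible diversity refinement can be sketched as below. The input format (token lists paired with per-token activations for one neuron) and the heuristic of keeping at most one example per peak token are assumptions made for illustration, not an established method.

```python
import heapq

def top_activating(examples, k):
    """Baseline: the k examples with the highest peak activation.

    examples: list of (tokens, activations) pairs of equal length.
    Returns (peak_activation, peak_token, tokens) tuples, highest first.
    """
    scored = []
    for tokens, acts in examples:
        peak = max(range(len(acts)), key=acts.__getitem__)
        scored.append((acts[peak], tokens[peak], tokens))
    return heapq.nlargest(k, scored, key=lambda s: s[0])

def diverse_top_activating(examples, k):
    """Illustrative refinement: keep at most one example per peak token."""
    seen_tokens = set()
    picked = []
    for act, tok, tokens in top_activating(examples, len(examples)):
        if tok in seen_tokens:
            continue
        seen_tokens.add(tok)
        picked.append((act, tok, tokens))
        if len(picked) == k:
            break
    return picked

examples = [
    (["the", "cat"], [0.1, 0.9]),
    (["a", "cat"], [0.2, 0.8]),
    (["one", "dog"], [0.0, 0.5]),
]
baseline = top_activating(examples, 2)
diverse = diverse_top_activating(examples, 2)
```

On this toy data the baseline returns two examples that both peak on "cat", while the diverse variant returns one "cat" example and the "dog" example.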

Techniques, Tooling, and Automation

Refine max activating dataset examples

6.35

Using 6.28: (Infrastructure) Add any of 6.29-6.34 to Neuroscope. Email Neel (neelnanda27@gmail.com) for codebase access.

Techniques, Tooling, and Automation

Apply techniques from non-mechanistic interpretability

6.43

Can you use probing to get evidence for or against predictions in Toy Models of Superposition?

Techniques, Tooling, and Automation

Apply techniques from non-mechanistic interpretability

6.44

Pick anything interesting from Rauker et al and try to apply the techniques to circuits we understand.

Techniques, Tooling, and Automation

6.46

Take existing circuits and explore quantitative ways to characterise that it's a true circuit (or disprove it!) Try causal scrubbing to start.

Techniques, Tooling, and Automation

6.47

Build on Arthur Conmy's work to automatically find circuits via recursive path patching

Techniques, Tooling, and Automation

Taking the "diff" of two models

6.49

Build tooling to take the "diff" of two models, treating them as a black box mapping inputs to outputs, so it works with models with different internal structure
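
A minimal black-box version of this could rank inputs by how much the two models' output distributions disagree. The sketch below uses KL divergence between softmaxed logits as the disagreement measure; that choice, the `diff_models` name, and the toy models are assumptions for illustration. It assumes only that each model is a callable from an input to a list of logits over a shared vocabulary.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def diff_models(model_a, model_b, inputs):
    """Return (kl, input) pairs sorted from most to least divergent."""
    scored = []
    for x in inputs:
        p = softmax(model_a(x))
        q = softmax(model_b(x))
        scored.append((kl_divergence(p, q), x))
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Toy stand-ins for two models with different internals: they agree
# on small inputs and disagree on inputs >= 2.
def model_a(x):
    return [float(x), 0.0]

def model_b(x):
    return [float(x) if x < 2 else -float(x), 0.0]

ranking = diff_models(model_a, model_b, [0, 1, 2, 3])
```

The inputs at the top of `ranking` are where the two models behave most differently, which is where more detailed investigation would start.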

Techniques, Tooling, and Automation

6.59

We understand how attention is calculated for a head using the QK matrix. This doesn't work for rotary attention. Can you find a principled alternative?

Image Model Interpretability

Reverse engineering image models

7.1

Using Circuits techniques, how well can we reverse engineer ResNet?

Image Model Interpretability

Reverse engineering image models

7.2

Vision Transformers - can you smush together transformer circuits and image circuits techniques? Which ones transfer?

Image Model Interpretability

Reverse engineering image models

7.3

Using Circuits techniques, how well can we reverse engineer ConvNeXt, a modern image model architecture merging ResNet and vision transformer ideas?

Image Model Interpretability

Building on Circuits thread

7.4

How well can you hand-code curve detectors? Can you include color? How much performance can you recover?

Image Model Interpretability

Building on Circuits thread

7.5

Can you hand-code any other circuits? Start with other early vision neurons

Image Model Interpretability

Building on Circuits thread

7.8

Digging into polysemantic neuron examples and trying to understand better what's going on there.

Image Model Interpretability

Multimodal models (CLIP interpretability)

7.11

Can you rigorously reverse engineer any circuits, like the Curve Circuits paper?

Image Model Interpretability

Multimodal models (CLIP interpretability)

7.12

Can you apply transformer circuits techniques to understand the attention heads in the image part?

Image Model Interpretability

7.14

Train a checkpointed run of Inception. Do curve detectors form as a phase change?

Interpreting Reinforcement Learning

AlphaZero

8.1

Replicate some of Tom McGrath's AlphaZero work with LeelaChessZero. Use NMF on the activations and try to interpret some. See visualisations here.

Interpreting Reinforcement Learning

Goal misgeneralisation

8.5

Interpret one of the examples in the goal misgeneralisation papers (Langosco et al and Shah et al). Can you concretely figure out what's going on?

Interpreting Reinforcement Learning

Goal misgeneralisation

8.7

Using 8.5: Possible starting point - CoinRun. Interpreting RL Vision made significant progress and Langosco et al found it was an example of goal misgeneralisation - can you build on these to predict the misgeneralisation?

Interpreting Reinforcement Learning

8.10

Train and interpret a model from the In-Context Reinforcement Learning and Algorithmic Distillation paper. They trained small transformers where they input a sequence of moves for a "novel" RL task and the model outputs sensible answers for that task.

April 10, 2023 - Victor Levoso and others, working on reimplementing AD to try this; we have a channel for it on this Discord: https://discord.gg/cMr5YqbU4y

Interpreting Reinforcement Learning

Interpreting RLHF Transformers

8.12

Can you find any circuits in CarperAI's RLHF model corresponding to longer term planning?

Interpreting Reinforcement Learning

Interpreting RLHF Transformers

8.13

Can you get any traction on interpreting CarperAI's RLHF model's reward model?

Interpreting Reinforcement Learning

8.15

Try training and interpreting a small model from Guez et al. They trained model-free RL agents and showed evidence they spontaneously learned planning. Can you find evidence for/against this?

Interpreting Reinforcement Learning

8.19

Can you interpret a model on a task from 8.16-8.18 using Q-Learning?

Interpreting Reinforcement Learning

8.20

Take an agent trained with RL and train another network to copy the output logits of that agent. Try to reverse engineer the clone. Can you find the resulting circuits in the original?

Interpreting Reinforcement Learning

8.21

Once you've got traction understanding a fully trained agent on a task elsewhere in this category, try to extend this understanding to study it during training. Can you get any insight into what's actually going on?

Studying Learned Features in Language Models

Seeking out specific features

9.27

Search for neurons that clean up superposition interference.

Studying Learned Features in Language Models

Seeking out specific features

9.36

Try training linear probes for features from 9.13-9.35.

Studying Learned Features in Language Models

Seeking out specific features

9.37

Using 9.36 - How does your ability to recover features from the residual stream compare to MLP layer outputs vs. attention layer outputs? Can you find features that can only be recovered from some of these?

Studying Learned Features in Language Models

Seeking out specific features

9.38

Using 9.36 - Are there features that can only be recovered from certain MLP layers?

Studying Learned Features in Language Models

Seeking out specific features

9.39

Using 9.36 - Are there features that are significantly easier to recover from early layer residual streams and not from later layers?

Studying Learned Features in Language Models

Miscellaneous

9.58

Replicate Knowledge Neurons in Pretrained Transformers on a generative model. How much are these results consistent with what Neuroscope shows?