Circuits in the wild

Please see Neel’s post on Circuits in the Wild for a more detailed description of the problems.
2.13 (Circuits in natural language): Choose your own adventure! Try finding behaviours of your own related to natural language circuits.
2.17 (Circuits in code models): Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic-flavored tasks should be easiest.
2.18 (Extensions to IOI paper): Understand IOI in the Stanford Mistral models. Does the same circuit arise? (You should be able to near-exactly copy Redwood's code for this.) Existing work: see Neel's post.
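As a starting point for 2.18, the Stanford Mistral GPT-2 replicas can be loaded in TransformerLens; below is a minimal sketch of checking the IOI logit difference. The model name "stanford-gpt2-small-a" is one of several seeds and is an assumption about the registry naming, so adjust as needed.

```python
# Minimal sketch: check whether a Stanford Mistral GPT-2 replica shows IOI-like behaviour.
# Assumes the TransformerLens model name "stanford-gpt2-small-a"; there are several seeds (a-e).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("stanford-gpt2-small-a")

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
logits = model(tokens)[0, -1]  # logits at the final position

# Logit difference between the indirect object (" Mary") and the subject (" John")
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
print("IOI logit diff:", (logits[mary] - logits[john]).item())
```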
2.19 (Extensions to IOI paper): Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup-style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?
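A minimal sketch of the kind of ablation experiment 2.19 suggests, using a TransformerLens hook to zero-ablate a single head on an IOI prompt and comparing logit differences. The layer/head indices are placeholders, not a claim about which heads are in the circuit, and zero-ablation is the crudest baseline (mean ablation is gentler).

```python
# Minimal sketch: zero-ablate one attention head and see how much the IOI logit diff drops.
# LAYER/HEAD below are placeholders; substitute the duplicate token / induction /
# S-inhibition heads identified in the IOI paper.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")

def logit_diff(logits):
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

LAYER, HEAD = 3, 0  # placeholder indices

def ablate_head(z, hook):
    z[:, :, HEAD, :] = 0.0  # z has shape [batch, pos, head, d_head]
    return z

clean = logit_diff(model(tokens))
ablated = logit_diff(model.run_with_hooks(
    tokens, fwd_hooks=[(utils.get_act_name("z", LAYER), ablate_head)]
))
print(f"clean logit diff {clean:.3f}, ablated {ablated:.3f}")
```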
2.21 (Extensions to IOI paper): Can we reverse engineer how duplicate token heads work deeply? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates, given that the current token is always a copy of itself?
2.1 (Circuits in natural language): Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?
2.3 (Circuits in natural language): A harder example would be numbers at the start of lines, like "1. Blah blah blah \n2. Blah blah blah\n" -> "3". Feels like it must be doing something induction-y!
2.4 (Circuits in natural language): 3-letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> "RFU".
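For prompt-level tasks like 2.3-2.5, it's worth first checking that the model actually does the behaviour before trying to reverse engineer it; a minimal sketch using TransformerLens's `test_prompt` utility on the acronym example:

```python
# Minimal sketch: sanity-check that GPT-2 Small actually predicts the acronym.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

prompt = "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union ("
# The acronym is produced letter by letter, so test the first letter of the answer.
utils.test_prompt(prompt, "R", model, prepend_space_to_answer=False)
```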
2.5 (Circuits in natural language): Converting names to emails, like "Katy Johnson <" -> "katy_johnson".
2.8 (Circuits in natural language): Learning that words after full stops start with capital letters.
2.9 (Circuits in natural language): Counting objects described in text. (E.g., "I picked up an apple, a pear, and an orange. I was holding three fruits.")
2.11 (Circuits in natural language): Reverse engineer an induction head in a non-toy model.
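For 2.11, a standard first step is to locate candidate induction heads by running repeated random tokens through the model and measuring how strongly each head attends back to the token just after the previous occurrence of the current token; a minimal sketch for GPT-2 Small:

```python
# Minimal sketch: score every head in GPT-2 Small for induction-like attention
# on a sequence of repeated random tokens.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

seq_len = 50
rand = torch.randint(1000, 10000, (1, seq_len))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand, rand], dim=-1).to(model.cfg.device)

_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, dest, src]
    # Induction heads attend from the second copy of a token back to the token
    # *after* its first occurrence, i.e. to source position dest - (seq_len - 1).
    diag = pattern.diagonal(dim1=-2, dim2=-1, offset=-(seq_len - 1))
    scores[layer] = diag.mean(dim=(0, -1))

top = torch.topk(scores.flatten(), 5)
for s, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"L{idx // model.cfg.n_heads}H{idx % model.cfg.n_heads}: induction score {s:.2f}")
```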
2.14 (Circuits in code models): Closing brackets. Bonus: tracking which bracket type to close - [, (, {, etc.
2.15 (Circuits in code models): Closing HTML tags.
2.22 (Extensions to IOI paper): Understand IOI in GPT-Neo. It's the same size as GPT-2 Small but seems to do IOI via MLP composition.
2.25 (Extensions to IOI paper): GPT-Neo wasn't trained with dropout, so it's a natural test case for 2.24.
2.26 (Extensions to IOI paper): Reverse engineering L4H11, a really sharp previous token head in GPT-2 Small, at the parameter level.
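One concrete handle on 2.26 is the positional part of the head's QK circuit: if L4H11 attends to the previous token mainly via positional embeddings, then W_pos W_Q W_K^T W_pos^T should have a strong band just below the diagonal. A minimal sketch that ignores LayerNorm and the token-embedding contribution (a full parameter-level analysis would need both):

```python
# Minimal sketch: look at the positional-embedding part of L4H11's QK circuit.
# Ignores LayerNorm scaling and token embeddings, so treat it as a rough picture only.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 4, 11

W_Q = model.W_Q[LAYER, HEAD]   # [d_model, d_head]
W_K = model.W_K[LAYER, HEAD]   # [d_model, d_head]
W_pos = model.W_pos[:128]      # first 128 positions, [pos, d_model]

# Attention scores between positions, from the positional embeddings alone
pos_scores = W_pos @ W_Q @ W_K.T @ W_pos.T / model.cfg.d_head ** 0.5

# If this is a previous-token head, the entry (q, q-1) should dominate each row
prev_strength = pos_scores.diagonal(offset=-1).mean().item()
self_strength = pos_scores.diagonal(offset=0).mean().item()
print(f"mean score to previous position: {prev_strength:.2f}")
print(f"mean score to current position:  {self_strength:.2f}")
```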
2.29 (Confusing things): Why do models have so many induction heads? How do they specialise, and why does the model need so many?
2.30 (Confusing things): Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?
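A quick way to reproduce the phenomenon in 2.30 before trying to explain it: zero-ablate the layer 0 MLP output with a hook and compare the language modelling loss (a minimal sketch; mean ablation would be a gentler baseline):

```python
# Minimal sketch: reproduce the "first MLP layer matters a lot" phenomenon in GPT-2 Small.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog. " * 20
tokens = model.to_tokens(text)

def zero_mlp0(mlp_out, hook):
    return mlp_out * 0.0  # crude zero-ablation; mean-ablation is a gentler baseline

clean_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens, return_type="loss",
    fwd_hooks=[(utils.get_act_name("mlp_out", 0), zero_mlp0)],
)
print(f"clean loss {clean_loss.item():.3f}, MLP0-ablated loss {ablated_loss.item():.3f}")
```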
2.31 (Confusing things): Can we find evidence of the residual stream as shared bandwidth hypothesis?
2.32 (Confusing things): A more specific version of 2.31: the idea that the model dedicates parameters to memory management, cleaning up memory once it's used. Are there neurons whose input and output weights have high negative cosine sim (so the output erases the input feature)? Do they correspond to cleaning up specific features?
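A minimal sketch of the check 2.32 suggests: compute, for every MLP neuron in GPT-2 Small, the cosine similarity between its input direction (a column of W_in) and its output direction (a row of W_out), and look at the most negative ones.

```python
# Minimal sketch: find MLP neurons whose output direction points opposite to their
# input direction, i.e. candidate "memory management" / feature-erasing neurons.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

for layer in range(model.cfg.n_layers):
    w_in = model.W_in[layer]    # [d_model, d_mlp]: column i is neuron i's input direction
    w_out = model.W_out[layer]  # [d_mlp, d_model]: row i is neuron i's output direction
    cos = torch.nn.functional.cosine_similarity(w_in.T, w_out, dim=-1)  # [d_mlp]
    values, indices = cos.topk(3, largest=False)  # most negative cosine sims
    neurons = ", ".join(f"N{i}: {v:.2f}" for i, v in zip(indices.tolist(), values.tolist()))
    print(f"layer {layer}: {neurons}")
```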
2.33 (Confusing things): What happens to the memory in an induction circuit? (See 2.32.)
2.6 (Circuits in natural language): A harder version of 2.5 is constructing an email from a snippet, like "Name: Jess Smith, Email: last name dot first name k @ gmail".
2.7 (Circuits in natural language): Interpret factual recall. Start with the causal tracing work from ROME, but how much more specific can you get? Heads? Neurons? Existing work: see Neel's post.
2.10 (Circuits in natural language): Interpreting memorisation. Sometimes GPT knows phone numbers. How?
2.16 (Circuits in code models): Methods depend on object type (e.g., x.append for a list, x.update for a dictionary).
2.23 (Extensions to IOI paper): What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?
2.24 (Extensions to IOI paper): Under what conditions do the compensation mechanisms (where ablating a name mover doesn't reduce performance much) occur? Is it due to dropout?
2.27 (Extensions to IOI paper): MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?
2.28 (Extensions to IOI paper): Understanding what's happening in the adversarial examples, most notably the S-Inhibition heads' attention patterns (hard).
2.34 (Studying larger models): GPT-J contains translation heads. Can you interpret how they work and what they do?
2.35 (Studying larger models): Try to find and reverse engineer fancier induction heads like pattern-matching heads - try GPT-J or GPT-NeoX.
2.36 (Studying larger models): What's up with few-shot learning? How does it work?
2.37 (Studying larger models): How does addition work? (Focus on 2-digit addition.)
2.38 (Studying larger models): What's up with Tim Dettmers' emergent features in the residual stream? Do they map to anything interpretable? What if we look at max activating dataset examples?
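For 2.38, a cheap first look is to run some text through a model and check which residual stream dimensions have outlier magnitudes, as in the LLM.int8() emergent-features observations; a minimal sketch (shown on GPT-2 Small for convenience, though the phenomenon is reported for larger models like GPT-J):

```python
# Minimal sketch: look for outlier residual stream dimensions on a sample of text.
# The emergent-features phenomenon is reported for larger models, so ideally run
# this on something like GPT-J rather than GPT-2 Small.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
text = "The study of mechanistic interpretability aims to reverse engineer neural networks."
tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

layer = model.cfg.n_layers // 2
resid = cache["resid_post", layer][0]   # [pos, d_model]
dim_scale = resid.abs().mean(dim=0)     # average magnitude per residual dimension
top = torch.topk(dim_scale, 5)
print("median dim scale:", dim_scale.median().item())
print("top outlier dims:", top.indices.tolist(), [f"{v:.1f}" for v in top.values.tolist()])
```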