Circuits in the wild

Please see Neel’s post on Circuits in the Wild for a more detailed description of the problems.
2.13 (Circuits in natural language): Choose your own adventure! Try finding behaviours of your own related to natural language circuits.
2.17 (Circuits in code models): Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic-flavored tasks should be easiest.
2.18 (Extensions to IOI paper): Understand IOI in the Stanford Mistral models. Does the same circuit arise? (You should be able to near-exactly copy Redwood's code for this.) Existing work: see Neel's post.
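As a starting point for 2.18, the Stanford Mistral GPT-2 replicas can be loaded in TransformerLens; below is a minimal sketch of checking the IOI logit difference. The model name "stanford-gpt2-small-a" is one of several seeds and is an assumption about the registry naming, so adjust as needed.

```python
# Minimal sketch: check whether a Stanford Mistral GPT-2 replica shows IOI-like behaviour.
# Assumes the TransformerLens model name "stanford-gpt2-small-a"; there are several seeds (a-e).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("stanford-gpt2-small-a")

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
logits = model(tokens)[0, -1]  # logits at the final position

# Logit difference between the indirect object (" Mary") and the subject (" John")
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
print("IOI logit diff:", (logits[mary] - logits[john]).item())
```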
2.19 (Extensions to IOI paper): Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup-style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?
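A minimal sketch of the kind of ablation experiment 2.19 suggests, using a TransformerLens hook to zero-ablate a single head on an IOI prompt and comparing logit differences. The layer/head indices are placeholders, not a claim about which heads are in the circuit, and zero-ablation is the crudest baseline (mean ablation is gentler).

```python
# Minimal sketch: zero-ablate one attention head and see how much the IOI logit diff drops.
# LAYER/HEAD below are placeholders; substitute the duplicate token / induction /
# S-inhibition heads identified in the IOI paper.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")

def logit_diff(logits):
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

LAYER, HEAD = 3, 0  # placeholder indices

def ablate_head(z, hook):
    z[:, :, HEAD, :] = 0.0  # z has shape [batch, pos, head, d_head]
    return z

clean = logit_diff(model(tokens))
ablated = logit_diff(model.run_with_hooks(
    tokens, fwd_hooks=[(utils.get_act_name("z", LAYER), ablate_head)]
))
print(f"clean logit diff {clean:.3f}, ablated {ablated:.3f}")
```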
2.21 (Extensions to IOI paper): Can we reverse engineer how duplicate token heads work deeply? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates, given that the current token is always a copy of itself?
2.1 (Circuits in natural language): Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?
2.3 (Circuits in natural language): A harder example would be numbers at the start of lines, like "1. Blah blah blah \n2. Blah blah blah\n" -> "3". Feels like it must be doing something induction-y!
2.4 (Circuits in natural language): 3-letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> "RFU".
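For prompt-level tasks like 2.3-2.5, it's worth first checking that the model actually does the behaviour before trying to reverse engineer it; a minimal sketch using TransformerLens's `test_prompt` utility on the acronym example:

```python
# Minimal sketch: sanity-check that GPT-2 Small actually predicts the acronym.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

prompt = "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union ("
# The acronym is produced letter by letter, so test the first letter of the answer.
utils.test_prompt(prompt, "R", model, prepend_space_to_answer=False)
```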
2.5 (Circuits in natural language): Converting names to emails, like "Katy Johnson <" -> "katy_johnson".
2.8 (Circuits in natural language): Learning that words after full stops start with capital letters.
2.9 (Circuits in natural language): Counting objects described in text. (E.g., "I picked up an apple, a pear, and an orange. I was holding three fruits.")
2.11 (Circuits in natural language): Reverse engineer an induction head in a non-toy model.
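For 2.11, a standard first step is to locate candidate induction heads by running repeated random tokens through the model and measuring how strongly each head attends back to the token just after the previous occurrence of the current token; a minimal sketch for GPT-2 Small:

```python
# Minimal sketch: score every head in GPT-2 Small for induction-like attention
# on a sequence of repeated random tokens.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

seq_len = 50
rand = torch.randint(1000, 10000, (1, seq_len))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand, rand], dim=-1).to(model.cfg.device)

_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, dest, src]
    # Induction heads attend from the second copy of a token back to the token
    # *after* its first occurrence, i.e. to source position dest - (seq_len - 1).
    diag = pattern.diagonal(dim1=-2, dim2=-1, offset=-(seq_len - 1))
    scores[layer] = diag.mean(dim=(0, -1))

top = torch.topk(scores.flatten(), 5)
for s, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"L{idx // model.cfg.n_heads}H{idx % model.cfg.n_heads}: induction score {s:.2f}")
```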
2.14 (Circuits in code models): Closing brackets. Bonus: tracking which bracket type to close - [, (, {, etc.
2.15 (Circuits in code models): Closing HTML tags.
2.22 (Extensions to IOI paper): Understand IOI in GPT-Neo. It's the same size as GPT-2 Small but seems to do IOI via MLP composition.
2.25 (Extensions to IOI paper): GPT-Neo wasn't trained with dropout, so it's a natural test case for 2.24.
2.26 (Extensions to IOI paper): Reverse engineering L4H11, a really sharp previous token head in GPT-2 Small, at the parameter level.
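One concrete handle on 2.26 is the positional part of the head's QK circuit: if L4H11 attends to the previous token mainly via positional embeddings, then W_pos W_Q W_K^T W_pos^T should have a strong band just below the diagonal. A minimal sketch that ignores LayerNorm and the token-embedding contribution (a full parameter-level analysis would need both):

```python
# Minimal sketch: look at the positional-embedding part of L4H11's QK circuit.
# Ignores LayerNorm scaling and token embeddings, so treat it as a rough picture only.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 4, 11

W_Q = model.W_Q[LAYER, HEAD]   # [d_model, d_head]
W_K = model.W_K[LAYER, HEAD]   # [d_model, d_head]
W_pos = model.W_pos[:128]      # first 128 positions, [pos, d_model]

# Attention scores between positions, from the positional embeddings alone
pos_scores = W_pos @ W_Q @ W_K.T @ W_pos.T / model.cfg.d_head ** 0.5

# If this is a previous-token head, the entry (q, q-1) should dominate each row
prev_strength = pos_scores.diagonal(offset=-1).mean().item()
self_strength = pos_scores.diagonal(offset=0).mean().item()
print(f"mean score to previous position: {prev_strength:.2f}")
print(f"mean score to current position:  {self_strength:.2f}")
```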
2.29 (Confusing things): Why do models have so many induction heads? How do they specialise, and why does the model need so many?
2.30 (Confusing things): Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?
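A quick way to reproduce the phenomenon in 2.30 before trying to explain it: zero-ablate the layer 0 MLP output with a hook and compare the language modelling loss (a minimal sketch; mean ablation would be a gentler baseline):

```python
# Minimal sketch: reproduce the "first MLP layer matters a lot" phenomenon in GPT-2 Small.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog. " * 20
tokens = model.to_tokens(text)

def zero_mlp0(mlp_out, hook):
    return mlp_out * 0.0  # crude zero-ablation; mean-ablation is a gentler baseline

clean_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens, return_type="loss",
    fwd_hooks=[(utils.get_act_name("mlp_out", 0), zero_mlp0)],
)
print(f"clean loss {clean_loss.item():.3f}, MLP0-ablated loss {ablated_loss.item():.3f}")
```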
2.31 (Confusing things): Can we find evidence of the residual stream as shared bandwidth hypothesis?
2.32 (Confusing things): A more specific version of 2.31: the idea that the model dedicates parameters to memory management, cleaning up memory once it's used. Are there neurons whose input and output weights have high negative cosine sim (so the output erases the input feature)? Do they correspond to cleaning up specific features?
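A minimal sketch of the check 2.32 suggests: compute, for every MLP neuron in GPT-2 Small, the cosine similarity between its input direction (a column of W_in) and its output direction (a row of W_out), and look at the most negative ones.

```python
# Minimal sketch: find MLP neurons whose output direction points opposite to their
# input direction, i.e. candidate "memory management" / feature-erasing neurons.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

for layer in range(model.cfg.n_layers):
    w_in = model.W_in[layer]    # [d_model, d_mlp]: column i is neuron i's input direction
    w_out = model.W_out[layer]  # [d_mlp, d_model]: row i is neuron i's output direction
    cos = torch.nn.functional.cosine_similarity(w_in.T, w_out, dim=-1)  # [d_mlp]
    values, indices = cos.topk(3, largest=False)  # most negative cosine sims
    neurons = ", ".join(f"N{i}: {v:.2f}" for i, v in zip(indices.tolist(), values.tolist()))
    print(f"layer {layer}: {neurons}")
```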
2.33 (Confusing things): What happens to the memory in an induction circuit? (See 2.32.)
2.6 (Circuits in natural language): A harder version of 2.5 is constructing an email from a snippet, like "Name: Jess Smith, Email: last name dot first name k @ gmail".
2.7 (Circuits in natural language): Interpret factual recall. Start with the causal tracing work from ROME, but how much more specific can you get? Heads? Neurons? Existing work: see Neel's post.
2.10 (Circuits in natural language): Interpreting memorisation. Sometimes GPT knows phone numbers. How?
2.16 (Circuits in code models): Methods depend on object type (e.g., x.append for a list, x.update for a dictionary).
2.23 (Extensions to IOI paper): What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?
2.24 (Extensions to IOI paper): Under what conditions do the compensation mechanisms (where ablating a name mover doesn't reduce performance much) occur? Is it due to dropout?
2.27 (Extensions to IOI paper): MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?
2.28 (Extensions to IOI paper): Understanding what's happening in the adversarial examples, most notably the S-Inhibition heads' attention patterns (hard).
2.34 (Studying larger models): GPT-J contains translation heads. Can you interpret how they work and what they do?
2.35 (Studying larger models): Try to find and reverse engineer fancier induction heads like pattern-matching heads - try GPT-J or GPT-NeoX.
2.36 (Studying larger models): What's up with few-shot learning? How does it work?
2.37 (Studying larger models): How does addition work? (Focus on 2-digit addition.)
2.38 (Studying larger models): What's up with Tim Dettmers' emergent features in the residual stream? Do they map to anything interpretable? What if we look at max activating dataset examples?
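For 2.38, a cheap first look is to run some text through a model and check which residual stream dimensions have outlier magnitudes, as in the LLM.int8() emergent-features observations; a minimal sketch (shown on GPT-2 Small for convenience, though the phenomenon is reported for larger models like GPT-J):

```python
# Minimal sketch: look for outlier residual stream dimensions on a sample of text.
# The emergent-features phenomenon is reported for larger models, so ideally run
# this on something like GPT-J rather than GPT-2 Small.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
text = "The study of mechanistic interpretability aims to reverse engineer neural networks."
tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

layer = model.cfg.n_layers // 2
resid = cache["resid_post", layer][0]   # [pos, d_model]
dim_scale = resid.abs().mean(dim=0)     # average magnitude per residual dimension
top = torch.topk(dim_scale, 5)
print("median dim scale:", dim_scale.median().item())
print("top outlier dims:", top.indices.tolist(), [f"{v:.1f}" for v in top.values.tolist()])
```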