Circuits In The Wild

Each entry below is listed as: problem number (difficulty, subcategory).

(Circuits in natural language): Choose your own adventure! Try finding behaviours of your own related to natural language circuits.

2.17 (A, Circuits in code models): Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic-flavoured tasks should be easiest.
2.18 (A, Extensions to IOI paper): Understand IOI in the Stanford mistral models. Does the same circuit arise? (You should be able to near-exactly copy Redwood's code for this.)

2.19 (A, Extensions to IOI paper): Do earlier heads in the circuit (duplicate token, induction, S-Inhibition) have backup-style behaviour? If we ablate them, how much does this damage performance? Will other components compensate?
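A possible starting point for 2.18/2.19, sketched with TransformerLens. The checkpoint alias stanford-gpt2-small-a is my assumption about how TransformerLens names the Stanford models (verify against its model table; swap in "gpt2" to sanity-check the harness first), and the ablated head is chosen arbitrarily for illustration:

```python
import torch
from transformer_lens import HookedTransformer, utils

# Assumption: TransformerLens exposes the Stanford checkpoints under this alias.
model = HookedTransformer.from_pretrained("stanford-gpt2-small-a")

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
io_token = model.to_single_token(" Mary")  # indirect object (correct answer)
s_token = model.to_single_token(" John")   # subject (incorrect answer)

def logit_diff(logits):
    # IOI metric: logit of the indirect object minus logit of the subject
    return (logits[0, -1, io_token] - logits[0, -1, s_token]).item()

print("clean logit diff:", logit_diff(model(tokens)))

# For 2.19: zero-ablate one head and see how much the metric drops,
# or whether other heads compensate.
LAYER, HEAD = 5, 5  # illustrative choice, not a known circuit head

def ablate_head(z, hook):
    z[:, :, HEAD, :] = 0.0  # z: [batch, pos, head, d_head]
    return z

ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(utils.get_act_name("z", LAYER), ablate_head)]
)
print("ablated logit diff:", logit_diff(ablated_logits))
```

If the alias checks out, the same harness should run unchanged across the several Stanford seeds, which is the point of 2.18.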
2.21 (A, Extensions to IOI paper): Can we deeply reverse engineer how duplicate token heads work? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates, given that the current token is always a copy of itself?
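For 2.21, one concrete angle is to inspect the token-level QK circuit of a candidate head directly: sample some tokens and ask whether query-key scores are highest on the diagonal (query token equals key token). The head index below is taken from the IOI paper's duplicate token heads (worth double-checking), and LayerNorm and positional terms are ignored, so this is qualitative only:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 3, 0  # a duplicate token head per the IOI paper (double-check)

# Token-level QK circuit on a sample of tokens, ignoring LayerNorm
# and positional embeddings.
torch.manual_seed(0)
tok_sample = torch.randint(0, model.cfg.d_vocab, (500,))
E = model.W_E[tok_sample]  # [500, d_model]
qk = E @ model.W_Q[LAYER, HEAD] @ model.W_K[LAYER, HEAD].T @ E.T  # [500, 500]

diag = qk.diagonal().mean()
off_diag = (qk.sum() - qk.diagonal().sum()) / (qk.numel() - len(tok_sample))
print(f"same-token score: {diag:.2f}, other-token score: {off_diag:.2f}")
# The puzzle: the current token always matches itself, so what stops the
# head attending to the current position on non-duplicated tokens?
```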
2.1 (B, Circuits in natural language): Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?
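For 2.1, a standard first step (a sketch of head localisation, not the weight-level reverse engineering the problem asks for) is to score heads by their attention signature on repeated random tokens: on the second repeat, an induction head attends from each token back to the token just after its previous occurrence, i.e. at offset 1 - seq_len. The 0.4 threshold is arbitrary:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len, batch = 50, 4

# Random tokens, repeated once: [BOS, x_1..x_50, x_1..x_50]
rand = torch.randint(100, model.cfg.d_vocab, (batch, seq_len))
bos = torch.full((batch, 1), model.tokenizer.bos_token_id)
tokens = torch.cat([bos, rand, rand], dim=1)

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, dest, src]
    # The "induction stripe": attention from each token to the token
    # after its previous occurrence, one repeat earlier.
    stripe = pattern.diagonal(offset=1 - seq_len, dim1=-2, dim2=-1)
    scores = stripe[:, :, -seq_len:].mean(dim=(0, 2))  # per-head score
    for head in range(model.cfg.n_heads):
        if scores[head] > 0.4:  # arbitrary threshold
            print(f"L{layer}H{head}: induction score {scores[head]:.2f}")
```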
2.3 (B, Circuits in natural language): A harder example would be numbers at the start of lines, like "1. Blah blah blah\n2. Blah blah blah\n" -> "3". Feels like it must be doing something induction-y!
2.4 (B, Circuits in natural language): Three-letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> "RFU".
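For 2.4, it's worth confirming the behaviour exists before any circuit analysis; a minimal probe follows. Note that GPT-2's tokenizer may split "RFU" across several tokens, so checking what appears in the top next-token predictions is the robust version:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = ("The Acrobatic Circus Group (ACG) and the "
          "Ringmaster Friendship Union (")
logits = model(prompt)  # [1, pos, d_vocab]
top_tokens = logits[0, -1].topk(5).indices
print(model.to_str_tokens(top_tokens))  # hope to see "R"/"RF"/"RFU" ranked high
```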
2.5 (B, Circuits in natural language): Converting names to emails, like "Katy Johnson <" -> "katy_johnson".
2.8 (B, Circuits in natural language): Learning that words after full stops begin with capital letters.
2.9 (B, Circuits in natural language): Counting objects described in text (e.g., "I picked up an apple, a pear, and an orange. I was holding three fruits.").
2.11 (B, Circuits in natural language): Reverse engineer an induction head in a non-toy model.

2.22 (B, Extensions to IOI paper): Understand IOI in GPT-Neo. It's the same size as GPT-2 Small, but seems to do IOI via MLP composition.
2.25 (B, Extensions to IOI paper): GPT-Neo wasn't trained with dropout; check whether the compensation effects of 2.24 still occur there.
2.26 (B, Extensions to IOI paper): Reverse engineer L4H11, a really sharp previous token head in GPT-2 Small, at the parameter level.
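For 2.26, two quick checks before diving into parameters: confirm the behavioural signature (attention concentrated on the previous token), then look at the position-to-position QK scores, which for a previous token head should be dominated by the positional embeddings. The sketch ignores LayerNorm and the 1/sqrt(d_head) scaling, so treat the QK matrix as qualitative:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 4, 11

# Behavioural check: how much attention goes to the previous token?
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
_, cache = model.run_with_cache(tokens)
pattern = cache["pattern", LAYER][0, HEAD]  # [dest, src]
prev_stripe = pattern.diagonal(offset=-1)   # attention from pos i to pos i-1
print("mean attention to previous token:", prev_stripe.mean().item())

# Parameter-level check: positional QK scores (LayerNorm ignored).
W_pos = model.W_pos  # [n_ctx, d_model]
qk = W_pos @ model.W_Q[LAYER, HEAD] @ model.W_K[LAYER, HEAD].T @ W_pos.T
# For a previous token head, qk[i, i-1] should stand out in each row.
print(qk[:8, :8].round())
```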
2.29 (B, Confusing things): Why do models have so many induction heads? How do they specialise, and why does the model need so many?
2.30 (B, Confusing things): Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?
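The effect in 2.30 is easy to reproduce; a minimal sketch comparing the clean loss against zero-ablating and mean-ablating the first MLP's output (mean ablation is the gentler intervention that keeps the average contribution but destroys token-specific information):

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
text = "The museum opened a new exhibit on the history of printing in Europe."
tokens = model.to_tokens(text)

clean_loss = model(tokens, return_type="loss").item()

def zero_mlp0(mlp_out, hook):
    return torch.zeros_like(mlp_out)

def mean_mlp0(mlp_out, hook):
    # Replace each position's MLP0 output with the mean over positions.
    return mlp_out.mean(dim=1, keepdim=True).expand_as(mlp_out).clone()

hook_name = utils.get_act_name("mlp_out", 0)
for name, fn in [("zero", zero_mlp0), ("mean", mean_mlp0)]:
    loss = model.run_with_hooks(
        tokens, return_type="loss", fwd_hooks=[(hook_name, fn)]
    ).item()
    print(f"{name}-ablated loss: {loss:.3f} (clean: {clean_loss:.3f})")
```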
2.31 (B, Confusing things): Can we find evidence for the "residual stream as shared bandwidth" hypothesis?
2.32 (B, Confusing things): Can we find evidence for the "residual stream as shared bandwidth" hypothesis? In particular, the idea that the model dedicates parameters to memory management, cleaning up features once they have been used. Are there neurons whose input and output weights have high negative cosine similarity (so the output erases the input feature)? Do they correspond to cleaning up specific features?
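For 2.32, the screening computation is cheap: for each MLP neuron, compare the direction it reads from the residual stream (its column of W_in) with the direction it writes back (its row of W_out). Strongly negative cosine similarity is the "erase what you read" signature the problem describes:

```python
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

W_in = model.W_in    # [n_layers, d_model, d_mlp]: read directions (columns)
W_out = model.W_out  # [n_layers, d_mlp, d_model]: write directions (rows)

# Cosine similarity between each neuron's read and write directions.
cos = F.cosine_similarity(W_in.transpose(-1, -2), W_out, dim=-1)  # [n_layers, d_mlp]

values, flat_idx = cos.flatten().topk(10, largest=False)  # most negative
layers = flat_idx // model.cfg.d_mlp
neurons = flat_idx % model.cfg.d_mlp
for v, l, n in zip(values.tolist(), layers.tolist(), neurons.tolist()):
    print(f"L{l}N{n}: cos = {v:.3f}")  # candidate memory-management neurons
```

A natural next step is to take the top candidates and look at their max activating dataset examples, to see whether each one cleans up a specific feature.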
2.33 (B, Confusing things): What happens to the memory in an induction circuit? (See 2.32.)
2.6 (C, Circuits in natural language): A harder version of 2.5 is constructing an email from a snippet, like "Name: Jess Smith, Email: last name dot first name k @ gmail".
2.7 (C, Circuits in natural language): Interpret factual recall. Start with the ROME paper's causal tracing, but how much more specific can you get? Heads? Neurons? Also: interpret memorisation. Sometimes GPT knows phone numbers. How?
2.16 (C, Circuits in code models): Methods depend on object type (e.g., x.append for a list, x.update for a dictionary).
2.23 (C, Extensions to IOI paper): What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?
2.24 (C, Extensions to IOI paper): Under what conditions do the compensation mechanisms occur, where ablating a Name Mover barely reduces performance? Are they due to dropout?
2.27 (C, Extensions to IOI paper): MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?
2.28 (C, Extensions to IOI paper): Understand what's happening in the adversarial examples, most notably the S-Inhibition heads' attention patterns (hard).
2.34 (C, Studying larger models): GPT-J contains translation heads. Can you interpret how they work and what they do?
2.35 (C, Studying larger models): Try to find and reverse engineer fancier induction heads, like pattern-matching heads; try GPT-J or GPT-NeoX.
2.36 (C, Studying larger models): What's up with few-shot learning? How does it work?
2.37 (C, Studying larger models): How does addition work? (Focus on two-digit addition.)
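For 2.37, a simple accuracy harness pins down where the behaviour lives before any interpretability. Number tokenization differs between models, so generating a few tokens and string-matching sidesteps that; the model below is just to make the sketch runnable, and the problem's intent is a larger model such as GPT-J:

```python
import random
from transformer_lens import HookedTransformer

# Illustrative model; swap in a larger one (e.g. GPT-J) for the real study.
model = HookedTransformer.from_pretrained("gpt2")

random.seed(0)
correct, n_trials = 0, 50
for _ in range(n_trials):
    a, b = random.randint(10, 99), random.randint(10, 99)
    prompt = f"{a}+{b}="
    out = model.generate(prompt, max_new_tokens=3, do_sample=False,
                         verbose=False)
    completion = out[len(prompt):].strip()
    correct += completion.startswith(str(a + b))
print(f"2-digit addition accuracy: {correct / n_trials:.0%}")
```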
2.38 (C, Studying larger models): What's up with Tim Dettmers' emergent features in the residual stream? Do they map to anything interpretable? What if we look at max activating dataset examples?
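For 2.38, the phenomenon from Dettmers' LLM.int8() paper is residual stream dimensions with unusually large activation magnitudes in big models; a screening pass just looks for per-dimension outliers. Small models show this only weakly, so the model choice below is purely to make the sketch runnable, and the 6.0 threshold is the one the paper uses:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # swap for a larger model
tokens = model.to_tokens("The capital of France is Paris, a city known for art.")
_, cache = model.run_with_cache(tokens)

for layer in [0, model.cfg.n_layers // 2, model.cfg.n_layers - 1]:
    resid = cache["resid_post", layer][0]        # [pos, d_model]
    max_per_dim = resid.abs().max(dim=0).values  # [d_model]
    outliers = (max_per_dim > 6.0).nonzero().flatten()
    print(f"layer {layer}: {len(outliers)} dims with |act| > 6:",
          outliers.tolist()[:10])
```

Any dimensions that recur across layers and prompts are the candidates to feed into max activating dataset examples.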