Difficulty A problems

Each problem below is grouped by subcategory and listed with its number from the original post; dates and names in parentheses note existing or in-progress work.
Understanding neurons
- 1.6: Hunt through Neuroscope for the toy models and look for interesting neurons to focus on.
- 1.7: Can you find any polysemantic neurons in Neuroscope? Explore this.
- 1.23: Choose your own adventure: take a bunch of text with interesting patterns and run the models over it. Look for tokens they do really well on and try to reverse engineer what's going on!
Circuits in natural language
- 2.13: Choose your own adventure! Try finding behaviours of your own related to natural language circuits.

Circuits in code models
- 2.17: Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic-flavoured tasks should be easiest.
Extensions to IOI paper
- 2.18: Understand IOI in the Stanford Mistral models. Does the same circuit arise? (You should be able to near-exactly copy Redwood's code for this.)
- 2.19: Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup-style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?
- 2.21: Can we reverse engineer in depth how duplicate token heads work? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates, given that the current token is always a copy of itself?
Beginner problems
- 3.1: Sorting fixed-length lists. (Format: START 4 6 2 9 MID 2 4 6 9.)
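A minimal data-generation sketch for this format, assuming an integer vocabulary where the digits are tokens 0-9 and START/MID are two extra special tokens (all names and sizes are illustrative):

```python
import random
import torch

# Illustrative vocabulary: tokens 0-9 are digits, 10 = START, 11 = MID.
START, MID = 10, 11
LIST_LEN = 4

def make_batch(batch_size: int) -> torch.Tensor:
    """Generate sequences like: START 4 6 2 9 MID 2 4 6 9."""
    seqs = []
    for _ in range(batch_size):
        xs = [random.randint(0, 9) for _ in range(LIST_LEN)]
        seqs.append([START] + xs + [MID] + sorted(xs))
    return torch.tensor(seqs)  # shape [batch, 2 * LIST_LEN + 2]

# Train a small transformer with next-token loss, masked to the positions after MID.
batch = make_batch(64)
```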
- 3.2: Sorting variable-length lists. (What's the sorting algorithm? What's the longest list you can get it to do? How does length affect accuracy?)
- 3.3: Interpret a 2L MLP (one hidden layer) trained to do modular addition. (Analogous to Neel's grokking work.)
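A minimal sketch of the training setup, assuming one-hot inputs and the modulus p = 113 from Neel's grokking work (the hidden width and other hyperparameters are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

p = 113  # modulus from the grokking work; any small prime works

# One hidden layer: input is the concatenated one-hot encodings of a and b.
model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))

a = torch.randint(0, p, (512,))
b = torch.randint(0, p, (512,))
x = torch.cat([F.one_hot(a, p), F.one_hot(b, p)], dim=-1).float()
loss = F.cross_entropy(model(x), (a + b) % p)  # target is (a + b) mod p
```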
- 3.4: Interpret a 1L MLP trained to do modular subtraction. (Analogous to Neel's grokking work.)
- 3.5: Taking the minimum or maximum of two ints.
- 3.6: Permuting lists.
- 3.7: Calculating sequences with a Fibonacci-style recurrence (predicting the next element from the previous two).
Questions about language models
- 3.21: Train a 1L attention-only transformer with rotary embeddings to predict the previous token, and reverse engineer how it does this. (5/7/23: Eric; repo: https://github.com/DKdekes/rotary-interp)
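A minimal sketch of this model in TransformerLens; `attn_only=True` and `positional_embedding_type="rotary"` are the key settings, and the remaining hyperparameters are illustrative:

```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=1,
    d_model=128,
    n_ctx=64,
    d_head=32,
    n_heads=4,
    d_vocab=1000,
    attn_only=True,                      # no MLP layers
    positional_embedding_type="rotary",  # rotary instead of learned absolute
)
model = HookedTransformer(cfg)
# Train with cross-entropy where the target at position i is the token at position i-1.
```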
Extending Othello-GPT
- 3.30: Try one of Neel's concrete Othello-GPT projects.
Confusions to study in Toy Models of Superposition
- 4.1: Does dropout create a privileged basis? Put dropout on the hidden layer of the ReLU output model and study how this changes the results. (Post, 14 April 2023: Kunvar (firstuserhere))
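A minimal sketch of the ReLU output model from Toy Models of Superposition with dropout inserted on the hidden layer (sizes and dropout rate are illustrative):

```python
import torch
import torch.nn as nn

class ReluOutputModel(nn.Module):
    """out = ReLU(W^T W x + b), with dropout on the hidden activations."""

    def __init__(self, n_features: int = 20, n_hidden: int = 5, p_drop: float = 0.1):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))
        self.drop = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [batch, n_features]
        h = x @ self.W.T   # project into the hidden bottleneck
        h = self.drop(h)   # dropout on the hidden layer
        return torch.relu(h @ self.W + self.b)

# Train to reconstruct sparse features x with importance-weighted MSE, as in the paper.
```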
- 4.5: Explore neuron superposition by training their absolute value model on functions of multiple variables. Make inputs binary (0/1) and look at the AND and OR of element pairs.
- 4.7: Adapt their ReLU output model to have a different range of feature values, and see how this affects things, e.g. make the features binary 0 or 1 (i.e., two possible values).
- 4.10: What happens if you replace ReLUs with GELUs in the toy models? (May 1, 2023: Kunvar (firstuserhere))
Studying bottleneck superposition in real language models
- 4.25: Can you find any examples of the geometric superposition configurations in the residual stream of a language model?

Comparing SoLU/GELU
- 4.37: How do the TransformerLens SoLU / GELU models compare in Neuroscope under the SoLU polysemanticity metric? (What fraction of neurons seem monosemantic?)
Understanding fine-tuning
- 5.16: How does model performance on the original training distribution change during fine-tuning?

Understanding training dynamics in language models
- 5.25: Look at attention heads on various texts and see if any have recognisable attention patterns, then analyse them over training.

Finding phase transitions
- 5.26: Look for phase transitions in the Indirect Object Identification task. (Note: this might not have a phase change.)
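A minimal sketch for tracking an IOI logit difference over training, assuming one of the Stanford CRFM GPT-2 models (which have several hundred saved checkpoints; the indices below are an illustrative subset):

```python
from transformer_lens import HookedTransformer

prompt = "When John and Mary went to the store, John gave a drink to"

logit_diffs = []
for idx in [0, 100, 200, 300]:  # illustrative checkpoint indices
    model = HookedTransformer.from_pretrained("stanford-gpt2-small-a", checkpoint_index=idx)
    logits = model(prompt)[0, -1]
    diff = logits[model.to_single_token(" Mary")] - logits[model.to_single_token(" John")]
    logit_diffs.append(diff.item())  # plot against checkpoint to look for a transition
```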
Studying path dependence
- 5.33: How similar are the outputs of the Stanford CRFM models on a given text?
- 5.35: Look for Indirect Object Identification capability in other models of approximately the same size.
- 5.38: Can you find some problem where you understand the circuits and where Git Re-Basin does work?
Breaking current techniques
- 6.1: Try to find concrete edge cases where a technique breaks: start with a misleading example in a real model, or train a toy model with one.
- 6.7: Find edge cases where ablations break. (Start with the backup name movers in the IOI circuit, where we know zero ablation breaks.)
ROME activation patching
- 6.15: In the ROME paper, activation patching is done over the outputs of 10 adjacent MLP or attention layers (look at the logit difference after patching). How do the results change when you patch single layers?
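A minimal single-layer patching sketch in TransformerLens. The prompts, answer tokens, and patched position are illustrative; note that ROME patches at the subject tokens, so adapt the position index to your prompt:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
clean, corrupt = "The Eiffel Tower is in the city of", "The Colosseum is in the city of"
_, clean_cache = model.run_with_cache(clean)

def patch_single_mlp_layer(layer: int) -> torch.Tensor:
    hook_name = utils.get_act_name("mlp_out", layer)
    def hook(act, hook):
        # Patch only the final position (illustrative; ROME uses the subject tokens).
        act[:, -1, :] = clean_cache[hook.name][:, -1, :]
        return act
    return model.run_with_hooks(corrupt, fwd_hooks=[(hook_name, hook)])

paris, rome = model.to_single_token(" Paris"), model.to_single_token(" Rome")
for layer in range(model.cfg.n_layers):
    logits = patch_single_mlp_layer(layer)[0, -1]
    print(layer, (logits[paris] - logits[rome]).item())  # how much does one layer recover?
```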
- 6.16: In the ROME paper, activation patching is done over the outputs of 10 adjacent MLP or attention layers (look at the logit difference after patching). Can you get anywhere when patching specific neurons?
Automatically find circuits
- 6.18: Automate ways to find previous token heads. (Bonus: add to TransformerLens!)
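A minimal detection sketch: a previous token head should concentrate attention on the key one position before each query, so score every head by the mean of that off-diagonal of its attention pattern (the text and threshold are illustrative):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog. " * 20
_, cache = model.run_with_cache(text)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [head, query_pos, key_pos]
    # Mean attention paid from each query to the immediately preceding token.
    prev_score = pattern.diagonal(offset=-1, dim1=-2, dim2=-1).mean(-1)
    for head, score in enumerate(prev_score):
        if score > 0.4:  # illustrative threshold
            print(f"L{layer}H{head}: {score:.2f}")
```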
- 6.19: Automate ways to find duplicate token heads. (Bonus: add to TransformerLens!)
- 6.20: Automate ways to find induction heads. (Bonus: add to TransformerLens!)
- 6.21: Automate ways to find translation heads. (Bonus: add to TransformerLens!)
Refine max activating dataset examples
- 6.36: Using 6.28: find the minimal example that activates a neuron by truncating the text. How often does this work?
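A minimal truncation sketch, assuming the neuron fires on the final token of the example and has a positive peak activation; the model, indices, and the 80% retention threshold are illustrative:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

def minimal_prefix(text: str, layer: int, neuron: int, keep_frac: float = 0.8) -> str:
    """Shortest suffix of the text keeping >= keep_frac of the neuron's activation."""
    tokens = model.to_tokens(text)

    def act_at_end(toks):
        _, cache = model.run_with_cache(toks)
        return cache["post", layer][0, -1, neuron].item()

    full = act_at_end(tokens)  # assumes a positive activation on the final token
    for start in range(tokens.shape[1] - 1, -1, -1):  # shortest suffix first
        if act_at_end(tokens[:, start:]) >= keep_frac * full:
            return model.to_string(tokens[0, start:])
    return text
```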
- 6.37: Using 6.28: can you replicate the results of the interpretability illusion for Neel's toy models by finding neurons that seem monosemantic on Python code or on C4 (web text) alone, but are polysemantic across the two?
Exploring Neuroscope
- 9.1: Explore random neurons! Use the interactive Neuroscope to test and verify your understanding.
- 9.2: Look for interesting conceptual neurons in the middle layers of larger models, like the "numbers that refer to groups of people" neuron.
- 9.3: Look for examples of detokenisation neurons.
- 9.4: Look for examples of trigram neurons (neurons that consistently activate on a pair of tokens and boost the logit of plausible next tokens).
- 9.5: Look for examples of retokenization neurons.
- 9.6: Look for examples of context neurons (e.g. base64).
- 9.7: Look for neurons that align with any of the feature ideas in 9.13-9.21.
- 9.10: How much does the logit attribution of a neuron align with the patterns in its dataset examples? Are they related?
Seeking out specific features
- 9.13: Basic syntax. (Lots of ideas in the post.)
- 9.14: Linguistic features. (Try using spaCy to automate this; lots of ideas in the post.)
- 9.15: Proper nouns. (Lots of ideas in the post.)
- 9.16: Python code features. (Lots of ideas in the post.)
- 9.20: LaTeX features. Try common commands (\left, \right) and section titles (\abstract, \introduction, etc.).
- 9.23: Disambiguation neurons: foreign-language disambiguation (e.g. "die" in Dutch vs. German vs. Afrikaans).
- 9.24: Disambiguation neurons: words with multiple meanings (e.g. "bat" as an animal or as sports equipment).
- 9.25: Search for memory management neurons (high negative cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
- 9.26: Search for signal boosting neurons (high positive cosine similarity between w_in and w_out). What do their dataset examples look like? Is there a pattern?
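A minimal sketch covering both 9.25 and 9.26: compare each neuron's input and output directions by cosine similarity (model choice and thresholds are illustrative):

```python
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

for layer in range(model.cfg.n_layers):
    w_in = model.W_in[layer]    # [d_model, d_mlp]: input direction per neuron
    w_out = model.W_out[layer]  # [d_mlp, d_model]: output direction per neuron
    cos = F.cosine_similarity(w_in.T, w_out, dim=-1)  # [d_mlp]
    memory_mgmt = (cos < -0.7).nonzero().flatten()  # candidate memory management neurons
    signal_boost = (cos > 0.7).nonzero().flatten()  # candidate signal boosting neurons
    print(layer, memory_mgmt.tolist()[:5], signal_boost.tolist()[:5])
```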
- 9.28: Can you find split-token neurons? (I.e., " Claire" vs. "Cl" + "aire"; the model should learn to identify the split-token case.)
- 9.32: Neurons which link to attention heads: duplicated token.
Curiosities about neurons
- 9.40: When you look at the max dataset examples for a specific neuron, is that neuron the most activated neuron on that text? What does this look like in general?
- 9.41: Look at the distributions of neuron activations (pre- and post-activation for GELU; pre, mid, and post for SoLU). What do these distributions look like? How heavy-tailed are they? How well can they be modelled as normal distributions?
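A minimal sketch for inspecting these distributions, assuming the TransformerLens 1-layer toy models; kurtosis (≈3 for a normal distribution) is one quick measure of how heavy-tailed they are:

```python
from transformer_lens import HookedTransformer

text = "The quick brown fox jumps over the lazy dog. " * 50
for name in ["gelu-1l", "solu-1l"]:
    model = HookedTransformer.from_pretrained(name)
    _, cache = model.run_with_cache(text)
    pre = cache["pre", 0].flatten()    # pre-activation
    post = cache["post", 0].flatten()  # post-activation
    # SoLU models additionally expose cache["mid", 0], between SoLU and LayerNorm.
    z = (post - post.mean()) / post.std()
    print(name, "kurtosis:", (z ** 4).mean().item())  # ~3 if approximately normal
```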
- 9.43: How similar are the distributions between SoLU and GELU?
- 9.44: What does the distribution of the LayerNorm scale and the softmax denominator in SoLU look like? Is it bimodal (indicating monosemantic features) or fairly smooth and unimodal?
- 9.52: Try comparing how monosemantic the neurons in a GELU vs. SoLU model are. Can you replicate the result that SoLU does better? What are the rates for each model?
Miscellaneous
- 9.59: Can you replicate the results of the interpretability illusion on SoLU models, which were trained on a mix of web text and Python code? (Find neurons that seem monosemantic on either distribution alone, but with importantly different patterns on each.)