Interpreting Algorithmic Problems
Deep learning mysteries
3.28
Explore the Lottery Ticket Hypothesis
Interpreting Algorithmic Problems
Deep learning mysteries
3.29
Explore Deep Double Descent
Exploring Polysemanticity and Superposition
Building toy models of superposition
4.16
Build a toy model where the model needs to deal with simultaneous interference, and try to understand how it does it, or whether it can.
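A minimal sketch of one possible setup, in the style of Anthropic's Toy Models of Superposition; the sparsity, pairing probability, and sizes below are arbitrary assumptions, not prescribed by the problem:

```python
import torch
import torch.nn as nn

n_features, d_hidden, batch_size = 20, 5, 1024
feature_prob = 0.05   # per-feature sparsity (arbitrary choice)
pair_prob = 0.5       # chance of forcing an extra feature on (arbitrary choice)

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W                            # compress features into d_hidden dims
        return torch.relu(h @ self.W.T + self.b)  # try to reconstruct them

def sample_batch():
    # Sparse features with random magnitudes...
    x = (torch.rand(batch_size, n_features) < feature_prob).float()
    x = x * torch.rand(batch_size, n_features)
    # ...plus, on a fraction of examples, an extra randomly chosen feature forced on,
    # so the model regularly has to handle two interfering features at once.
    rows = (torch.rand(batch_size) < pair_prob).nonzero(as_tuple=True)[0]
    cols = torch.randint(0, n_features, (len(rows),))
    x[rows, cols] = torch.rand(len(rows))
    return x

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5000):
    x = sample_batch()
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Inspect model.W @ model.W.T to see which features share directions, and look at
# behaviour on inputs where two specific interfering features are active together.
```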
Exploring Polysemanticity and Superposition
Studying bottleneck superposition in real language models
4.28
Can you find examples of a model learning to deal with simultaneous interference?
Exploring Polysemanticity and Superposition
4.36
Look for features in Neuroscope that seem to be represented by various neurons in a 1-2 layer language model. Train probes to detect some of them. Compare probe performance vs. neuron performance.
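A hedged starting-point sketch, assuming TransformerLens and the 1-layer model gelu-1l; the "final token is a number" feature and tiny synthetic dataset are stand-ins for whatever you actually find in Neuroscope, and in practice you would use a real corpus and a held-out split:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")

# Stand-in feature: "the final token is a number".
pos = [f"The answer to the question was {n}" for n in range(50)]
neg = [f"The answer to the question was {w}"
       for w in ["red", "blue", "wrong", "unclear", "obvious"] * 10]
texts = pos + neg
labels = np.array([1] * len(pos) + [0] * len(neg))

# MLP activations at the final token position.
acts = []
with torch.no_grad():
    for t in texts:
        _, cache = model.run_with_cache(model.to_tokens(t))
        acts.append(cache["blocks.0.mlp.hook_post"][0, -1])
X = torch.stack(acts).cpu().numpy()

# Probe on the whole layer vs. the single most informative neuron.
probe = LogisticRegression(max_iter=2000).fit(X, labels)
print("full-layer probe accuracy:", probe.score(X, labels))

best = int(np.abs(X[labels == 1].mean(0) - X[labels == 0].mean(0)).argmax())
single = LogisticRegression(max_iter=2000).fit(X[:, [best]], labels)
print(f"best single neuron ({best}) accuracy:", single.score(X[:, [best]], labels))
```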
Exploring Polysemanticity and Superposition
Getting rid of superposition
4.44
Can you take a trained model, freeze all weights except an MLP layer, widen that layer 10x by copying each neuron 10 times, add noise, and fine-tune? Does this remove superposition / add new features?
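A minimal sketch of the weight surgery, assuming a standard MLP layer with weights W_in [d_model, d_mlp], b_in [d_mlp], W_out [d_mlp, d_model]; dividing W_out by the copy factor keeps the widened layer (approximately) function-preserving at initialisation, which is a design choice rather than part of the problem statement:

```python
import torch

def widen_mlp(W_in, b_in, W_out, k=10, noise_scale=0.01):
    # Copy each neuron k times: tile input weights and biases along the neuron axis.
    W_in_wide = W_in.repeat_interleave(k, dim=1)        # [d_model, k * d_mlp]
    b_in_wide = b_in.repeat_interleave(k)               # [k * d_mlp]
    # Divide output weights by k so the k copies of a neuron sum back to the
    # original contribution, then break the symmetry between copies with noise.
    W_out_wide = W_out.repeat_interleave(k, dim=0) / k  # [k * d_mlp, d_model]
    noise = lambda W: W + noise_scale * torch.randn_like(W)
    return noise(W_in_wide), noise(b_in_wide), noise(W_out_wide)

# Wrap the returned tensors in nn.Parameter inside the widened layer, then:
# for p in model.parameters(): p.requires_grad = False
# for p in widened_mlp.parameters(): p.requires_grad = True
# ...and fine-tune on the original training distribution.
```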
Analysing Training Dynamics
Finding phase transitions
5.32
Hypothesis: Scaling laws happen because models experience a ton of tiny phase changes which average out to a smooth curve due to the law of large numbers. Can you find evidence for or against that?
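Not evidence either way, but a quick numerical illustration of what the hypothesis claims: if the loss decomposes into many components that each drop sharply at their own random time, the aggregate curve is smooth even though no individual component is (all distributions below are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
steps = np.linspace(0, 1, 1000)
n_components = 500

# Each component's loss falls from 1 to 0 in a sharp sigmoid at a random time.
times = rng.uniform(0, 1, n_components)
widths = rng.uniform(0.002, 0.02, n_components)
z = np.clip((steps[None, :] - times[:, None]) / widths[:, None], -60, 60)
components = 1 / (1 + np.exp(z))
total = components.mean(axis=0)

plt.plot(steps, components[0], label="one component (sharp phase change)")
plt.plot(steps, total, label="mean of 500 components (smooth)")
plt.xlabel("training progress"); plt.ylabel("loss"); plt.legend(); plt.show()
```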
Techniques, Tooling, and Automation
Interpreting models with LLMs
6.41
Choose your own adventure - can you find a way to usefully use an LLM to interpret models?
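One concrete version of this, roughly in the spirit of OpenAI's automated neuron-labelling work: gather a neuron's top-activating examples with TransformerLens and ask an LLM to hypothesise what the neuron detects. Here `call_llm` and the tiny text list are placeholders for your chat API of choice and a real corpus:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")
layer, neuron = 0, 123   # arbitrary example neuron

texts = ["The quick brown fox", "import numpy as np", "She walked to the store",
         "def f(x): return x + 1", "The stock market fell sharply"]  # placeholder corpus

records = []
with torch.no_grad():
    for t in texts:
        tokens = model.to_tokens(t)
        _, cache = model.run_with_cache(tokens)
        acts = cache[f"blocks.{layer}.mlp.hook_post"][0, :, neuron]
        top = int(acts.argmax())
        records.append((float(acts[top]), model.to_str_tokens(tokens)[top], t))

records.sort(reverse=True)
prompt = ("Here are text snippets and the token on which a neuron fires most "
          "strongly, sorted by activation. Suggest a short hypothesis for what "
          "the neuron detects.\n" +
          "\n".join(f"act={a:.2f} token={tok!r} text={txt!r}" for a, tok, txt in records))
explanation = call_llm(prompt)   # hypothetical helper wrapping whatever LLM API you use
print(explanation)
```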
Techniques, Tooling, and Automation
Apply techniques from non-mechanistic interpretability
6.45
Wiles et al. give an automated set of techniques for analysing bugs in image classification models. Can you get any traction adapting these to language models?
Image Model Interpretability
Building on Circuits thread
7.6
What happens if you apply causal scrubbing to the Circuits thread's claimed curve circuits algorithm? (This will take significant conceptual effort to extend to images since it's harder to precisely control input!)
Interpreting Reinforcement Learning
AlphaZero
8.2
Try applying 8.1 to an open-source AlphaZero-style Go-playing agent.
Interpreting Reinforcement Learning
AlphaZero
8.3
Train a small AlphaZero model on a simple game like Tic-Tac-Toe, and try to apply 8.1 there. (Training will be hard! See this tutorial.)
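If you want a starting scaffold, here is a bare-bones sketch (an assumed setup, not taken from the tutorial mentioned above): a Tic-Tac-Toe environment plus a small policy/value network, with MCTS and the self-play training loop left to you:

```python
import torch
import torch.nn as nn

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):  # board: list of 9 ints in {-1, 0, +1}
    for a, b, c in WINS:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def legal_moves(board):
    return [i for i, v in enumerate(board) if v == 0]

class PolicyValueNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(9, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, 9)  # logits over the 9 squares
        self.value_head = nn.Linear(hidden, 1)   # expected outcome in [-1, 1]

    def forward(self, board):
        x = self.body(board)
        return self.policy_head(x), torch.tanh(self.value_head(x))

net = PolicyValueNet()
empty_board = torch.zeros(1, 9)                 # player +1 to move
policy_logits, value = net(empty_board)
print(policy_logits.shape, value.shape)         # torch.Size([1, 9]) torch.Size([1, 1])
```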
Interpreting Reinforcement Learning
AlphaZero
8.4
Can you extend the work on LeelaZero? Can you find anything about how a feature is computed? Start by looking for features near the start or end of the network.
Interpreting Reinforcement Learning
Interpreting RLHF Transformers
8.11
Go and interpret CarperAI's RLHF model (forthcoming). What's up with that? How is it different from a vanilla language model?
Interpreting Reinforcement Learning
Interpreting RLHF Transformers
8.14
Train a toy RLHF model (1-2 layers) to do a simple task, using GPT-3 to generate the "human" feedback data, then try to interpret it. (Note: This will be hard to train, but Neel would be super excited to see the results!) Bonus: Try bigger models, like GPT-2 Medium to XL.
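A hedged sketch of the reward-model half of that pipeline in plain PyTorch, using the standard pairwise loss -log sigmoid(r_chosen - r_rejected); `encode` and the preference pairs are placeholders for your 1-2 layer transformer's representations and the GPT-3-generated comparisons:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 128
reward_head = nn.Linear(d_model, 1)
opt = torch.optim.Adam(reward_head.parameters(), lr=1e-4)

def encode(texts):
    # Hypothetical: run your toy transformer and return [batch, d_model]
    # final-token representations. Random here just so the sketch runs.
    return torch.randn(len(texts), d_model)

# Placeholder preference data: (prompt + preferred completion, prompt + worse completion).
preference_pairs = [("prompt + preferred completion", "prompt + worse completion")] * 64

for chosen, rejected in preference_pairs:
    r_chosen = reward_head(encode([chosen]))
    r_rejected = reward_head(encode([rejected]))
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The learned reward then drives the policy fine-tuning step (e.g. PPO),
# and the resulting model is what you'd try to interpret.
```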