Difficulty D problems

Difficulty: D

| Category | Subcategory | Difficulty | Number | Problem | Existing Work | Currently working | Help Wanted? |
|---|---|---|---|---|---|---|---|
| Interpreting Algorithmic Problems | Deep learning mysteries | D | 3.28 | Explore the Lottery Ticket Hypothesis | | | |
| Interpreting Algorithmic Problems | Deep learning mysteries | D | 3.29 | Explore Deep Double Descent | | | |
| Exploring Polysemanticity and Superposition | Building toy models of superposition | D | 4.16 | Build a toy model where a model needs to deal with simultaneous interference, and try to understand how it does it, or whether it can. | | | |
| Exploring Polysemanticity and Superposition | Studying bottleneck superposition in real language models | D | 4.28 | Can you find examples of a model learning to deal with simultaneous interference? | | | |
| Exploring Polysemanticity and Superposition | | D | 4.36 | Look for features in Neuroscope that seem to be represented by several neurons in a 1-2 layer language model. Train probes to detect some of them. Compare probe performance with individual neuron performance. | | | |
| Exploring Polysemanticity and Superposition | Getting rid of superposition | D | 4.44 | Can you take a trained model, freeze all weights except an MLP layer, increase that layer's width 10x, copy each neuron 10 times, add noise, and fine-tune? Does this remove superposition or add new features? | | | |
| Analysing Training Dynamics | Finding phase transitions | D | 5.32 | Hypothesis: scaling laws happen because models experience a large number of tiny phase changes, which average out to a smooth curve due to the law of large numbers. Can you find evidence for or against that? | | | |
| Techniques, Tooling, and Automation | Interpreting models with LLMs | D | 6.41 | Choose your own adventure: can you find a useful way to use an LLM to interpret models? | | | |
| Techniques, Tooling, and Automation | Apply techniques from non-mechanistic interpretability | D | 6.45 | Wiles et al. give an automated set of techniques for analysing bugs in image classification models. Can you get any traction adapting this to language models? | | | |
| Image Model Interpretability | Building on Circuits thread | D | 7.6 | What happens if you apply causal scrubbing to the Circuits thread's claimed curve circuits algorithm? (This will take significant conceptual effort to extend to images, since it's harder to precisely control the input!) | | | |
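
As a starting point for problem 4.44, here is a minimal sketch of only the "widen the layer 10x, copy each neuron 10 times, add noise" step, in plain Python on a one-hidden-layer ReLU MLP. The function names (`mlp_forward`, `widen_mlp`) and the weight layout are hypothetical, chosen for illustration. Dividing each copy's output weights by the number of copies is an assumption not stated in the problem: it keeps the widened layer computing the same function before noise and fine-tuning. Freezing the other weights and the fine-tuning loop are not shown.

```python
import random


def mlp_forward(x, w_in, b_in, w_out):
    """One-hidden-layer ReLU MLP.

    w_in[i]  : input weights of neuron i (length d_model)
    b_in[i]  : bias of neuron i
    w_out[i] : output weights of neuron i (length d_model)
    """
    hidden = [
        max(0.0, sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(w_in, b_in)
    ]
    d_out = len(w_out[0])
    return [sum(h * w[j] for h, w in zip(hidden, w_out)) for j in range(d_out)]


def widen_mlp(w_in, b_in, w_out, copies=10, noise_std=0.0, seed=0):
    """Replace every neuron with `copies` noisy duplicates.

    Assumption (not from the problem statement): output weights are divided
    by `copies`, so with noise_std=0 the widened layer matches the original.
    """
    rng = random.Random(seed)
    new_w_in, new_b_in, new_w_out = [], [], []
    for w, b, out in zip(w_in, b_in, w_out):
        for _ in range(copies):
            # Noise on the input side breaks the symmetry between copies,
            # so fine-tuning can push them towards different features.
            new_w_in.append([v + rng.gauss(0.0, noise_std) for v in w])
            new_b_in.append(b + rng.gauss(0.0, noise_std))
            new_w_out.append([v / copies for v in out])
    return new_w_in, new_b_in, new_w_out


if __name__ == "__main__":
    # Tiny illustrative layer: d_model = 2, d_mlp = 3 (made-up weights).
    w_in = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]
    b_in = [0.1, -0.2, 0.0]
    w_out = [[0.5, 0.5], [-1.0, 2.0], [0.7, -0.3]]
    wide = widen_mlp(w_in, b_in, w_out, copies=10, noise_std=0.01)
    print(len(wide[0]), "neurons after widening")
```

With `noise_std=0` the widened layer should reproduce the original outputs up to floating-point error, which gives a sanity check before fine-tuning. Whether the copies then separate into distinct, less polysemantic features is the open question the problem asks.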