RL interpretability

Please see Neel’s post for a more detailed description of the problems.

All of the problems below are in the category "Interpreting Reinforcement Learning". Each entry gives the problem number, its difficulty rating, the sub-topic where there is one, and a short description with pointers to existing work.

8.6 (B, Goal misgeneralisation): Following on from 8.5, possible starting points are the Tree Gridworld and Monster Gridworld environments from Shah et al.
8.9 (B, Decision Transformers): Try training a 1L decision transformer on a toy problem, like finding the shortest path in a graph.
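
For 8.9, here is a minimal sketch of one way to generate training data in the decision-transformer format for the shortest-path toy task: serialise optimal trajectories on a random graph as (return-to-go, state, action) tokens, which a 1L transformer can then be trained on with a standard next-token loss. The graph size, the -1-per-step reward scheme, and the token layout are my own illustrative assumptions, not part of the problem statement.

```python
import random
import networkx as nx

def make_graph(n_nodes: int = 12, p: float = 0.3, seed: int = 0) -> nx.Graph:
    """Sample a connected Erdos-Renyi graph (retry until connected)."""
    rng = random.Random(seed)
    while True:
        g = nx.erdos_renyi_graph(n_nodes, p, seed=rng.randint(0, 10**9))
        if nx.is_connected(g):
            return g

def shortest_path_episode(g: nx.Graph, rng: random.Random):
    """One optimal episode: walk a shortest path from a random start to a random goal.
    Reward is -1 per step, so the return-to-go at each state is minus the remaining steps."""
    start, goal = rng.sample(list(g.nodes), 2)
    path = nx.shortest_path(g, start, goal)
    episode = []
    for i, state in enumerate(path[:-1]):
        action = path[i + 1]          # action = which neighbour to move to
        rtg = -(len(path) - 1 - i)    # return-to-go from this state
        episode.append((rtg, state, action))
    return episode

def to_tokens(episode, n_nodes: int):
    """Flatten (rtg, state, action) triples into one token sequence.
    Returns-to-go are offset so every token is a non-negative integer."""
    rtg_offset = n_nodes              # paths are always shorter than the node count
    tokens = []
    for rtg, state, action in episode:
        tokens += [rtg + rtg_offset,
                   rtg_offset + 1 + state,
                   rtg_offset + 1 + n_nodes + action]
    return tokens

if __name__ == "__main__":
    rng = random.Random(0)
    g = make_graph()
    data = [to_tokens(shortest_path_episode(g, rng), g.number_of_nodes()) for _ in range(1000)]
    print(data[0])
```

A 1L transformer with a vocabulary of this size (e.g. a small HookedTransformer from TransformerLens) could then be trained to predict the action tokens and interpreted with the usual tools.
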
8.16 (B, Interpreting policy gradients): Can you interpret a small model trained with policy gradients on a gridworld task? (A minimal training sketch follows 8.18 below.)
8.17 (B, Interpreting policy gradients): Can you interpret a small model trained with policy gradients on an OpenAI Gym task?
8.18 (B, Interpreting policy gradients): Can you interpret a small model trained with policy gradients on an Atari game (e.g. Pong)?
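
For 8.16, here is a minimal sketch of the kind of training run you would then interpret: REINFORCE on a tiny self-contained gridworld, producing a small MLP policy whose weights and activations you can analyse. The environment, architecture, and hyperparameters are illustrative assumptions only.

```python
import torch
import torch.nn as nn

SIZE, N_ACTIONS, GAMMA = 5, 4, 0.99
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(pos, action):
    """One gridworld step: move within bounds; small step penalty, +1 on reaching the goal."""
    r, c = pos
    dr, dc = MOVES[action]
    r = min(max(r + dr, 0), SIZE - 1)
    c = min(max(c + dc, 0), SIZE - 1)
    done = (r, c) == (SIZE - 1, SIZE - 1)
    return (r, c), (1.0 if done else -0.01), done

def obs(pos):
    """One-hot encoding of the agent's position."""
    x = torch.zeros(SIZE * SIZE)
    x[pos[0] * SIZE + pos[1]] = 1.0
    return x

policy = nn.Sequential(nn.Linear(SIZE * SIZE, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(2000):
    pos, log_probs, rewards = (0, 0), [], []
    for _ in range(30):  # cap episode length
        dist = torch.distributions.Categorical(logits=policy(obs(pos)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        pos, reward, done = step(pos, action.item())
        rewards.append(reward)
        if done:
            break
    # Discounted return-to-go at each step, then the REINFORCE objective.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

torch.save(policy.state_dict(), "gridworld_policy.pt")  # the artefact to interpret
```
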
8.22 (B): Choose your own adventure! There's lots of work in RL; pick something you're excited about and try to reverse engineer it!
8.1 (C, AlphaZero): Replicate some of Tom McGrath's AlphaZero work with LeelaChessZero. Use NMF on the activations and try to interpret some of the resulting factors. See visualisations here.
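
For 8.1, here is a sketch of the NMF step applied to a generic stack of post-ReLU conv activations. The activation array below is a random placeholder standing in for whatever you extract from LeelaChessZero; the only real constraint is that NMF needs non-negative inputs, which post-ReLU activations satisfy.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder: post-ReLU activations from one residual block, shape
# [n_positions, n_channels, height, width]. Swap in real LeelaChessZero activations.
n_positions, n_channels, height, width = 512, 256, 8, 8
acts = np.abs(np.random.randn(n_positions, n_channels, height, width)).astype(np.float32)

# Flatten every board square of every position into a row, channels as columns.
X = acts.transpose(0, 2, 3, 1).reshape(-1, n_channels)

# Factorise into a small number of non-negative "factors" (combinations of channels).
n_factors = 16
nmf = NMF(n_components=n_factors, init="nndsvd", max_iter=500)
factor_acts = nmf.fit_transform(X)   # [n_positions * height * width, n_factors]
channel_weights = nmf.components_    # [n_factors, n_channels]

# Reshape factor activations back into per-position spatial maps for visualisation,
# e.g. overlay factor_maps[i, f] on board i to see where factor f fires.
factor_maps = factor_acts.reshape(n_positions, height, width, n_factors).transpose(0, 3, 1, 2)
print(factor_maps.shape)  # (512, 16, 8, 8)
```
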
8.5 (C, Goal misgeneralisation): Interpret one of the examples in the goal misgeneralisation papers (Langosco et al. and Shah et al.). Can you concretely figure out what's going on?
8.7 (C, Goal misgeneralisation): Following on from 8.5, a possible starting point is CoinRun. The Understanding RL Vision paper made significant progress on it, and Langosco et al. found it to be an example of goal misgeneralisation; can you build on these to predict the misgeneralisation?
8.10 (C): Train and interpret a model from the In-context Reinforcement Learning with Algorithm Distillation paper. They trained small transformers that take as input a sequence of moves on a "novel" RL task and output sensible actions for that task. Currently working on this: as of 10 April 2023, Victor Levoso and others are reimplementing AD to try this; there is a channel for it on this Discord: https://discord.gg/cMr5YqbU4y
8.12 (C, Interpreting RLHF Transformers): Can you find any circuits in CarperAI's RLHF model corresponding to longer-term planning?
8.13 (C, Interpreting RLHF Transformers): Can you get any traction on interpreting CarperAI's RLHF model's reward model?
8.15 (C): Try training and interpreting a small model from Guez et al. They trained model-free RL agents and showed evidence that the agents spontaneously learned planning. Can you find evidence for or against this?
8.19 (C): Can you interpret a small model trained with Q-learning on one of the tasks from 8.16-8.18?
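
For 8.19, a sketch of what changes relative to the REINFORCE setup sketched under 8.18: the network outputs one Q-value per action and is trained on the one-step TD target rather than a policy-gradient loss. The shapes and the target-network choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

GAMMA, OBS_DIM, N_ACTIONS = 0.99, 25, 4  # e.g. the 5x5 gridworld sketched above

q_net = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())  # periodically refreshed copy

def q_learning_loss(obs, action, reward, next_obs, done):
    """One-step TD error on a batch of transitions (batch dimension first)."""
    q_pred = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)  # Q(s, a) actually taken
    with torch.no_grad():
        q_next = target_net(next_obs).max(dim=1).values            # max_a' Q_target(s', a')
        td_target = reward + GAMMA * q_next * (1.0 - done)
    return nn.functional.mse_loss(q_pred, td_target)

# Dummy batch just to show the shapes; in practice these come from a replay buffer.
batch = 64
loss = q_learning_loss(
    torch.rand(batch, OBS_DIM),
    torch.randint(0, N_ACTIONS, (batch,)),
    torch.randn(batch),
    torch.rand(batch, OBS_DIM),
    torch.randint(0, 2, (batch,)).float(),
)
loss.backward()
```
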
8.20 (C): Take an agent trained with RL and train another network to copy the output logits of that agent. Try to reverse engineer the clone. Can you find the resulting circuits in the original?
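
For 8.20, a sketch of the cloning step: the clone is trained to match the frozen agent's action logits with a KL loss on states drawn (ideally) from the agent's own rollouts. Both networks and the state sampler below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, N_ACTIONS = 25, 4

# Placeholder networks: `agent` stands in for the frozen RL-trained policy,
# `clone` is the fresh network to reverse engineer afterwards.
agent = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
clone = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
for p in agent.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(clone.parameters(), lr=1e-3)

def sample_states(batch_size: int) -> torch.Tensor:
    """Placeholder: in practice, sample observations from the agent's own rollouts
    so the clone is trained on the state distribution the agent actually visits."""
    return torch.rand(batch_size, OBS_DIM)

for step in range(5000):
    obs = sample_states(128)
    with torch.no_grad():
        teacher_logprobs = F.log_softmax(agent(obs), dim=-1)
    student_logprobs = F.log_softmax(clone(obs), dim=-1)
    # KL(teacher || student): match the agent's full action distribution, not just the argmax.
    loss = F.kl_div(student_logprobs, teacher_logprobs, log_target=True, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```
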
8.21 (C): Once you've got traction understanding a fully trained agent on a task elsewhere in this category, try to extend this understanding to study the agent during training. Can you get any insight into what's actually going on?
8.2 (D, AlphaZero): Try applying 8.1 to an open-source AlphaZero-style Go-playing agent.
8.3 (D, AlphaZero): Train a small AlphaZero model on a simple game like Tic-Tac-Toe, and try to apply 8.1 there. (Training will be hard! See this tutorial.)
8.4 (D, AlphaZero): Can you extend the work on LeelaZero? Can you find anything about how a feature is computed? Start by looking for features near the start or end of the network.
8.11 (D, Interpreting RLHF Transformers): Go and interpret CarperAI's RLHF model (forthcoming). What's up with it? How is it different from a vanilla language model?
8.14 (D, Interpreting RLHF Transformers): Train a toy RLHF model (1-2 layers) to do a simple task, using GPT-3 for human data generation, then try to interpret it. (Note: this will be hard to train, but Neel would be super excited to see the results!) Bonus: try bigger models like GPT-2 Medium to XL.
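
For 8.11 and 8.14, one simple first measurement is to compare the RLHF'd model's next-token distribution against its base model's on a set of prompts and see where they diverge most. The checkpoint names below are placeholders (both set to gpt2 so the snippet runs); substitute the RLHF model and the base model it was tuned from, which this sketch assumes share a tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: swap in the RLHF'd model and the base model it was tuned from.
BASE_NAME = "gpt2"
RLHF_NAME = "gpt2"  # e.g. the CarperAI RLHF checkpoint once released

tok = AutoTokenizer.from_pretrained(BASE_NAME)
base = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
rlhf = AutoModelForCausalLM.from_pretrained(RLHF_NAME).eval()

prompts = ["I think the best way to spend a weekend is", "The government should"]

with torch.no_grad():
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").input_ids
        base_logprobs = torch.log_softmax(base(ids).logits, dim=-1)
        rlhf_logprobs = torch.log_softmax(rlhf(ids).logits, dim=-1)
        # Per-position KL(rlhf || base): large values flag the positions where RLHF
        # reshaped the next-token distribution the most.
        kl = (rlhf_logprobs.exp() * (rlhf_logprobs - base_logprobs)).sum(-1).squeeze(0)
        top = kl.argmax().item()
        print(f"{prompt!r}: max KL {kl[top].item():.3f} following the prefix "
              f"{tok.decode(ids[0, :top + 1])!r}")
```
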