RL interpretability

Please see Neel’s post for a more detailed description of the problems.

All of the problems below are in the category "Interpreting Reinforcement Learning". Each entry gives the problem number, its difficulty rating, the sub-topic where there is one, and a short description with pointers to existing work.

8.6 (B, Goal misgeneralisation): Following on from 8.5, possible starting points are the Tree Gridworld and Monster Gridworld environments from Shah et al.
8.9 (B, Decision Transformers): Try training a 1L decision transformer on a toy problem, like finding the shortest path in a graph.
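
For 8.9, here is a minimal sketch of one way to generate training data in the decision-transformer format for the shortest-path toy task: serialise optimal trajectories on a random graph as (return-to-go, state, action) tokens, which a 1L transformer can then be trained on with a standard next-token loss. The graph size, the -1-per-step reward scheme, and the token layout are my own illustrative assumptions, not part of the problem statement.

```python
import random
import networkx as nx

def make_graph(n_nodes: int = 12, p: float = 0.3, seed: int = 0) -> nx.Graph:
    """Sample a connected Erdos-Renyi graph (retry until connected)."""
    rng = random.Random(seed)
    while True:
        g = nx.erdos_renyi_graph(n_nodes, p, seed=rng.randint(0, 10**9))
        if nx.is_connected(g):
            return g

def shortest_path_episode(g: nx.Graph, rng: random.Random):
    """One optimal episode: walk a shortest path from a random start to a random goal.
    Reward is -1 per step, so the return-to-go at each state is minus the remaining steps."""
    start, goal = rng.sample(list(g.nodes), 2)
    path = nx.shortest_path(g, start, goal)
    episode = []
    for i, state in enumerate(path[:-1]):
        action = path[i + 1]          # action = which neighbour to move to
        rtg = -(len(path) - 1 - i)    # return-to-go from this state
        episode.append((rtg, state, action))
    return episode

def to_tokens(episode, n_nodes: int):
    """Flatten (rtg, state, action) triples into one token sequence.
    Returns-to-go are offset so every token is a non-negative integer."""
    rtg_offset = n_nodes              # paths are always shorter than the node count
    tokens = []
    for rtg, state, action in episode:
        tokens += [rtg + rtg_offset,
                   rtg_offset + 1 + state,
                   rtg_offset + 1 + n_nodes + action]
    return tokens

if __name__ == "__main__":
    rng = random.Random(0)
    g = make_graph()
    data = [to_tokens(shortest_path_episode(g, rng), g.number_of_nodes()) for _ in range(1000)]
    print(data[0])
```

A 1L transformer with a vocabulary of this size (e.g. a small HookedTransformer from TransformerLens) could then be trained to predict the action tokens and interpreted with the usual tools.
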
8.16 (B, Interpreting policy gradients): Can you interpret a small model trained with policy gradients on a gridworld task? (A minimal training sketch follows 8.18 below.)
8.17 (B, Interpreting policy gradients): Can you interpret a small model trained with policy gradients on an OpenAI Gym task?
8.18 (B, Interpreting policy gradients): Can you interpret a small model trained with policy gradients on an Atari game (e.g. Pong)?
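
For 8.16, here is a minimal sketch of the kind of training run you would then interpret: REINFORCE on a tiny self-contained gridworld, producing a small MLP policy whose weights and activations you can analyse. The environment, architecture, and hyperparameters are illustrative assumptions only.

```python
import torch
import torch.nn as nn

SIZE, N_ACTIONS, GAMMA = 5, 4, 0.99
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(pos, action):
    """One gridworld step: move within bounds; small step penalty, +1 on reaching the goal."""
    r, c = pos
    dr, dc = MOVES[action]
    r = min(max(r + dr, 0), SIZE - 1)
    c = min(max(c + dc, 0), SIZE - 1)
    done = (r, c) == (SIZE - 1, SIZE - 1)
    return (r, c), (1.0 if done else -0.01), done

def obs(pos):
    """One-hot encoding of the agent's position."""
    x = torch.zeros(SIZE * SIZE)
    x[pos[0] * SIZE + pos[1]] = 1.0
    return x

policy = nn.Sequential(nn.Linear(SIZE * SIZE, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(2000):
    pos, log_probs, rewards = (0, 0), [], []
    for _ in range(30):  # cap episode length
        dist = torch.distributions.Categorical(logits=policy(obs(pos)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        pos, reward, done = step(pos, action.item())
        rewards.append(reward)
        if done:
            break
    # Discounted return-to-go at each step, then the REINFORCE objective.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

torch.save(policy.state_dict(), "gridworld_policy.pt")  # the artefact to interpret
```
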
8.22 (B): Choose your own adventure! There's lots of work in RL; pick something you're excited about and try to reverse engineer it!
8.1 (C, AlphaZero): Replicate some of Tom McGrath's AlphaZero work with LeelaChessZero. Use NMF on the activations and try to interpret some of the resulting factors. See visualisations here.
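
For 8.1, here is a sketch of the NMF step applied to a generic stack of post-ReLU conv activations. The activation array below is a random placeholder standing in for whatever you extract from LeelaChessZero; the only real constraint is that NMF needs non-negative inputs, which post-ReLU activations satisfy.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder: post-ReLU activations from one residual block, shape
# [n_positions, n_channels, height, width]. Swap in real LeelaChessZero activations.
n_positions, n_channels, height, width = 512, 256, 8, 8
acts = np.abs(np.random.randn(n_positions, n_channels, height, width)).astype(np.float32)

# Flatten every board square of every position into a row, channels as columns.
X = acts.transpose(0, 2, 3, 1).reshape(-1, n_channels)

# Factorise into a small number of non-negative "factors" (combinations of channels).
n_factors = 16
nmf = NMF(n_components=n_factors, init="nndsvd", max_iter=500)
factor_acts = nmf.fit_transform(X)   # [n_positions * height * width, n_factors]
channel_weights = nmf.components_    # [n_factors, n_channels]

# Reshape factor activations back into per-position spatial maps for visualisation,
# e.g. overlay factor_maps[i, f] on board i to see where factor f fires.
factor_maps = factor_acts.reshape(n_positions, height, width, n_factors).transpose(0, 3, 1, 2)
print(factor_maps.shape)  # (512, 16, 8, 8)
```
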
8.5 (C, Goal misgeneralisation): Interpret one of the examples in the goal misgeneralisation papers (Langosco et al. and Shah et al.). Can you concretely figure out what's going on?
8.7 (C, Goal misgeneralisation): Following on from 8.5, a possible starting point is CoinRun. The Understanding RL Vision paper made significant progress on it, and Langosco et al. found it to be an example of goal misgeneralisation; can you build on these to predict the misgeneralisation?
8.10 (C): Train and interpret a model from the In-context Reinforcement Learning with Algorithm Distillation paper. They trained small transformers that take as input a sequence of moves on a "novel" RL task and output sensible actions for that task. Currently working on this: as of 10 April 2023, Victor Levoso and others are reimplementing AD to try this; there is a channel for it on this Discord: https://discord.gg/cMr5YqbU4y
8.12 (C, Interpreting RLHF Transformers): Can you find any circuits in CarperAI's RLHF model corresponding to longer-term planning?
8.13 (C, Interpreting RLHF Transformers): Can you get any traction on interpreting CarperAI's RLHF model's reward model?
8.15 (C): Try training and interpreting a small model from Guez et al. They trained model-free RL agents and showed evidence that the agents spontaneously learned planning. Can you find evidence for or against this?
8.19 (C): Can you interpret a small model trained with Q-learning on one of the tasks from 8.16-8.18?
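
For 8.19, a sketch of what changes relative to the REINFORCE setup sketched under 8.18: the network outputs one Q-value per action and is trained on the one-step TD target rather than a policy-gradient loss. The shapes and the target-network choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

GAMMA, OBS_DIM, N_ACTIONS = 0.99, 25, 4  # e.g. the 5x5 gridworld sketched above

q_net = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())  # periodically refreshed copy

def q_learning_loss(obs, action, reward, next_obs, done):
    """One-step TD error on a batch of transitions (batch dimension first)."""
    q_pred = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)  # Q(s, a) actually taken
    with torch.no_grad():
        q_next = target_net(next_obs).max(dim=1).values            # max_a' Q_target(s', a')
        td_target = reward + GAMMA * q_next * (1.0 - done)
    return nn.functional.mse_loss(q_pred, td_target)

# Dummy batch just to show the shapes; in practice these come from a replay buffer.
batch = 64
loss = q_learning_loss(
    torch.rand(batch, OBS_DIM),
    torch.randint(0, N_ACTIONS, (batch,)),
    torch.randn(batch),
    torch.rand(batch, OBS_DIM),
    torch.randint(0, 2, (batch,)).float(),
)
loss.backward()
```
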
8.20 (C): Take an agent trained with RL and train another network to copy the output logits of that agent. Try to reverse engineer the clone. Can you find the resulting circuits in the original?
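
For 8.20, a sketch of the cloning step: the clone is trained to match the frozen agent's action logits with a KL loss on states drawn (ideally) from the agent's own rollouts. Both networks and the state sampler below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, N_ACTIONS = 25, 4

# Placeholder networks: `agent` stands in for the frozen RL-trained policy,
# `clone` is the fresh network to reverse engineer afterwards.
agent = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
clone = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
for p in agent.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(clone.parameters(), lr=1e-3)

def sample_states(batch_size: int) -> torch.Tensor:
    """Placeholder: in practice, sample observations from the agent's own rollouts
    so the clone is trained on the state distribution the agent actually visits."""
    return torch.rand(batch_size, OBS_DIM)

for step in range(5000):
    obs = sample_states(128)
    with torch.no_grad():
        teacher_logprobs = F.log_softmax(agent(obs), dim=-1)
    student_logprobs = F.log_softmax(clone(obs), dim=-1)
    # KL(teacher || student): match the agent's full action distribution, not just the argmax.
    loss = F.kl_div(student_logprobs, teacher_logprobs, log_target=True, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```
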
8.21 (C): Once you've got traction understanding a fully trained agent on a task elsewhere in this category, try to extend this understanding to study the agent during training. Can you get any insight into what's actually going on?
8.2 (D, AlphaZero): Try applying 8.1 to an open-source AlphaZero-style Go-playing agent.
8.3 (D, AlphaZero): Train a small AlphaZero model on a simple game like Tic-Tac-Toe, and try to apply 8.1 there. (Training will be hard! See this tutorial.)
8.4 (D, AlphaZero): Can you extend the work on LeelaZero? Can you find anything about how a feature is computed? Start by looking for features near the start or end of the network.
8.11 (D, Interpreting RLHF Transformers): Go and interpret CarperAI's RLHF model (forthcoming). What's up with it? How is it different from a vanilla language model?
8.14 (D, Interpreting RLHF Transformers): Train a toy RLHF model (1-2 layers) to do a simple task, using GPT-3 for human data generation, then try to interpret it. (Note: this will be hard to train, but Neel would be super excited to see the results!) Bonus: try bigger models like GPT-2 Medium to XL.
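
For 8.11 and 8.14, one simple first measurement is to compare the RLHF'd model's next-token distribution against its base model's on a set of prompts and see where they diverge most. The checkpoint names below are placeholders (both set to gpt2 so the snippet runs); substitute the RLHF model and the base model it was tuned from, which this sketch assumes share a tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: swap in the RLHF'd model and the base model it was tuned from.
BASE_NAME = "gpt2"
RLHF_NAME = "gpt2"  # e.g. the CarperAI RLHF checkpoint once released

tok = AutoTokenizer.from_pretrained(BASE_NAME)
base = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
rlhf = AutoModelForCausalLM.from_pretrained(RLHF_NAME).eval()

prompts = ["I think the best way to spend a weekend is", "The government should"]

with torch.no_grad():
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").input_ids
        base_logprobs = torch.log_softmax(base(ids).logits, dim=-1)
        rlhf_logprobs = torch.log_softmax(rlhf(ids).logits, dim=-1)
        # Per-position KL(rlhf || base): large values flag the positions where RLHF
        # reshaped the next-token distribution the most.
        kl = (rlhf_logprobs.exp() * (rlhf_logprobs - base_logprobs)).sum(-1).squeeze(0)
        top = kl.argmax().item()
        print(f"{prompt!r}: max KL {kl[top].item():.3f} following the prefix "
              f"{tok.decode(ids[0, :top + 1])!r}")
```
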