RL interpretability

Please see Neel's post for a more detailed description of the problems.
8.6 (Goal misgeneralisation): Building on 8.5, possible starting points are the Tree Gridworld and Monster Gridworld from Shah et al.
8.9 (Decision Transformers): Try training a 1L decision transformer on a toy problem, like finding the shortest path in a graph.
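For 8.9, here is a minimal sketch of generating shortest-path training data. The graph distribution, the tokenization (edge list, then a source/target query, then the path, with a separator token), and the helper names are my own assumptions, not a prescribed setup; the resulting token sequences could then be fed to a 1L decision-transformer-style model (e.g. via TransformerLens).

```python
# Minimal sketch: shortest-path examples as token sequences (assumptions mine).
import random
from collections import deque

def bfs_shortest_path(adj, src, dst):
    """Return one shortest path from src to dst, or None if unreachable."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None

def make_example(n_nodes=8, edge_prob=0.3):
    """Sample a random undirected graph and one shortest-path example."""
    while True:
        adj = {i: set() for i in range(n_nodes)}
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                if random.random() < edge_prob:
                    adj[i].add(j)
                    adj[j].add(i)
        src, dst = random.sample(range(n_nodes), 2)
        path = bfs_shortest_path(adj, src, dst)
        if path is not None:
            break
    SEP = n_nodes  # special separator token id
    edge_tokens = [t for i in range(n_nodes) for j in sorted(adj[i]) if i < j for t in (i, j)]
    # Sequence: graph edges, then the (source, target) query, then the path to predict.
    return edge_tokens + [SEP, src, dst, SEP] + path

print(make_example())
```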
8.16 (Interpreting policy gradients): Can you interpret a small model trained with policy gradients on a gridworld task? (A minimal training sketch follows the 8.18 entry below.)
8.17 (Interpreting policy gradients): Can you interpret a small model trained with policy gradients on an OpenAI Gym task?
8.18 (Interpreting policy gradients): Can you interpret a small model trained with policy gradients on an Atari game (e.g., Pong)?
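For 8.16-8.18, here is a minimal REINFORCE sketch on a 5x5 gridworld (reach the bottom-right corner). The environment, one-hot state encoding, MLP policy, and hyperparameters are my assumptions; it is a toy starting point, not a tuned setup.

```python
# Minimal REINFORCE sketch on a 5x5 gridworld (assumptions mine).
import torch
import torch.nn as nn

SIZE, N_ACTIONS = 5, 4  # actions: up, down, left, right
policy = nn.Sequential(nn.Linear(SIZE * SIZE, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def step(pos, a):
    """Deterministic transition; +1 reward at the goal, small step penalty otherwise."""
    r, c = pos
    r, c = [(max(r-1, 0), c), (min(r+1, SIZE-1), c), (r, max(c-1, 0)), (r, min(c+1, SIZE-1))][a]
    done = (r, c) == (SIZE - 1, SIZE - 1)
    return (r, c), (1.0 if done else -0.01), done

def one_hot(pos):
    x = torch.zeros(SIZE * SIZE)
    x[pos[0] * SIZE + pos[1]] = 1.0
    return x

for episode in range(2000):
    pos, log_probs, rewards, done = (0, 0), [], [], False
    for _ in range(50):  # cap episode length
        dist = torch.distributions.Categorical(logits=policy(one_hot(pos)))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        pos, r, done = step(pos, a.item())
        rewards.append(r)
        if done:
            break
    # Discounted returns-to-go, then the REINFORCE loss -sum(log pi(a|s) * G).
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + 0.99 * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```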
8.22: Choose your own adventure! There's lots of work in RL - pick something you're excited about and try to reverse engineer something!
8.1 (AlphaZero): Replicate some of Tom McGrath's AlphaZero work with LeelaChessZero. Use NMF on the activations and try to interpret some of the factors. See visualisations here.
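For 8.1, here is a minimal sketch of the NMF step, assuming you have already extracted a [n_positions, n_channels] activation matrix from some layer of the network. NMF needs non-negative inputs (e.g. post-ReLU activations), and the shapes and hyperparameters below are placeholders.

```python
# Minimal NMF sketch over a pre-extracted activation matrix (placeholder data).
import numpy as np
from sklearn.decomposition import NMF

acts = np.abs(np.random.randn(1000, 256))  # placeholder: real (non-negative) activations go here
nmf = NMF(n_components=16, init="nndsvd", max_iter=500)
factors = nmf.fit_transform(acts)      # [n_positions, n_components]: factor strength per position
components = nmf.components_           # [n_components, n_channels]: which channels each factor uses
# Visualise `factors` over board positions / game states to look for interpretable factors.
```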
8.5 (Goal misgeneralisation): Interpret one of the examples in the goal misgeneralisation papers (Langosco et al. and Shah et al.). Can you concretely figure out what's going on?
8.7 (Goal misgeneralisation): Building on 8.5, a possible starting point is CoinRun. Interpreting RL Vision made significant progress, and Langosco et al. found it to be an example of goal misgeneralisation; can you build on these to predict the misgeneralisation?
8.10: Train and interpret a model from the In-Context Reinforcement Learning and Algorithmic Distillation paper. They trained small transformers which take as input a sequence of moves for a "novel" RL task and output sensible answers for that task.
10 April 2023: Victor Levoso and others are working on reimplementing AD to try this; there is a channel for it on this Discord: https://discord.gg/cMr5YqbU4y
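For 8.10, here is a minimal sketch of building Algorithmic-Distillation-style training data, assuming a multi-armed bandit task: the (action, reward) history of a simple learning algorithm (epsilon-greedy) is recorded across a whole learning run, and a transformer would be trained to predict each next action from the history so far. The task choice and names are mine, not the paper's exact setup.

```python
# Minimal sketch of Algorithmic Distillation training data (bandit task assumed).
import random

def ad_history(n_arms=5, n_steps=200, eps=0.1):
    """Run epsilon-greedy on a freshly sampled bandit; return its learning history."""
    probs = [random.random() for _ in range(n_arms)]      # hidden reward probabilities
    counts, values = [0] * n_arms, [0.0] * n_arms
    history = []                                          # [(action, reward), ...]
    for _ in range(n_steps):
        if random.random() < eps:
            a = random.randrange(n_arms)
        else:
            a = max(range(n_arms), key=lambda i: values[i])
        r = 1 if random.random() < probs[a] else 0
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]          # incremental mean update
        history.append((a, r))
    return history

# Each history is one long sequence for the transformer; if it distils the
# improvement operator, it should act better later in each sequence in-context.
sequences = [ad_history() for _ in range(1000)]
```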
8.12 (Interpreting RLHF Transformers): Can you find any circuits in CarperAI's RLHF model corresponding to longer-term planning?
8.13 (Interpreting RLHF Transformers): Can you get any traction on interpreting CarperAI's RLHF model's reward model?
8.15: Try training and interpreting a small model from Guez et al. They trained model-free RL agents and showed evidence that they spontaneously learned planning. Can you find evidence for or against this?
8.19: Can you interpret a model trained with Q-learning on a task from 8.16-8.18?
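For 8.19, here is a minimal neural Q-learning sketch on the same 5x5 gridworld as the REINFORCE sketch above (helpers redefined so it stands alone). It uses epsilon-greedy exploration with no replay buffer or target network, so it is a toy rather than a full DQN; all hyperparameters are assumptions.

```python
# Minimal neural Q-learning sketch on a 5x5 gridworld (assumptions mine).
import random
import torch
import torch.nn as nn

SIZE, N_ACTIONS, GAMMA, EPS = 5, 4, 0.99, 0.1
qnet = nn.Sequential(nn.Linear(SIZE * SIZE, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)

def step(pos, a):
    r, c = pos
    r, c = [(max(r-1, 0), c), (min(r+1, SIZE-1), c), (r, max(c-1, 0)), (r, min(c+1, SIZE-1))][a]
    done = (r, c) == (SIZE - 1, SIZE - 1)
    return (r, c), (1.0 if done else -0.01), done

def one_hot(pos):
    x = torch.zeros(SIZE * SIZE)
    x[pos[0] * SIZE + pos[1]] = 1.0
    return x

for episode in range(2000):
    pos, done = (0, 0), False
    for _ in range(50):
        with torch.no_grad():
            q = qnet(one_hot(pos))
        a = random.randrange(N_ACTIONS) if random.random() < EPS else int(q.argmax())
        next_pos, r, done = step(pos, a)
        with torch.no_grad():
            target = r + (0.0 if done else GAMMA * qnet(one_hot(next_pos)).max().item())
        loss = (qnet(one_hot(pos))[a] - target) ** 2     # TD error on the taken action
        opt.zero_grad(); loss.backward(); opt.step()
        pos = next_pos
        if done:
            break
```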
8.20: Take an agent trained with RL and train another network to copy the output logits of that agent. Try to reverse engineer the clone. Can you find the resulting circuits in the original?
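For 8.20, here is a minimal sketch of the cloning step: train a student network to match the agent's output logits with a KL loss. `agent`, `clone`, and `get_states` are placeholders for your trained agent, the student network, and a sampler of observations from the agent's visitation distribution.

```python
# Minimal logit-cloning (distillation) sketch; agent/clone/get_states are placeholders.
import torch
import torch.nn.functional as F

def distill(agent, clone, get_states, n_steps=10_000, lr=1e-3):
    opt = torch.optim.Adam(clone.parameters(), lr=lr)
    for _ in range(n_steps):
        states = get_states()                          # [batch, obs_dim]
        with torch.no_grad():
            teacher_logits = agent(states)             # [batch, n_actions]
        student_logits = clone(states)
        # KL(teacher || student) over the action distribution.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.log_softmax(teacher_logits, dim=-1),
            log_target=True,
            reduction="batchmean",
        )
        opt.zero_grad(); loss.backward(); opt.step()
    return clone
```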
8.21: Once you've got traction understanding a fully trained agent on a task elsewhere in this category, try to extend this understanding to study it during training. Can you get any insight into what's actually going on?
8.2 (AlphaZero): Try applying 8.1 to an open-source AlphaZero-style Go-playing agent.
8.3 (AlphaZero): Train a small AlphaZero model on a simple game like Tic-Tac-Toe, and try to apply 8.1 there. (Training will be hard! See this tutorial.)
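For 8.3, here is a minimal sketch of the policy/value network half of an AlphaZero-style setup for Tic-Tac-Toe; the MCTS self-play loop (the hard part) is omitted, and the board encoding and layer sizes are my assumptions rather than anything from the linked tutorial.

```python
# Minimal AlphaZero-style policy/value network for Tic-Tac-Toe (self-play loop omitted).
import torch
import torch.nn as nn

class TicTacToeNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Input: 2 planes (current player's stones, opponent's stones) over a 3x3 board.
        self.body = nn.Sequential(nn.Flatten(), nn.Linear(2 * 9, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, 9)                          # logits over the 9 squares
        self.value_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh()) # value in [-1, 1]

    def forward(self, board):                      # board: [batch, 2, 3, 3]
        h = self.body(board)
        return self.policy_head(h), self.value_head(h)

net = TicTacToeNet()
policy_logits, value = net(torch.zeros(1, 2, 3, 3))
```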
8.4 (AlphaZero): Can you extend the work on LeelaZero? Can you find anything about how a feature is computed? Start by looking for features near the start or end of the network.
8.11 (Interpreting RLHF Transformers): Go and interpret CarperAI's RLHF model (forthcoming). What's up with that? How is it different from a vanilla language model?
8.14 (Interpreting RLHF Transformers): Train a toy RLHF model (1-2 layers) to do a simple task. Use GPT-3 for human data generation. Then try to interpret it. (Note: this will be hard to train, but Neel would be super excited to see the results!) Bonus: try bigger models like GPT-2 Medium to XL.
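For 8.14, here is a minimal sketch of the reward-model half of a toy RLHF setup: a tiny network scores completions and is trained with the standard pairwise (Bradley-Terry) loss on chosen/rejected pairs, which could come from GPT-3-generated comparisons. The architecture and data format are my assumptions; for the full pipeline you would then run PPO against this reward (e.g. with trlx).

```python
# Minimal reward-model sketch with a pairwise preference loss (assumptions mine).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, tokens):                 # tokens: [batch, seq_len] of ids
        h = self.embed(tokens).mean(dim=1)     # mean-pool over the sequence
        return self.score(h).squeeze(-1)       # scalar reward per sequence

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

def reward_loss(chosen_tokens, rejected_tokens):
    """-log sigmoid(r_chosen - r_rejected): prefer the chosen completion."""
    return -F.logsigmoid(rm(chosen_tokens) - rm(rejected_tokens)).mean()

# Placeholder batch; real chosen/rejected pairs would come from preference data.
opt.zero_grad()
loss = reward_loss(torch.randint(0, 1000, (8, 16)), torch.randint(0, 1000, (8, 16)))
loss.backward(); opt.step()
```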