Open Problems in Mechanistic Interpretability

Image Model Interpretability

Please see Neel’s post on Image Model Interpretability for a more detailed description of the problems.

Image model interp

Category

Difficulty

Existing Work

Currently working

Help Wanted?

Image Model Interpretability

Building on Circuits thread

7.7

Look for equivariance in late layers of vision models, i.e. symmetries in a network with analogous families of neurons. This will likely look like hunting in Microscope.
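One way to make the search for analogous neuron families concrete is to test whether one neuron's weights are a transformed copy of another's. The sketch below is a minimal illustration under stated assumptions: the kernels are made-up 3x3 examples, `rotate90` and `cosine` are hypothetical helper names, and real late-layer neurons would need their weights expanded back to a spatial form before a comparison like this makes sense.

```python
import math

def rotate90(kernel):
    """Rotate a square 2D kernel by 90 degrees clockwise."""
    return [list(row) for row in zip(*kernel[::-1])]

def cosine(a, b):
    """Cosine similarity between two kernels, treated as flat vectors."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0

# Made-up kernels standing in for two neurons' incoming weights.
horizontal_edge = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

# Low similarity as-is, high similarity after rotating one of them:
# a hint that the two neurons belong to the same rotated family.
raw_similarity = cosine(horizontal_edge, vertical_edge)
rotated_similarity = cosine(rotate90(horizontal_edge), vertical_edge)
```

A pair that is dissimilar as-is but nearly identical after a rotation is a candidate equivariant family worth inspecting by hand.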

Image Model Interpretability

Building on Circuits thread

7.9

Look for a wide array of circuits using the weight explorer. What interesting patterns and motifs can you find?

Image Model Interpretability

Multimodal models (CLIP interpretability)

7.10

Look at the weights connecting neurons in adjacent layers. How sparse are they? Are there any clear patterns where one neuron is constructed from previous ones?
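As a starting point for the sparsity question, a minimal sketch: for each neuron, count what fraction of its incoming weights are small relative to its largest one, and list its strongest inputs. The weight matrix is made up, and `sparsity_report`, `rel_threshold` and `top_k` are hypothetical names chosen for illustration. For convolutional layers one would presumably first reduce each spatial kernel to a single number, such as its norm.

```python
def sparsity_report(weights, rel_threshold=0.1, top_k=2):
    """Summarise how concentrated each neuron's incoming weights are.

    weights[j][i] is the weight from input neuron i to output neuron j.
    """
    report = []
    for j, row in enumerate(weights):
        largest = max(abs(w) for w in row)
        # A weight is "small" if it is below rel_threshold times the
        # largest-magnitude weight feeding the same neuron.
        if largest > 0:
            small = sum(1 for w in row if abs(w) < rel_threshold * largest)
        else:
            small = len(row)
        ranked = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
        report.append({
            "neuron": j,
            "fraction_small": small / len(row),
            "top_inputs": ranked[:top_k],
        })
    return report

# Made-up weights: neuron 0 is built from one input, neuron 1 is dense,
# neuron 2 is built from two inputs with opposite signs.
toy_weights = [
    [0.90, 0.01, -0.02, 0.00],
    [0.30, -0.25, 0.35, 0.28],
    [0.00, 0.02, -0.80, 0.75],
]
report = sparsity_report(toy_weights)
```

Neurons with a high `fraction_small` and a short list of dominant inputs are the natural candidates for "constructed from previous ones".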

Image Model Interpretability

Multimodal models (CLIP interpretability)

7.13

Can you refine the technique for generating max activating text strings? Could it be applied to language models?

Image Model Interpretability

7.15

Does activation patching work on Inception?
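For readers unfamiliar with the technique, here is a minimal sketch of activation patching on a toy two-layer ReLU network, not on Inception. All weights and inputs are made up, and `forward` and `recovery` are hypothetical helpers. The idea: run the network on a corrupted input, overwrite one hidden activation with its value from the clean run, and measure how much of the clean output is recovered.

```python
def forward(x, w1, w2, patch=None):
    """Tiny two-layer ReLU network.

    patch maps a hidden-unit index to the activation value to overwrite.
    Returns (output, hidden_activations).
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(w2, hidden))
    return output, hidden

# Made-up weights: three hidden units, one scalar output.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [2.0, -1.0, 0.5]

clean = [1.0, 0.0]
corrupted = [0.0, 1.0]

clean_out, clean_hidden = forward(clean, w1, w2)
corrupt_out, _ = forward(corrupted, w1, w2)

def recovery(idx):
    """Fraction of the clean-vs-corrupted output gap restored by patching unit idx."""
    patched_out, _ = forward(corrupted, w1, w2, patch={idx: clean_hidden[idx]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

effects = [recovery(i) for i in range(len(w1))]
```

Here the third hidden unit has the same activation on both inputs, so patching it recovers nothing, while the first two units split the effect between them. On an image model the open part of the question is what counts as a sensible clean/corrupted input pair.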

Image Model Interpretability

Diffusion models

7.16

Apply feature visualisation to neurons in diffusion models and see if any seem clearly interpretable.

Image Model Interpretability

Diffusion models

7.17

Are there style transfer neurons in diffusion models? (e.g., activating on "in the style of Thomas Kinkade")

Image Model Interpretability

Diffusion models

7.18

Do different circuits activate when different amounts of noise are input to diffusion models?

Image Model Interpretability

Reverse engineering image models

7.1

Using Circuits techniques, how well can we reverse engineer ResNet?

Image Model Interpretability

Reverse engineering image models

7.2

Vision Transformers - can you smush together transformer circuits and image circuits techniques? Which ones transfer?

Image Model Interpretability

Reverse engineering image models

7.3

Using Circuits techniques, how well can we reverse engineer ConvNeXt, a modern image model architecture merging ResNet and vision transformer ideas?

Image Model Interpretability

Building on Circuits thread

7.4

How well can you hand-code curve detectors? Can you include color? How much performance can you recover?
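To give a flavour of what hand-coding means here, a toy sketch of a hand-written curve template: positive weight along an arc of a chosen radius and orientation, with the mean subtracted so that uniform inputs give no response. This is an illustration under made-up parameters, not the construction used in the Circuits thread; `arc_mask`, `curve_kernel` and `response` are hypothetical names, and a real attempt would substitute hand-written weights into the network and measure the performance recovered.

```python
import math

def arc_mask(size, radius, orientation, thickness=1.0):
    """Binary size x size grid marking pixels near a circular arc.

    The circle is centred at distance `radius` from the grid centre, in
    direction `orientation` (radians), so the arc passes through the middle
    of the grid and bends toward that direction.
    """
    c = (size - 1) / 2
    cx = c + radius * math.cos(orientation)
    cy = c + radius * math.sin(orientation)
    return [
        [1.0 if abs(math.hypot(x - cx, y - cy) - radius) <= thickness / 2 else 0.0
         for x in range(size)]
        for y in range(size)
    ]

def curve_kernel(size, radius, orientation):
    """Zero-mean template: positive on the arc, slightly negative elsewhere."""
    mask = arc_mask(size, radius, orientation)
    mean = sum(map(sum, mask)) / (size * size)
    return [[v - mean for v in row] for row in mask]

def response(kernel, image):
    """ReLU of the dot product between kernel and an image patch of equal size."""
    total = sum(k * p for krow, irow in zip(kernel, image) for k, p in zip(krow, irow))
    return max(0.0, total)

kernel = curve_kernel(9, 6.0, 0.0)
same_curve = arc_mask(9, 6.0, 0.0)          # arc bending the same way
opposite_curve = arc_mask(9, 6.0, math.pi)  # arc bending the other way

same_response = response(kernel, same_curve)
opposite_response = response(kernel, opposite_curve)
```

The template prefers the matching curve but still responds somewhat to the opposite one, which is one reason a single template is only a first step. Extending it to color would presumably mean one such kernel per input channel.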

Image Model Interpretability

Building on Circuits thread

7.5

Can you hand-code any other circuits? Start with other early vision neurons.

Image Model Interpretability

Building on Circuits thread

7.8

Dig into polysemantic neuron examples and try to better understand what's going on there.

Image Model Interpretability

Multimodal models (CLIP interpretability)

7.11

Can you rigorously reverse engineer any circuits, like the Curve Circuits paper?

Image Model Interpretability

Multimodal models (CLIP interpretability)

7.12

Can you apply transformer circuits techniques to understand the attention heads in the image part?

Image Model Interpretability

7.14

Train a checkpointed run of Inception. Do curve detectors form as a phase change?
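Once a per-checkpoint measure of curve selectivity exists, a first check for a phase change is whether most of the increase happens in a single step. The sketch below assumes such a score has already been computed; the numbers are invented purely for illustration, and `sharpest_transition` is a hypothetical helper.

```python
def sharpest_transition(scores):
    """Locate the largest single-step increase in a sequence of scores.

    Returns (checkpoint_index, jump, share), where share is the fraction of
    the total first-to-last change that happens in that one step.
    """
    jumps = [b - a for a, b in zip(scores, scores[1:])]
    idx = max(range(len(jumps)), key=lambda i: jumps[i])
    total = scores[-1] - scores[0]
    share = jumps[idx] / total if total else 0.0
    return idx + 1, jumps[idx], share

# Invented selectivity scores, one per saved checkpoint.
toy_scores = [0.02, 0.03, 0.05, 0.06, 0.55, 0.80, 0.84, 0.85]
checkpoint, jump, share = sharpest_transition(toy_scores)
```

A share close to 1 suggests an abrupt transition, while a share near 1 / (number of steps) suggests gradual formation. How to define the selectivity score itself is the harder part of the problem.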

Image Model Interpretability

Building on Circuits thread

7.6

What happens if you apply causal scrubbing to the Circuits thread's claimed curve circuits algorithm? (This will take significant conceptual effort to extend to images since it's harder to precisely control input!)

⁠
