Image Model Interpretability

Please see Neel’s post for a more detailed description of the problems.
Building on Circuits thread
7.7
Look for equivariance in the late layers of vision models: symmetries in a network with analogous families of neurons. This likely looks like hunting through Microscope.
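A crude first pass at this hunt can be automated before browsing by hand. The sketch below assumes you have exported a layer's spatial kernels as a NumPy array of shape (n, h, w); it rotates each kernel by 90 degrees and looks for another kernel that matches the rotated copy. The function name `rotated_matches` is hypothetical, and raw-kernel matching is a simplification: families in late layers may only show up in feature visualisations or dataset examples.

```python
import numpy as np

def rotated_matches(filters: np.ndarray, quarter_turns: int = 1):
    """For each kernel in `filters` (n, h, w), find the *other* kernel that is
    most similar (cosine similarity) to its rotated copy."""
    n = filters.shape[0]
    flat = filters.reshape(n, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    # Rotate every kernel in its spatial plane, then flatten and normalise.
    rotated = np.rot90(filters, k=quarter_turns, axes=(1, 2)).reshape(n, -1)
    rotated = rotated / np.linalg.norm(rotated, axis=1, keepdims=True)
    sims = rotated @ flat.T          # sims[i, j] = cos(rotated kernel i, kernel j)
    np.fill_diagonal(sims, -np.inf)  # a kernel may not match itself
    best = sims.argmax(axis=1)
    return best, sims[np.arange(n), best]

# Toy stand-in for real weights: kernel 1 is kernel 0 rotated by 90 degrees.
base = np.array([[1.0, 2.0], [3.0, 4.0]])
toy = np.stack([base, np.rot90(base), np.array([[1.0, 0.0], [0.0, -1.0]])])
best, score = rotated_matches(toy)
```

Pairs with a similarity close to 1 are candidates for a rotated family and worth inspecting by eye.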
Building on Circuits thread
7.9
Look for a wide array of circuits using the weight explorer. What interesting patterns and motifs can you find?
Multimodal models (CLIP interpretability)
7.10
Look at the weights connecting neurons in adjacent layers. How sparse are they? Are there any clear patterns where one neuron is constructed from previous ones?
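One way to put a number on "how sparse" is to ask how much of each neuron's incoming weight mass sits on its few largest inputs. This is a minimal sketch, assuming the weights between two adjacent layers have already been exported as a NumPy array of shape (out_neurons, in_neurons); the helper names are hypothetical and the random matrices stand in for real CLIP weights.

```python
import numpy as np

def top_k_mass(weights: np.ndarray, k: int = 5) -> np.ndarray:
    """Fraction of each output neuron's absolute weight mass carried by its
    k largest-magnitude inputs (values near 1 mean very sparse)."""
    abs_w = np.abs(weights)
    # Sort each row in descending order and sum the k largest entries.
    top_k = np.sort(abs_w, axis=1)[:, ::-1][:, :k].sum(axis=1)
    total = abs_w.sum(axis=1)
    return top_k / np.maximum(total, 1e-12)

def strongest_inputs(weights: np.ndarray, neuron: int, k: int = 5) -> np.ndarray:
    """Indices of the k previous-layer neurons with the largest |weight| into `neuron`."""
    return np.argsort(-np.abs(weights[neuron]))[:k]

# Synthetic stand-ins: a dense matrix and a copy with ~95% of entries zeroed.
rng = np.random.default_rng(0)
dense = rng.normal(size=(8, 64))
sparse = dense * (rng.random((8, 64)) < 0.05)
dense_mass = top_k_mass(dense)
sparse_mass = top_k_mass(sparse)
```

Neurons with a high `top_k_mass` are natural starting points for asking whether they are built from a handful of earlier neurons; `strongest_inputs` then says which ones to look at.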
Multimodal models (CLIP interpretability)
7.13
Can you refine the technique for generating max activating text strings? Could it be applied to language models?
7.15
Does activation patching work on Inception?
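For readers new to the technique, here is the core loop of activation patching on a toy network. It is a sketch under loud assumptions: a two-layer ReLU network in NumPy stands in for Inception, the output is a single scalar, and `forward` and `patching_effects` are hypothetical names. On a real model you would cache and overwrite activations with your framework's hook mechanism instead.

```python
import numpy as np

def forward(x, w1, w2, patch=None):
    """Toy two-layer ReLU network. `patch` maps hidden-unit index -> value
    that overwrites that unit's activation before the output layer."""
    hidden = np.maximum(w1 @ x, 0.0)
    if patch:
        hidden = hidden.copy()
        for idx, value in patch.items():
            hidden[idx] = value
    return hidden, w2 @ hidden

def patching_effects(clean_x, corrupt_x, w1, w2):
    """For each hidden unit: the fraction of the clean-vs-corrupted output gap
    recovered by running the corrupted input with that unit's clean activation.
    Assumes the clean and corrupted outputs differ."""
    clean_hidden, clean_out = forward(clean_x, w1, w2)
    _, corrupt_out = forward(corrupt_x, w1, w2)
    gap = clean_out - corrupt_out
    effects = []
    for idx in range(len(clean_hidden)):
        _, patched_out = forward(corrupt_x, w1, w2, patch={idx: clean_hidden[idx]})
        effects.append((patched_out - corrupt_out) / gap)
    return np.array(effects)

# Toy example: only hidden units 0 and 2 feed the output.
w1 = np.eye(3)
w2 = np.array([1.0, 0.0, 2.0])
effects = patching_effects(np.array([1.0, 1.0, 1.0]), np.array([0.0, 1.0, 0.0]), w1, w2)
```

The open question for images is mostly about the inputs: what counts as a sensible clean and corrupted image pair, given that pixels are much harder to control than tokens.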
Diffusion models
7.16
Apply feature visualisation to neurons in diffusion models and see if any seem clearly interpretable.
Diffusion models
7.17
Are there style transfer neurons in diffusion models? (e.g., activating on "in the style of Thomas Kinkade")
Diffusion models
7.18
Are different circuits activating when different amounts of noise are input in diffusion models?
Reverse engineering image models
7.1
Using Circuits techniques, how well can we reverse engineer ResNet?
Reverse engineering image models
7.2
Vision Transformers - can you smush together transformer circuits and image circuits techniques? Which ones transfer?
Reverse engineering image models
7.3
Using Circuits techniques, how well can we reverse engineer ConvNeXt, a modern image model architecture merging ResNet and vision transformer ideas?
Building on Circuits thread
7.4
How well can you hand-code curve detectors? Can you include color? How much performance can you recover?
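To make "hand-code" concrete, here is one possible way to build a single-channel curve filter from geometry alone. Everything in it is an assumption for illustration: the kernel size, radius, Gaussian falloff, and the names `curve_kernel` and `response` are not taken from the Circuits thread, and colour is left out entirely.

```python
import numpy as np

def curve_kernel(size: int = 9, radius: float = 6.0,
                 orientation: float = 0.0, width: float = 1.0) -> np.ndarray:
    """Hand-built filter: positive along a circular arc through the kernel
    centre, negative elsewhere, with zero mean and unit norm."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Put the circle's centre `radius` away from the kernel centre so the arc
    # passes through the middle; `orientation` picks the direction it bulges.
    cx, cy = -radius * np.cos(orientation), -radius * np.sin(orientation)
    dist = np.hypot(xs - cx, ys - cy)
    kernel = np.exp(-((dist - radius) ** 2) / (2 * width ** 2))
    kernel -= kernel.mean()          # no response to uniform patches
    return kernel / np.linalg.norm(kernel)

def response(kernel: np.ndarray, patch: np.ndarray) -> float:
    """Filter response to an image patch of the same shape."""
    return float(np.sum(kernel * patch))

k_right = curve_kernel(orientation=0.0)    # arc bulging towards +x
k_left = curve_kernel(orientation=np.pi)   # mirror image
arc = (k_right > 0).astype(float)          # crude binary image of the first arc
```

Measuring how much performance is recovered would then mean substituting kernels like these for the learned weights and re-evaluating the model, which this sketch does not attempt.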
Building on Circuits thread
7.5
Can you hand-code any other circuits? Start with other early vision neurons.
Building on Circuits thread
7.8
Dig into polysemantic neuron examples and try to better understand what's going on there.
Multimodal models (CLIP interpretability)
7.11
Can you rigorously reverse engineer any circuits, like the Curve Circuits paper?
Multimodal models (CLIP interpretability)
7.12
Can you apply transformer circuits techniques to understand the attention heads in the image part?
7.14
Train a checkpointed run of Inception. Do curve detectors form as a phase change?
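Once you have checkpoints, one simple way to frame "phase change" is to score every checkpoint for curve detection and ask how concentrated the change is in a single step. This is a sketch, assuming you already have one scalar score per checkpoint (for example, a neuron's response to curve stimuli relative to controls); `largest_jump` is a hypothetical helper, and what share counts as a phase change is a judgement call.

```python
import numpy as np

def largest_jump(scores):
    """Return (index, share): the checkpoint interval with the biggest change
    in score, and the fraction of all step-to-step change it accounts for."""
    scores = np.asarray(scores, dtype=float)
    steps = np.abs(np.diff(scores))
    idx = int(steps.argmax())
    total = steps.sum()
    share = float(steps[idx] / total) if total > 0 else 0.0
    return idx, share

# Made-up scores for five checkpoints, purely to show the call.
idx, share = largest_jump([0.0, 0.1, 0.1, 0.9, 1.0])
```

A share close to 1 means almost all of the change happened between checkpoints `idx` and `idx + 1`; a share near 1 / (number of steps) suggests gradual formation instead. Denser checkpoints around the jump would help tell the two apart.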
Building on Circuits thread
7.6
What happens if you apply causal scrubbing to the Circuits thread's claimed curve circuits algorithm? (This will take significant conceptual effort to extend to images since it's harder to precisely control input!)