Image Model Interpretability

Please see Neel’s post for a more detailed description of the problems.
Category
Difficulty
Existing Work
Currently working
Help Wanted?
Image Model Interpretability
Building on Circuits thread
B
7.7
Look for equivariance in late layers of vision models, i.e. symmetries in a network with analogous families of neurons. This likely looks like hunting in OpenAI Microscope.
Image Model Interpretability
Building on Circuits thread
B
7.9
Look for a wide array of circuits using the weight explorer. What interesting patterns and motifs can you find?
Image Model Interpretability
Multimodal models (CLIP interpretability)
B
7.10
Look at the weights connecting neurons in adjacent layers. How sparse are they? Are there any clear patterns where one neuron is constructed from previous ones?
Image Model Interpretability
Multimodal models (CLIP interpretability)
B
7.13
Can you refine the technique for generating max activating text strings? Could it be applied to language models?
Image Model Interpretability
B
7.15
Does activation patching work on Inception?
Image Model Interpretability
Diffusion models
B
7.16
Apply feature visualisation to neurons in diffusion models and see if any seem clearly interpretable.
Image Model Interpretability
Diffusion models
B
7.17
Are there style transfer neurons in diffusion models? (E.g., activating on "in the style of Thomas Kinkade".)
Image Model Interpretability
Diffusion models
B
7.18
Are different circuits activating when different amounts of noise are input in diffusion models?
Image Model Interpretability
Reverse engineering image models
C
7.1
Using Circuits techniques, how well can we reverse engineer ResNet?
Image Model Interpretability
Reverse engineering image models
C
7.2
Vision Transformers - can you smush together transformer circuits and image circuits techniques? Which ones transfer?
Image Model Interpretability
Reverse engineering image models
C
7.3
Using Circuits techniques, how well can we reverse engineer ConvNeXt, a modern image model architecture merging ResNet and vision transformer ideas?
Image Model Interpretability
Building on Circuits thread
C
7.4
How well can you hand-code curve detectors? Can you include color? How much performance can you recover?
Image Model Interpretability
Building on Circuits thread
C
7.5
Can you hand-code any other circuits? Start with other early vision neurons.
Image Model Interpretability
Building on Circuits thread
C
7.8
Dig into polysemantic neuron examples and try to better understand what's going on there.
Image Model Interpretability
Multimodal models (CLIP interpretability)
C
7.11
Can you rigorously reverse engineer any circuits, like the Curve Circuits paper?
Image Model Interpretability
Multimodal models (CLIP interpretability)
C
7.12
Can you apply transformer circuits techniques to understand the attention heads in the image part?
Image Model Interpretability
C
7.14
Train a checkpointed run of Inception. Do curve detectors form as a phase change?
Image Model Interpretability
Building on Circuits thread
D
7.6
What happens if you apply causal scrubbing to the Circuits thread's claimed curve circuits algorithm? (This will take significant conceptual effort to extend to images since it's harder to precisely control input!)
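For the equivariance problem (analogous families of neurons), one possible check is sketched below. It assumes that two neurons form a rotation family when one filter is a 90-degree rotation of the other; the filters, the image, and the helper names `rot90` and `correlate` are made up for illustration and do not come from any real model.

```python
def rot90(m):
    """Rotate a square grid (list of lists) 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*m)][::-1]


def correlate(image, kernel):
    """Valid-mode 2D cross-correlation, the operation a conv layer applies."""
    k = len(kernel)
    size = len(image) - k + 1
    return [[sum(image[y + a][x + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for x in range(size)]
            for y in range(size)]


# Hypothetical neuron A: a hand-written oriented edge filter.
edge_a = [[1, 0, -1],
          [2, 0, -2],
          [1, 0, -1]]
# Hypothetical neuron B: the same filter rotated by 90 degrees.
edge_b = rot90(edge_a)

# A small made-up image containing a diagonal boundary.
image = [[0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0]]

response_a = correlate(image, edge_a)
response_b_on_rotated = correlate(rot90(image), edge_b)

# Equivariance: B's response to the rotated image equals A's response, rotated.
is_equivariant = response_b_on_rotated == rot90(response_a)
```

Here the match is exact because `edge_b` is constructed as a rotation of `edge_a`. For learned neurons the comparison would presumably be approximate, so a tolerance or a correlation score would replace the equality test.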
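For the CLIP problem about weights connecting neurons in adjacent layers, a first-pass sparsity measurement might look like the sketch below. The function `sparsity_report`, the relative threshold, and the example weights are all assumptions made for illustration. For convolutional layers one would first need to reduce each kernel to a single number (for example its norm), which is not shown here.

```python
def sparsity_report(weights, rel_threshold=0.1, top_k=2):
    """Summarise how concentrated each output neuron's incoming weights are.

    weights[j][i] is the (scalar) weight from input neuron i to output neuron j.
    A weight counts as "small" if its magnitude is below rel_threshold times
    the largest magnitude in the same row.
    """
    report = []
    for row in weights:
        peak = max(abs(w) for w in row)
        if peak == 0:
            small = len(row)  # an all-zero row is trivially sparse
        else:
            small = sum(1 for w in row if abs(w) < rel_threshold * peak)
        # Indices of the inputs with the largest-magnitude weights.
        ranked = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
        report.append({
            "small_fraction": small / len(row),
            "top_inputs": ranked[:top_k],
        })
    return report


# Made-up weights: neuron 0 is built almost entirely from input 0,
# while neuron 1 draws on all inputs roughly equally.
example_weights = [
    [0.9, 0.01, -0.02, 0.0],
    [0.3, -0.4, 0.35, 0.5],
]
report = sparsity_report(example_weights)
```

A neuron with a high `small_fraction` and a short list of dominant inputs is a natural candidate for the "constructed from previous ones" pattern; the threshold is arbitrary and worth varying.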
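For the activation patching problem, the core procedure can be sketched on a toy network. This is not Inception: the two-layer ReLU model, its weights, the inputs, and the function names are all hypothetical stand-ins. In a real model the caching and overwriting of activations would typically be done with the framework's hook mechanism.

```python
# Toy stand-in for a vision model: a two-layer ReLU network with hand-picked
# weights. Everything here is hypothetical and for illustration only.
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]   # 2 hidden units, 3 inputs
W2 = [1.5, -0.5]          # output weights over the hidden units


def hidden(x):
    """ReLU activations of the hidden layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h):
    """Scalar output (a logit) computed from hidden activations."""
    return sum(w * hi for w, hi in zip(W2, h))


def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> activation to overwrite."""
    h = hidden(x)
    for idx, value in (patch or {}).items():
        h[idx] = value
    return output(h)


clean = [1.0, 0.0, 2.0]       # input on which the behaviour of interest occurs
corrupted = [0.0, 1.0, 0.0]   # input on which it does not

clean_h = hidden(clean)       # cache activations from the clean run
clean_out = run(clean)
corrupted_out = run(corrupted)

# Patch each hidden unit in turn into the corrupted run and record the fraction
# of the clean-vs-corrupted output gap that the patch restores.
effects = {}
for i in range(len(clean_h)):
    patched_out = run(corrupted, patch={i: clean_h[i]})
    effects[i] = (patched_out - corrupted_out) / (clean_out - corrupted_out)
```

Units whose patch restores most of the gap are the ones that carry the difference between the two inputs. For images, choosing a good clean/corrupted pair is the hard part, since inputs are harder to control precisely than text.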