Analysing Training Dynamics
Algorithmic tasks - lottery tickets
5.1
All Neel's toy models (attn-only, gelu, solu) were trained with the same data shuffle and weight initialisation. Many induction heads aren't shared, but L2H3 in 3L and L1H6 in 2L always are. What's up with that?