Found out that batch norm is causing the switches to initialize really poorly: they end up using only a small number of the transforms. Might be a great time to reconsider using the attention norm, but for now just re-enable it.
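For reference, a minimal sketch of the kind of check this is based on. Everything here is an assumption, not the actual code: `Switch`, the softmax gating over K transforms, and `effective_transforms` are hypothetical stand-ins, used only to show one way to measure how many transforms a switch actually spreads its weight over at init, with and without a batch norm in front of the gate.

```python
# Hypothetical sketch: a K-way "switch" that softmax-gates between K transforms,
# with an optional BatchNorm in front of the gating logits. The diagnostic below
# estimates how many transforms are actually used at init.
import torch
import torch.nn as nn


class Switch(nn.Module):
    def __init__(self, dim, num_transforms, use_batch_norm=True):
        super().__init__()
        self.norm = nn.BatchNorm1d(dim) if use_batch_norm else nn.Identity()
        self.gate = nn.Linear(dim, num_transforms)
        self.transforms = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_transforms)]
        )

    def forward(self, x):
        weights = torch.softmax(self.gate(self.norm(x)), dim=-1)      # (B, K)
        outs = torch.stack([t(x) for t in self.transforms], dim=-1)   # (B, D, K)
        return (outs * weights.unsqueeze(1)).sum(-1), weights


def effective_transforms(weights):
    # Exponentiated entropy of the batch-averaged gate distribution:
    # ~K when all transforms share the load equally, ~1 when one dominates.
    p = weights.mean(0)
    return torch.exp(-(p * p.clamp_min(1e-12).log()).sum()).item()


if __name__ == "__main__":
    x = torch.randn(512, 64)
    for use_bn in (True, False):
        switch = Switch(64, num_transforms=8, use_batch_norm=use_bn)
        switch.train()
        _, w = switch(x)
        print(f"batch_norm={use_bn}: ~{effective_transforms(w):.2f} of 8 transforms used at init")
```

The effective-transform count is just one convenient scalar for "how many transforms get used"; histogramming the argmax of the gate weights over a batch would show the same collapse if only a few transforms dominate at init.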