- use a gated activation layer for both attention & convs
- add a relativistic learned position bias. I believe this is similar to the T5 position encodings but it is simpler and learned
- get rid of prepending to the attention matrix - this doesn't really work that well. the model eventually learns to attend one of its heads to these blocks but why not just concat if it is doing that?