Commit ee8ceed6da
- use a gated activation layer for both attention and convs
- add a relative learned position bias; I believe this is similar to the T5 position encodings, but it is simpler and learned (a sketch of both of these ideas follows below)
- get rid of prepending to the attention matrix: this doesn't really work that well. The model eventually learns to attend one of its heads to these blocks, but why not just concat if that is what it is doing?
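The commit message is terse, so a minimal sketch of the first two ideas may help. It assumes the models are written in PyTorch; the class names, parameter names, and shape conventions are illustrative assumptions, not the repo's actual modules. The gating follows the common tanh-times-sigmoid pattern, and the position bias is a learned per-head scalar indexed by clipped relative distance, which is one way to be "simpler than T5" (no distance bucketing).

```python
# Minimal sketch (assumes PyTorch). Class and parameter names are illustrative,
# not the repo's actual implementations.
import torch
import torch.nn as nn


class GatedActivation(nn.Module):
    """Gated activation: tanh(a) * sigmoid(b), where a and b are the two
    halves of a single projection of the input (WaveNet/GLU-style gating)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim * 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.proj(x).chunk(2, dim=-1)
        return torch.tanh(a) * torch.sigmoid(b)


class LearnedRelativePositionBias(nn.Module):
    """Learned bias added to attention logits, indexed by the (clipped)
    relative distance between query and key positions. Unlike T5 there is
    no log-spaced bucketing: one learned scalar per head per offset."""

    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        # Offsets are clipped to [-max_distance, max_distance].
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len: int, k_len: int) -> torch.Tensor:
        device = self.bias.weight.device
        q_pos = torch.arange(q_len, device=device).unsqueeze(1)  # (q_len, 1)
        k_pos = torch.arange(k_len, device=device).unsqueeze(0)  # (1, k_len)
        rel = (k_pos - q_pos).clamp(-self.max_distance, self.max_distance)
        bias = self.bias(rel + self.max_distance)  # (q_len, k_len, num_heads)
        return bias.permute(2, 0, 1)               # (num_heads, q_len, k_len)


# Usage: add the bias to the raw attention scores before the softmax,
# e.g. for scores of shape (batch, num_heads, q_len, k_len):
#   scores = scores + LearnedRelativePositionBias(num_heads)(q_len, k_len)
```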
Repository files:

- .idea
- data
- models
- scripts
- trainer
- utils
- multi_modal_train.py
- process_video.py
- requirements.txt
- sweep.py
- test.py
- train.py
- use_discriminator_as_filter.py