forked from ecker/DL-Art-School
- Use a gated activation layer for both attention and convs.
- Add a learned relative position bias. I believe this is similar to the T5 position encodings, but it is simpler and learned.
- Get rid of prepending to the attention matrix; this doesn't really work that well. The model eventually learns to attend one of its heads to these blocks, but why not just concat if it is doing that?

| Name |
|---|
| .idea |
| data |
| models |
| scripts |
| trainer |
| utils |
| multi_modal_train.py |
| process_video.py |
| requirements.txt |
| sweep.py |
| test.py |
| train.py |
| use_discriminator_as_filter.py |
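
The commit note above mentions a gated activation and a learned relative position bias. The following is a minimal, dependency-free sketch of both ideas, not the code from this repository. The function names `glu` and `relative_position_bias`, the `max_distance` clipping, and the example table values are all assumptions made for illustration. In a real model the table entries would be learned parameters (typically one table per attention head) and the resulting matrix would be added to the attention logits before the softmax. T5 itself buckets offsets logarithmically; plain clipping is used here only to keep the example short.

```python
import math


def glu(values):
    """Gated linear unit on a flat list (illustrative helper, not from the repo).

    The first half of `values` is the content, the second half is the gate.
    Each content value is scaled by sigmoid(gate value).
    """
    half = len(values) // 2
    content, gate = values[:half], values[half:]
    return [c * (1.0 / (1.0 + math.exp(-g))) for c, g in zip(content, gate)]


def relative_position_bias(seq_len, bias_table, max_distance):
    """Build a seq_len x seq_len bias matrix from a table of per-offset biases.

    `bias_table` holds one value per clipped offset in
    [-max_distance, max_distance]; in a trained model these would be learned.
    Entry [i][j] depends only on the offset (j - i), never on absolute position.
    """
    assert len(bias_table) == 2 * max_distance + 1
    bias = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Clip the offset so distant positions share one table entry.
            offset = max(-max_distance, min(max_distance, j - i))
            row.append(bias_table[offset + max_distance])
        bias.append(row)
    return bias


if __name__ == "__main__":
    table = [-0.2, -0.1, 0.0, 0.1, 0.2]  # hypothetical values for offsets -2..2
    for row in relative_position_bias(4, table, max_distance=2):
        print(row)
    print(glu([1.0, 2.0, 0.0, 0.0]))
```

Because the bias depends only on the offset between query and key positions, the same table can be reused for any sequence length, which is one reason this scheme is simpler than prepending learned blocks to the attention matrix.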