forked from ecker/DL-Art-School
- use a gated activation layer for both attention and convs - add a learned relative position bias; I believe this is similar to the T5 position encodings, but it is simpler and learned - get rid of prepending to the attention matrix, which doesn't work that well: the model eventually learns to attend one of its heads to these blocks, so why not just concat if it is doing that? |
||
|---|---|---|
| audio | ||
| classifiers | ||
| clip | ||
| composable | ||
| diffusion | ||
| image_generation | ||
| image_latents | ||
| lucidrains | ||
| optical_flow | ||
| vqvae | ||
| __init__.py | ||
| arch_util.py | ||