at wits' end (perhaps the semantic token approach is the toughest pill to swallow)

This commit is contained in:
mrq 2025-03-01 21:03:25 -06:00
parent 1d3290b023
commit 4afa4ccce5


@@ -105,6 +105,14 @@ However, while this solution boasts being lightweight, there are some caveats fo
* speaker similarity is rather mediocre for unseen speakers; the model isn't as robust at mapping unseen speakers into its latent space as it is for seen ones.
* despite the model being rather robust otherwise, some vocal stutters still make their way in.
### "Postmortem" ""Postmortem""
That the model works at all might be entirely a fluke.
A naive embedding implementation (`./vall_e/models/base.py`) manages to "just work" for EnCodec, while other audio codecs (DAC, `nvidia/audio-codec-44khz`) fail to converge meaningfully.
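For illustration, a minimal sketch of what such a "naive" embedding could look like, assuming EnCodec-style codes of shape `(batch, levels, time)`: one `nn.Embedding` per RVQ level, summed into a single token embedding. The class and parameter names here are hypothetical, not the actual `base.py` API.

```python
import torch
import torch.nn as nn

class NaiveAudioEmbedding(nn.Module):
    """One embedding table per RVQ level, summed into one vector per frame."""
    def __init__(self, n_tokens: int = 1024, n_levels: int = 8, d_model: int = 1024):
        super().__init__()
        # independent embedding table per quantizer level
        self.embeddings = nn.ModuleList(
            nn.Embedding(n_tokens, d_model) for _ in range(n_levels)
        )

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, levels, time) -> (batch, time, d_model)
        return sum(emb(codes[:, level]) for level, emb in enumerate(self.embeddings))

x = torch.randint(0, 1024, (1, 8, 75))  # ~one second of EnCodec codes at 75 Hz
print(NaiveAudioEmbedding()(x).shape)   # torch.Size([1, 75, 1024])
```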
A more codec-aware embedding/classifier implementation (`./vall_e/models/base_v2.py`) fails to properly learn all levels for any codec, even with all the additional cruft added to help things along. Scaling the model up only makes the gradients a bit more chaotic while training progresses about the same.
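As a rough sketch of one way the "codec-aware" classifier side could be structured: a separate output head per RVQ level, so each level gets its own logits rather than sharing a single head. Again, this is illustrative rather than the actual `base_v2.py` code.

```python
import torch
import torch.nn as nn

class PerLevelClassifier(nn.Module):
    """One linear head per RVQ level, producing per-level logits in parallel."""
    def __init__(self, n_tokens: int = 1024, n_levels: int = 8, d_model: int = 1024):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, n_tokens) for _ in range(n_levels)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, d_model) -> (batch, levels, time, n_tokens)
        return torch.stack([head(hidden) for head in self.heads], dim=1)

h = torch.randn(1, 75, 1024)
print(PerLevelClassifier()(h).shape)  # torch.Size([1, 8, 75, 1024])
```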
## Notices and Citations
Unless otherwise credited/noted in this repo or within the designated Python file, this repository is [licensed](/LICENSE) under AGPLv3.