Kickstart foreign language training using XTTS weights? #426
Do you think it's possible to kickstart Tortoise foreign-language training by using the XTTS weights mrq provided here, related to issue #386?
Right now I'm trying to fine-tune a Polish model using 9k audio samples of a single speaker, accumulating to ~6 h.
I've put a lot of work into implementing Polish cleaners, and I edited the tokenizer by adding special characters and the most common vowel clusters (a rough sketch of the idea is below).
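Roughly what I mean by editing the tokenizer, as a minimal sketch: it assumes the tokenizer JSON follows the HuggingFace tokenizers BPE layout (a `model.vocab` map), and the file path and symbol list are just placeholders.

```python
# Sketch: add Polish characters / common clusters to a copy of the tokenizer.
# Assumes a HuggingFace-tokenizers-style BPE JSON with a "model" -> "vocab" map.
import json

TOKENIZER_PATH = "tokenizer_pl.json"  # placeholder: a copy of the stock tokenizer.json
NEW_SYMBOLS = ["ą", "ć", "ę", "ł", "ń", "ó", "ś", "ź", "ż", "cz", "sz", "rz"]

with open(TOKENIZER_PATH, encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]
next_id = max(vocab.values()) + 1  # keep the new ids unique

for sym in NEW_SYMBOLS:
    if sym not in vocab:
        vocab[sym] = next_id
        next_id += 1

with open(TOKENIZER_PATH, "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```

If the vocabulary actually grows, the AR model's text embedding presumably has to be resized to match, so overwriting rarely-used entries may be the easier route.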
My plan:
I suppose there are some things to change, like adding the language flag [pl] at the start of each sample in the dataset, etc. There were some issues with the redaction feature, which also uses brackets (e.g. [I'm happy] Text), but I suppose I can just change the token from [pl] to {pl} to get around that (rough sketch below). I'm training the model the standard way for one more day right now and just thinking through solutions in case it turns out to be dogshit.
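A rough sketch of the tag prepending, assuming the usual "wav_path|transcription" layout of train.txt (filenames below are placeholders):

```python
# Sketch: prepend a language tag to every line of a Tortoise/DLAS train.txt.
LANG_TAG = "{pl}"  # or "[pl]" if the redaction collision isn't an issue in your setup

with open("train.txt", encoding="utf-8") as fin, \
     open("train_pl.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.rstrip("\n")
        if not line:
            continue
        path, text = line.split("|", 1)
        fout.write(f"{path}|{LANG_TAG}{text}\n")
```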
Pretty much how you outlined it.
You'll just need to prepend every line under train.txt with [pl], and make sure to supply your modified tokenizer and cleaner code inside DLAS, in the hope of leveraging XTTS's """emergent""" language capabilities. I don't think XTTS's tokenizer/cleaner code has anything unique worth re-re-implementing back into DLAS (I honestly can't remember any difference, outside of being butthurt over seeing my shitty code for attending to Japanese being slapped in there). As long as I didn't neglect anything critical, both when "converting" (remapping) the weights and when looking over any additional implementation details with XTTS months ago, I don't see any snags with finetuning off of the XTTS weights.
The only snag I can think of with using the XTTS weights is that the base model is already pretty dogshit, but I only got around to testing from the "converted"/remapped weights, so it could just be an issue from that conversion. The only major benefit I can think of with finetuning the XTTS weights on another language is that the model should already be up to par on multi-lingual-ness, as it's already attending to language marker tokens.
The basic model turned out to be pretty OK: it's very accurate mid-sentence but starts heavily "hallucinating" at the edges. It still has this American flavor sometimes :P, but I suppose 6 h of training data is too little audio to train a new language and accent.
Also, I found that many audio samples for training and validation were cut mid-word, or sometimes even had missing/added words at the edges (I had no overlapping audio; the segmentation was just off). To combat this I recommend re-cutting the dataset with whisperx, along the lines of the sketch below.
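Something like this worked for me; treat it as a sketch only (model choice, language code, and paths are just examples, and the exact whisperx API may differ between versions):

```python
# Sketch: re-segment long source audio with whisperx so cuts land on word
# boundaries instead of mid-word. Paths and model choice are placeholders.
import os
import soundfile as sf
import whisperx

device = "cuda"
audio_path = "speaker_long.wav"

model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_path)            # 16 kHz mono float array
result = model.transcribe(audio, language="pl", batch_size=16)

# Forced alignment gives much tighter start/end timestamps per segment.
align_model, metadata = whisperx.load_align_model(language_code="pl", device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

os.makedirs("clips", exist_ok=True)
SR = 16000
for i, seg in enumerate(aligned["segments"]):
    start, end = int(seg["start"] * SR), int(seg["end"] * SR)
    sf.write(f"clips/{i:05d}.wav", audio[start:end], SR)
    # seg["text"] pairs with the clip path when rebuilding train.txt
```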
My conclusions for future readers: