Kickstart foreign language training using XTTS weights? #426

Closed
opened 2023-10-24 04:26:26 +07:00 by MisterCapi · 2 comments

Do you think it's possible to kickstart Tortoise foreign-language training by using the XTTS weights mrq provided [here](https://huggingface.co/ecker/coqui-xtts), related to issue [#386](https://git.ecker.tech/mrq/ai-voice-cloning/issues/386)?

Right now I'm trying to fine-tune a Polish model using 9k audio samples of a single speaker, totaling ~6h.
I've put a lot of work into implementing Polish cleaners and editing the tokenizer, adding special characters and the most common vowel clusters.
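For anyone attempting the same, extending the vocab can be as simple as editing the tokenizer JSON. A minimal sketch, assuming a Tortoise-style HuggingFace `tokenizers` file with the vocab stored under `model.vocab` (the character and cluster lists below are illustrative, not the ones I actually used):

```python
# A minimal sketch, assuming a Tortoise-style HuggingFace `tokenizers` JSON
# with the vocab under model.vocab; character/cluster lists are illustrative.
import json

POLISH_CHARS = list("ąćęłńóśźż")
VOWEL_CLUSTERS = ["ia", "ie", "io", "ią", "ię"]  # example clusters only

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]
next_id = max(vocab.values()) + 1
for piece in POLISH_CHARS + VOWEL_CLUSTERS:
    if piece not in vocab:
        vocab[piece] = next_id
        next_id += 1

with open("tokenizer_pl.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```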

My plan:

  1. Transplant the XTTS weights into Tortoise, and add the XTTS tokenizer and diffusion decoder.
  2. Fine-tune on this transplanted model.

I suppose there are some things to change, like adding the language flag `[pl]` at the start of each sample in the dataset, etc.
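For instance, assuming the usual `wavpath|text` layout of `train.txt` (a convention I'm assuming here, so double-check your own dataset), prepending the flag could look like:

```python
# A minimal sketch: prepend the [pl] language flag to the text column of
# every dataset line. Assumes the usual "wavpath|text" train.txt layout.
from pathlib import Path

dataset = Path("training/train.txt")  # illustrative path
flagged = []
for line in dataset.read_text(encoding="utf-8").splitlines():
    wav, text = line.split("|", 1)
    flagged.append(f"{wav}|[pl]{text}")
dataset.write_text("\n".join(flagged) + "\n", encoding="utf-8")
```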

There were some issues with the redaction, like `[I'm happy] Text`:

> Utilizing the language codes with XTTS's tokenizer would require wav2vec2 redaction (the `[I am happy]` stuff), since `[en] Something` will trigger the redaction code and throw an error. I suppose whoever was responsible for the implementation couldn't be bothered with the logic to make it work. I suppose I can patch this myself by instead having `{en}` and whatnot in the tokenizer vocab. It's necessary if a user wants to use it and leverage the cross-lingual linguistics.

But I suppose I can just edit the token from `[pl]` to `{pl}` to work around this.
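If it comes to that, the swap itself is trivial; a minimal sketch (paths illustrative, and note `{pl}` would also need to be in the tokenizer vocab):

```python
# A minimal sketch of the [pl] -> {pl} workaround: curly braces don't match
# the [bracket] pattern that triggers Tortoise's redaction code.
# Note: {pl} must also be added to the tokenizer vocab so it still
# tokenizes as a single marker token.
from pathlib import Path

dataset = Path("training/train.txt")  # illustrative path
dataset.write_text(
    dataset.read_text(encoding="utf-8").replace("[pl]", "{pl}"),
    encoding="utf-8",
)
```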

I'm training the model the standard way for one more day right now, and just thinking of solutions in case it turns out to be dogshit.


Pretty much how you outlined it.

You'll just need to prepend every line under `train.txt` with `[pl]`, and make sure to supply your modified tokenizer and cleaner code inside DLAS [here](https://git.ecker.tech/mrq/DL-Art-School/src/branch/master/dlas/data/audio/voice_tokenizer.py), in hopes of leveraging XTTS's """emergent""" language capabilities. I don't think XTTS's tokenizer/cleaner code does anything unique that's worth re-re-implementing back into DLAS (I honestly can't remember any differences outside of being butthurt over seeing my shitty code for attending to Japanese slapped in).
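A cleaner in this context is just a text-normalization function; a minimal sketch of what a Polish one might look like (the normalization rules here are illustrative, not the actual cleaner used, and should match whatever alphabet the edited vocab covers):

```python
# A minimal sketch of a Polish cleaner; rules are illustrative and should
# match the alphabet covered by your edited tokenizer vocab.
import re

_whitespace_re = re.compile(r"\s+")

def polish_cleaners(text: str) -> str:
    """Lowercase, keep Polish diacritics, drop stray symbols, collapse spaces."""
    text = text.lower()
    # keep letters (incl. Polish diacritics), digits, basic punctuation,
    # and the [pl]/{pl} marker characters (an assumption about the alphabet)
    text = re.sub(r"[^a-ząćęłńóśźż0-9 .,!?'\-{}\[\]]", " ", text)
    return _whitespace_re.sub(" ", text).strip()
```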

As long as I didn't neglect anything critical, both when "converting" (remapping) the weights and when looking over any additional implementation details of XTTS months ago, I don't see any snags with finetuning off of the XTTS weights.
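For the curious, the "conversion" amounts to renaming state-dict keys so the two checkpoints line up; a minimal sketch (the key prefixes and filenames below are hypothetical, not the actual XTTS/Tortoise layout):

```python
# A minimal sketch of state-dict remapping; key prefixes and filenames are
# hypothetical stand-ins, not the actual XTTS/Tortoise checkpoint layout.
import torch

xtts_sd = torch.load("xtts_model.pth", map_location="cpu")

# hypothetical prefix translation between the two checkpoints
remapped = {k.replace("gpt.", "autoregressive."): v for k, v in xtts_sd.items()}

torch.save(remapped, "autoregressive.pth")
# when loading, strict=False lets any layers that don't line up be skipped:
# model.load_state_dict(remapped, strict=False)
```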

The only snag I can think of with using the XTTS weights is that the base model is already pretty dogshit, but I only got around to testing from the "converted"/remapped weights, so it could just be an issue with the conversion. The only major benefit I can think of with finetuning the XTTS weights on another language is that the model should already be up to par with multi-lingual-ness, as it's already attending to language marker tokens.


The basic model turned out to be pretty OK: it's very accurate mid-sentence but starts heavily "hallucinating" at the edges. It still has this American flavor sometimes :P, but I suppose 6h of training data is too little audio to train a new language and accent.

Also, I figured out that many audio samples for training and validation were cut mid-word, or sometimes even had missing/added words at the edges (I had no overlapping audio; the segmentation was just off). To combat this, I recommend cutting the dataset using [whisperx](https://github.com/m-bain/whisperX).
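A minimal sketch of that whisperx cut (API per the whisperX README; model size, batch size, language, and paths are my assumptions):

```python
# A minimal sketch of segmenting a dataset with whisperx; model size,
# language, and paths are assumptions — see the whisperX README.
import os

import soundfile as sf
import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio("speaker.wav")  # resampled to 16 kHz
result = model.transcribe(audio, batch_size=16)

# forced alignment tightens the timestamps so cuts land on word boundaries
align_model, metadata = whisperx.load_align_model(language_code="pl", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

SR = 16000
os.makedirs("clips", exist_ok=True)
for i, seg in enumerate(result["segments"]):
    clip = audio[int(seg["start"] * SR):int(seg["end"] * SR)]
    sf.write(f"clips/{i:05d}.wav", clip, SR)
    print(f"clips/{i:05d}.wav|{seg['text'].strip()}")  # train.txt-style line
```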

My conclusions for future readers:

  • Don't bother with the XTTS weights; it's more headache than it's worth (good luck fighting CLVP, which is limited to 256 tokens). Better to spend the time gathering a good dataset.
  • I've tried pyannote and whisperx for segmentation; the latter is more accurate and also transcribes, so win-win. Here's [my shitty script to save you time](https://github.com/MisterCapi/auto_dataset_tts/blob/master/main.py).