Finetune/Train XTTS-2 with a new language #487
Hello, I have read some previous issues and feel you guys here are very knowledgeable about voice cloning. I recently read about xTTS2 and am wondering how to leverage its current multi-speaker, multilingual, and cross-lingual capabilities with a new language.

I see that the vocabulary size is very small and don't understand why. If I want to add my language, I think there are two options: change the tokenizer by replacing an existing language's vocab with my language's, or extend the current vocab with my language's. Could you tell me which I should do? If I go with option 2, I think the pretrained model can't be leveraged, because the vocabulary size changes and so the embeddings change too.

Thank you so much.
Going off of what I remember, in theory it should be as "simple" as modifying:

- `text_embedding.weight` in the state_dict
- `number_text_tokens`. I think the args are manually passed in whatever `api.py` variant xTTS2 mostly copied out of TorToiSe.
- `start_text_token`, which may or may not also be tied to `number_text_tokens`, since by default it's 255. I can't remember the specifics even for TorToiSe.

The embedding is nothing more than a fancy lookup table to "translate" a token ID into the best representation of said token. Adding new tokens to the vocab is as simple as extending the existing embedding weights at `dim=0` to the new vocab size, and re-tuning to update the new embedding values. You can swap the IDs around and it won't affect anything, and you can add or remove token IDs and it won't affect the model. You can get away with dropping the "merge" token IDs (I don't remember where they start off the top of my head) and using that range for whatever else you need for your language (just remember to also drop the merge array from the old tokenizer).

A rough start to modifying the text embedding's weights in place:
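A minimal sketch of that, assuming a TorToiSe/XTTS-style checkpoint; the file path, target size, and the `gpt.text_embedding.weight` key name are assumptions, so check your own state_dict's keys first:

```python
# Sketch only: pad `text_embedding.weight` along dim=0 to a larger vocab size.
# Key names and paths are assumptions; inspect the checkpoint's keys first.
import torch

ckpt_path = "model.pth"          # hypothetical checkpoint path
new_vocab_size = 8000            # hypothetical target vocab size

ckpt = torch.load(ckpt_path, map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

emb_key = "gpt.text_embedding.weight"   # assumed key name
old = state[emb_key]
old_vocab_size, dim = old.shape

# Initialize the new rows from the statistics of the existing rows so the
# fine-tune starts from something reasonable instead of zeros.
extra = torch.randn(new_vocab_size - old_vocab_size, dim) * old.std() + old.mean()
state[emb_key] = torch.cat([old, extra], dim=0)

# If the model has a separate text head over the vocab (e.g. `gpt.text_head.weight`),
# it would need the same treatment.
torch.save(ckpt, "model.extended.pth")
```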
And getting a baseline for the tokenizer definition:
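Something along these lines should work for inspecting what's already there, assuming the HuggingFace-tokenizers style `vocab.json` that xTTS2 ships (adjust the path and layout if yours differs):

```python
# Sketch only: dump the existing tokenizer definition to use as a baseline.
import json

with open("vocab.json", "r", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]     # token string -> token ID
merges = tok["model"]["merges"]   # BPE merge rules

print(f"{len(vocab)} tokens, {len(merges)} merges")
# The highest existing ID tells you where new tokens can be appended.
print("next free ID:", max(vocab.values()) + 1)
```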
I did check xTTS2's vocab and it seems they actually do have an extended vocab (where their language "embedding" is actually just a text token; I forgot that was the case).

Base TorToiSe's vocab is the one that's limited to 256, but xTTS2 should have a lot of wiggle room if you don't want to touch the `text_embedding`'s weights themselves, so you should be fine with just using a new definition and trying to reuse the existing one as much as you can (for example, keeping `a` as 14, `b` as 15, etc.).
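For illustration, extending the definition in place while leaving the existing IDs untouched could look something like this (file names and the token list are placeholders, not xTTS2's actual new-language tokens):

```python
# Sketch only: append new tokens at fresh IDs so existing tokens (e.g. "a")
# keep pointing at the same pretrained embedding rows.
import json

with open("vocab.json", "r", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]
new_tokens = ["[xx]", "ȃ", "ŵ"]   # placeholder language token and characters

next_id = max(vocab.values()) + 1
for t in new_tokens:
    if t not in vocab:            # never reassign an ID that's already taken
        vocab[t] = next_id
        next_id += 1

with open("vocab.extended.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```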
Thank you for your explanation. It's very clear.