Finetune/Train XTTS-2 with a new language #487

Closed
opened 2024-08-08 04:42:59 +00:00 by TuanVu · 2 comments

Hello, I have read some previous issues and feel you guys here are very knowledgeable about voice cloning. I recently read about xTTS2 and am wondering how to leverage its current multi-speaker, multilingual, and cross-lingual capabilities with a new language.
I see that the vocabulary size is very small and don't understand why. If I want to add my language, I think there are two options: change the tokenizer by replacing an existing language's vocab with my language's, or extend the current vocab with my language's. Could you tell me which I should do? If I go with option 2, I think it can't leverage the pretrained model, because the vocabulary size changes, so the embeddings change too.
Thank you so much

Owner

Going off of what I remember, in theory it should be as "simple" as:

  • obtain a tokenizer of the language of your choice
    • You can use a fresh tokenizer definition or just extend the existing one. You don't need to keep it the same size as long as you update the model config.
    • Adapting https://git.ecker.tech/mrq/vall-e/src/branch/master/scripts/train_tokenizer.py#L55 would be the easiest bet if you want to just throw a bunch of strings at it and see what sticks.
  • resize the text_embedding.weight in the state_dict
    • I suppose https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/utils/utils.py#L368 is a reference to doing this.
  • update the model config to use the new token count (see the sketch after this list)
    • In the constructor (for TorToiSe at least), it's number_text_tokens. I think the args are manually passed in whatever api.py variant xTTS2 mostly copied out of TorToiSe.
    • There's also a start_text_token which may or may not also be tied to number_text_tokens, since by default it's 255. I can't remember the specifics even for TorToiSe.
  • you probably also need to include an additional token / embedding for your target language. I vaguely remember xTTS2 having some funny way to "embed" the language into the prompt, but it should help guide the model.
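
For the config side, a minimal sketch of what updating the token count could look like, assuming the TorToiSe-style UnifiedVoice constructor (the import path and values here are illustrative; xTTS2's GPT wrapper takes similar arguments):

from tortoise.models.autoregressive import UnifiedVoice

# keep every other constructor argument identical to whatever the original
# checkpoint was built with; only the text-token count changes
model = UnifiedVoice(
    number_text_tokens = 1024, # must match the new tokenizer / resized text embedding
    start_text_token = 255,    # the default mentioned above; adjust if your tokenizer moves it
)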

The embedding is nothing more than a fancy lookup table to "translate" a token ID into the best representation of said token. Adding new tokens to the vocab is as simple as extending the existing embedding weights at dim=0 to the new vocab size, and re-tuning to update the new embedding values. You can swap the IDs around and it won't affect anything, and you can add in or remove token IDs and it won't affect the model. You can get away with dropping the "merge" token IDs (I don't remember where it starts off the top of my head) and put that as what else you need for your language (just remember to also drop the merge array from the old tokenizer).
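
As a rough sketch of the merge-dropping part, assuming the tokenizer JSON follows the usual Hugging Face tokenizers layout (a "model" section holding "vocab" and "merges"; the paths are placeholders):

import json

with open("./path/to/the/old_vocab.json", "r", encoding="utf-8") as f:
    tok = json.load(f)

# drop the BPE merge rules; the token IDs they fed can then be repurposed for the new language
tok["model"]["merges"] = []

# see how many IDs are actually defined before deciding what to replace
print(len(tok["model"]["vocab"]), "token IDs currently defined")

with open("./path/to/the/stripped_vocab.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)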


Also, a rough start for modifying the text embedding's weights in place:

import torch

target = 1024 # or whatever target vocab size you want
dim = 0
fn = torch.randn # or torch.zeros

weights = torch.load("./path/to/the/ar.pth")
# grab the existing embedding table first; the exact key name should match
# whatever your checkpoint actually uses
weight = weights['text_embeddings.weight']
weights['text_embeddings.weight'] = torch.stack(
    [ x for x in weight ] +
    [ fn( weight[0].shape ).to(device=weight[0].device, dtype=weight[0].dtype) for _ in range( target - weight.shape[dim] ) ]
)
torch.save( weights, "./path/to/the/new_ar.pth" )
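
And a quick sanity check afterwards (same caveat about the exact key name):

import torch

weights = torch.load("./path/to/the/new_ar.pth")
# the first dimension should now be the target vocab size
print( weights['text_embeddings.weight'].shape )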

And getting a baseline for the tokenizer definition:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

unk_token = "[UNK]"
spl_tokens = [unk_token, "[START]", "[STOP]", "[SPACE]"]

trainer = BpeTrainer(special_tokens = spl_tokens, vocab_size = 1024) # or whatever vocab size you want
tokenizer = Tokenizer(BPE(unk_token = unk_token))
tokenizer.pre_tokenizer = Whitespace()

# tokenizer_data: any iterable of training strings (your transcripts, text dumps, etc.)
tokenizer.train_from_iterator(tokenizer_data, trainer=trainer)

# look the special token IDs up from the trained vocab instead of hardcoding them
tokenizer.post_processor = TemplateProcessing(
    single="[START] $A [STOP]",
    special_tokens=[
        ("[START]", tokenizer.token_to_id("[START]")),
        ("[STOP]", tokenizer.token_to_id("[STOP]")),
    ],
)

tokenizer.save("./new_tokenizer.json")
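
A quick way to sanity-check that the result lines up with whatever you set number_text_tokens to (a sketch; swap in text from your target language):

from tokenizers import Tokenizer

new_tok = Tokenizer.from_file("./new_tokenizer.json")
encoded = new_tok.encode("hello world") # any sample sentence in the new language
print( encoded.tokens )
print( encoded.ids )

# every ID should fall below the number_text_tokens you give the model
assert max( encoded.ids ) < 1024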

I did check xTTS2's vocab (https://huggingface.co/coqui/XTTS-v2/raw/main/vocab.json) and it seems they actually do have an extended vocab (where their language "embedding" is actually just a text token; I forgot that was the case).

Base TorToiSe's vocab is the one that's limited to 256, but xTTS2 should have a lot of wiggle room if you don't want to touch the text_embedding's weights themselves, so you should be fine with just using a new definition and trying to reuse the existing one as much as you can (for example, keeping a as 14, b as 15, etc.).
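
If you go that route, a small sketch for inspecting which IDs the existing vocab already assigns (assuming the linked vocab.json loads as a standard tokenizers file):

from tokenizers import Tokenizer

old = Tokenizer.from_file("./path/to/vocab.json") # the existing xTTS2 vocab
print( old.get_vocab_size() )

# check what a few characters already map to, so a new definition can keep those IDs
for ch in [ "a", "b", "c" ]:
    print( ch, old.token_to_id(ch) )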

Author

Thank you for your explanation. It's very clear
