Discussion about Fine Tuning on a different language. #147

Open
opened 2023-03-17 12:37:33 +00:00 by Brugio96 · 13 comments

Hello everyone,

I know that there are other issues on fine-tuning Tortoise for different languages, but I think it would be helpful to have a centralized discussion to gather and share findings and useful information, possibly building a guide for future reference.

Although I haven't yet tried fine-tuning Tortoise for my language, which is Italian, I've read that it's possible with other languages.

Based on what I read, the following steps are needed or recommended when fine-tuning for a new language:

  • Change the text tokenizer
  • Increase the Text LR Ratio to 1

It would be great to expand the discussion to understand why these steps are necessary, which tokenizers have produced good or bad results, what are the characteristics the tokenizer needs to satisfy etc...

In your opinion and experience, what else can be beneficial when fine-tuning for a new language?

Some possible questions to consider include:

  • How much data is needed to obtain a good model for the new language?
  • Did you use a single speaker or multi-speaker model?
  • What training parameters did you use, and how many epochs did it take?

Then a more general question: which other models, aside from the AR model, should be fine-tuned to achieve better results?

Owner

> I've read that it's possible with other languages.

Yeah, there's a French model floating around, which surprised me, as it was one of the earlier finetunes I've seen. I can't vouch for how good it is, since I haven't used it myself yet, but ironically, if it was trained with the other DLAS fork, then it should be good.

My Japanese finetunes have varied from unfavorable to decent, but those problems stem from other issues I've since resolved.

> Change the text tokenizer

As things currently stand, you have to *really* be careful about how you go about it, as it's been a major pain in my ass to wrangle.

I imagine you'll get more benefit from replacing the lesser-used tokens at the end with what you want: more than likely, extra symbols and combinations of tokens (merges).

There's also the issue of both DLAS and TorToiSe pre-processing your text to normalize it by cleaning it up and ASCII-izing anything, so be wary of that; if you need it disabled, you can set the `pre-tokenizer` to `null`.
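For reference, a minimal sketch of what that edit could look like, assuming the vocab file follows the usual Hugging Face `tokenizers` JSON layout (where the key is spelled `pre_tokenizer`); the path is just a placeholder:

```python
import json

# Sketch: null out the pre-tokenizer in a tokenizer vocab file so that
# pre-tokenization step is skipped. Assumes the Hugging Face tokenizers
# JSON layout; the path is a placeholder, not a file from this repo.
path = "models/tokenizers/my_language.json"

with open(path, "r", encoding="utf-8") as f:
    vocab = json.load(f)

vocab["pre_tokenizer"] = None  # serialized back out as null

with open(path, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
```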

Apparently it's doable, and I imagine those people did the above rather than what I did, which was completely replacing the tokenizer.

> Increase the Text LR Ratio to 1

Definitely do this: even without swapping the tokenizer, you're effectively rewriting the sequences of tokens (virtual phonemes), and with a new tokenizer, you're rewriting what those tokens (virtual phonemes) mean.

For languages, you *could* reduce the mel LR ratio if you want to try to retain the variety of the base model, but I haven't managed to get that to happen: my mel LR ratio still seemed too high even when setting it to 0.01 (which makes sense, as that just takes the LR from 1e-5 to 1e-7).
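To make that arithmetic explicit (a sketch of the scaling described above, not the actual DLAS code):

```python
# Effective per-module learning rates: the base LR scaled by each ratio,
# using the values quoted in this thread.
base_lr = 1e-5

text_lr_ratio = 1.0   # rewriting what the text tokens (virtual phonemes) mean
mel_lr_ratio = 0.01   # trying to preserve the base model's acoustic variety

print(f"text LR: {base_lr * text_lr_ratio:.0e}")  # 1e-05
print(f"mel LR:  {base_lr * mel_lr_ratio:.0e}")   # 1e-07
```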

> It would be great to expand the discussion to understand why these steps are necessary, which tokenizers have produced good or bad results, what are the characteristics the tokenizer needs to satisfy etc...

I've got decent Japanese without slotting out the tokenizer, and the French one most definitely uses the default one. It's *decent* for Japanese, and it'll most definitely be fine for Italian, but my issues come from:

  • the default tokenizer will parse `言わんこっちゃない` => `['y', 'an', '[SPACE]', 'w', 'an', 'k', 'o', 't', 'su', 'ch', 'i', 'y', 'an', 'a', 'i']`, where:
    • 言 is romanized wrong (the ASCII-izer might be using Chinese readings for kanji instead)
    • a space is inserted after the kanji even though it doesn't need one
    • syllables don't merge (`'k', 'o'` should be `'ko'`)
    • syllables merge wrong (`'w', 'an'` should be `'wa', 'n'`)

But *technically* you can just bruteforce the merge problem (as the GPT model will "know" that a sequence of tokens maps to a sequence of phonemes). The harmful problem stems from the ASCII-izing (romanizing), which a tokenizer vocab *could* replace wholesale, but it would be quite the pain to define.

As I said though, it shouldn't be an issue for common languages; training can easily bruteforce inaccurate merges. For non-Latin languages, just don't expect it to be pretty, and validate with the `Utilities` > `Tokenizer` tab (for Italian, it should be fine).
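If you'd rather check from a script than the web UI, something along these lines works for any Hugging Face-style vocab file (the path and sample sentence are placeholders, and this skips tortoise's own text cleaning, so treat it as a rough check):

```python
# Rough programmatic equivalent of the Utilities > Tokenizer tab: load the
# vocab and look at what your text turns into.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("models/tokenizers/my_language.json")
enc = tok.encode("言わんこっちゃない")  # any tricky sentence in your language
print(enc.tokens)  # look for bad romanization, stray [SPACE]s, wrong merges
```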

You can also just do away with the `merges` array entirely and let the GPT model figure out merges. I think my latest training did that, but I'll have to scrap it due to it ASCII-izing/romanizing kanji wrong.
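Dropping the merges amounts to emptying that list in the vocab file; again a sketch assuming the standard `tokenizers` JSON layout, with a placeholder path:

```python
import json

# Sketch: empty the merges list so the GPT model has to learn its own
# groupings of single-character tokens.
path = "models/tokenizers/my_language.json"

with open(path, "r", encoding="utf-8") as f:
    vocab = json.load(f)

vocab["model"]["merges"] = []  # merges live under the "model" block in that layout

with open(path, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
```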

> How much data is needed to obtain a good model for the new language?

> Did you use a single speaker or multi-speaker model?

Hard to give a minimum, as you'll always benefit from more data. I've got something that *works* for repeating anything from the training dataset (and naturally had issues with any Japanese outside of the dataset) from one voice with 10 lines, and I've got a handful of tests from a dataset of 7.5k lines and another with 15.5k lines, both with about ~60 different speakers. But I felt the range wasn't great, though that could just be because it's from a gacha and the lines aren't necessarily delivered normally, or because I didn't reduce the mel LR ratio (which I don't mind, as I needed to finetune for specific voices anyway).

> What training parameters did you use, and how many epochs did it take?

Fuck if I remember specifically; I think my latest shot was the following (a rough back-of-the-envelope check follows the list):

  • 16 epochs: I think the A100-80G cranked them out (16 epochs at ~15k lines = 240,000 samples) in six hours before the Paperspace instance auto-shutoff.
  • LR: 1e-5
  • LR scheme: MultiStepLR
  • text LR ratio: 1
  • mel LR ratio: don't remember specifically, I need to play around with this more in terms of language training
  • LR schedule: default `[2, 4, 9, 18, ...]`
  • batch size: 2048
  • gradient accumulation: 16?
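For a rough sense of scale (a back-of-the-envelope sketch; it assumes the 240,000 figure counts lines seen, i.e. epochs × dataset size, and that "batch size" is the effective batch):

```python
lines = 15_000                          # approximate dataset size quoted above
epochs = 16
batch_size = 2048

samples_seen = epochs * lines           # 240_000 lines processed in total
steps_per_epoch = lines / batch_size    # ~7.3 optimizer steps per epoch
total_steps = epochs * steps_per_epoch  # ~117 steps for the whole run

print(samples_seen, round(steps_per_epoch, 1), round(total_steps))
```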

> which other models, aside from the AR model, should be fine-tuned to achieve better results?

Just the AR needs finetuning.

  • the CLVP/CVVP *could* (and arguably should) be finetuned for specific languages, but as far as I'm concerned, it isn't a priority (or even possible right now, as I don't know of a way to go about it). It didn't seem to pose much of an issue.
  • I still believe you won't ever need to touch the diffusion model, as the mel tokens never get redefined the way we can redefine the text tokens by slotting in a different tokenizer.
  • the vocoder just handles processing a mel spectrogram into an actual waveform, and BigVGAN does a better job than the default vocoder anyhow.

> My Japanese finetunes have varied from unfavorable to decent, but those problems stem from other issues I've since resolved.

Did you convert all the Kanji in your training set to Hiragana beforehand or leave it as-is?

> 16 epochs: I think the A100-80G cranked them out (16 epochs at ~15k lines = 240,000 samples) in six hours before the Paperspace instance auto-shutoff.

How much did that cost?

Owner

> Did you convert all the Kanji in your training set to Hiragana beforehand or leave it as-is?

In my naivety I left it in as-is and had whatever DLAS/tortoise uses handle it (which seems to be unidecode). I probably should have validated more, but naturally, hindsight is 20/20.

I'm experimenting with a preprocessor that will parse a `Japanese string` => `kana` (hiragana/katakana are practically interchangeable), and let the tokenizer vocab handle kana. Some issues:

  • the vocab refuses to load with `X ゃ` merges.
  • I might need to romanize kana, so particles get converted to what they phonetically are (example: は => wa), which I suppose is correct since this is all about phonetics and shouldn't really matter, but I *feel* like there's a bit of a dissonance, for lack of a better term.

The preprocessing will tie to a `model.language` entry (a bare `language` key in the JSON will make the tokenizer throw a hissy), so in theory you can further expand with additional per-language preprocessing.
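A sketch of what such a preprocessing pass could look like, using pykakasi as a stand-in (an assumption on my part, not the repo's actual preprocessor; note that dictionary-based converters still tend to render the topic particle は as "ha", so the romanization/particle pass mentioned above would have to sit on top of this):

```python
# Hypothetical kanji -> kana pass, sketched with pykakasi, run on the text
# before it reaches the tokenizer vocab.
import pykakasi

kks = pykakasi.kakasi()

def to_kana(text: str) -> str:
    """Convert mixed kanji/kana text to hiragana."""
    return "".join(part["hira"] for part in kks.convert(text))

print(to_kana("言わんこっちゃない"))  # expected: いわんこっちゃない
```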

> How much did that cost?

Paperspace has some "free" ones with the $40/month growth tier, but availability is pretty much up to chance (I refuse to pay by the hour). As much as I hate Paperspace, I gotta do what I gotta do for experiments. There just isn't an actual equivalent to Paperspace that I'm aware of.


> In my naivety I left it in as-is and had whatever DLAS/tortoise uses handle it (which seems to be unidecode). I probably should have validated more, but naturally, hindsight is 20/20.

Do the "Prompt Engineering" options still work with the Japanese model, or do you have to substitute something like [僕はとても幸せです、]?

Owner

Do the "Prompt Engineering" options still work with the Japanese model

I don't know, and desu I don't care too much to find out if it does.

From a cursory auditing, and my limited knowledge on how text redaction works ("""prompt engineering""" is a silly name that I don't know why I even called it that), it doesn't care about the tokenizer at all, as that lives within the realm of the AR, while text redaction relies on the wav2vec2 model, which is in it's own bubble of working with non-English.

In other words, try it and find out, because it's an issue left up to how well the default wav2vec2 model handles things.

> Do the "Prompt Engineering" options still work with the Japanese model I don't know, and desu I don't care too much to find out if it does. From a [cursory auditing](https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/wav2vec_alignment.py#L125), and my limited knowledge on how text redaction works ("""prompt engineering""" is a silly name that I don't know why I even called it that), it doesn't care about the tokenizer at all, as that lives within the realm of the AR, while text redaction relies on the wav2vec2 model, which is in it's own bubble of working with non-English. In other words, try it and find out, because it's an issue left up to how well the default wav2vec2 model handles things.
Owner

Trained a new model with the [Japanese tokenizer](https://git.ecker.tech/mrq/ai-voice-cloning/src/branch/master/models/tokenizers/japanese.json), and after ~55 epochs (~825000 samples processed), I have a better Japanese model:

  • https://vocaroo.com/1lZkJSTC6Jju
  • https://vocaroo.com/1bxRmmv0io4i

![image](/attachments/62600bea-5253-4cf5-9d31-1b54e497ea8d)

However, it ***needs*** to be finetuned for a specific voice, its multi-voice capabilities are terrible, and it has a dreadful issue where you need punctuation so it "knows" when a sentence stops, or it'll keep generating extraneous words (example 1).

But I'm happy with it. There don't seem to be any actual lingual quirks, so a replaced tokenizer *can* work, but it seems to need twice the training time compared to training with the default tokenizer (naturally). I'll add it to my HF repo in a moment.


> Trained a new model with the Japanese tokenizer, and after ~55 epochs (~825000 samples processed), I have a better Japanese model:

Was it with VALL-E or DLAS?

What effect does `"language": "ja",` in the japanese.json have?

Owner

> Was it with VALL-E or DLAS?

DLAS.

> What effect does `"language": "ja",` in the japanese.json have?

Hints to both DLAS's [tokenizer](https://git.ecker.tech/mrq/DL-Art-School/src/branch/master/codes/data/audio/voice_tokenizer.py#L70) and tortoise's [tokenizer](https://git.ecker.tech/mrq/tortoise-tts/src/branch/main/tortoise/utils/tokenizer.py#L188) to properly process the text for the tokenizer, instead of relying on the default pathway that ASCII-fies it by poorly romanizing the text.
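Illustration only (check the actual japanese.json for the real layout): per the earlier comment, the hint sits under the tokenizer's `model` block, since a bare top-level `language` key reportedly makes the tokenizer error out. As a stub:

```python
# Not the real japanese.json, just a stub showing where the language hint
# is said to live.
tokenizer_stub = {
    "model": {
        "language": "ja",  # tells DLAS/tortoise to take the Japanese text path
        "vocab": {"[STOP]": 0, "[UNK]": 1, "[SPACE]": 2},  # ...truncated
        # "merges": [...],
    },
}
```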


> the default pathway that ASCII-fies it

Well, that explains why all my attempts at IPA tokenization failed.

Author

Thank you very much for the extended response. I've tried training with an Italian dataset, with the Text LR Ratio set to 1 and no change to the tokenizer.
The results are decent, but unfortunately there is a very strange accent to it.


I have finetuned a model for Cantonese, a tonal language like Mandarin. I also tried to mix it with English.

  1. Cantonese
    https://soundcloud.com/keith-hon/0a-2?si=1ac3624677c041de964ae85ced07d783&utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing

  2. Cantonese + English
    https://soundcloud.com/keith-hon/cantonese-english-mixed-tts-with-voice-clone

which sounds decent in my opinion. However, I will need to train a custom CLVP model to pick the best outputs from the diffusion model.

I plan to mix with other Asian languages as well.


I'd love a Spanish model... I couldn't find anything about this yet, only a little in Bark.


> Trained a new model with the Japanese tokenizer, and after ~55 epochs (~825000 samples processed), I have a better Japanese model:
>
> ![image](/attachments/62600bea-5253-4cf5-9d31-1b54e497ea8d)
>
> However, it ***needs*** to be finetuned for a specific voice, its multi-voice capabilities are terrible, and it has a dreadful issue where you need punctuation so it "knows" when a sentence stops, or it'll keep generating extraneous words (example 1).
>
> But I'm happy with it. There don't seem to be any actual lingual quirks, so a replaced tokenizer *can* work, but it seems to need twice the training time compared to training with the default tokenizer (naturally). I'll add it to my HF repo in a moment.

@mrq hello, may I ask: are you sure those 825000 samples weren't actually just 82500 samples? I'm training mine on 500k samples on an RTX 3090 and one epoch takes about 15 hours, so 800k samples would take more than 24 hours per epoch, and 55 epochs would take more than 2 months on any average consumer GPU...

Otherwise I agree that it's impossible to get multi-voice capabilities and one needs to finetune for a specific voice...

![training_and_validation_metrics_spec](/attachments/c3409f99-74cc-4938-a336-81b6b88bd901)

I am also including my training metrics; maybe it helps someone trying to finetune the model on a new language. The results are surprisingly good.
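For what it's worth, a quick consistency check of those numbers (assuming "samples processed" counts dataset lines × epochs, as the ~15k-line dataset mentioned earlier would suggest):

```python
# 825000 samples over 55 epochs works out to ~15k lines per epoch, i.e. the
# figure is cumulative across epochs rather than a per-epoch count.
samples_processed = 825_000
epochs = 55

print(samples_processed / epochs)  # 15000.0
```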
