Mispronouncing certain letters for Slavic languages #133

Open
opened 2023-03-14 08:57:33 +00:00 by nk990 · 46 comments

I did finetune on Slovenian and it has decent quality. It does miss emotions, but I think that's normal because of the small dataset. What does bother me is the mispronunciation of certain letters: Č needs to be pronounced as tʃ, not k, and Ž (ʒ), not z.
Should I also finetune the vocoder, or is it maybe the text cleaner that's breaking things?

Owner

Did you happen to train with the default Text LR Ratio?

For new languages, you'll want to increase the text LR ratio to 1, as you're effectively re-teaching the model a new language (or specifically, a new sequence of phonemes to expect).

There's a possibility the VQVAE needs adjustment too, but I'm not too sure if that's necessary as I've had decent luck with just increasing the text LR ratio for Japanese.

Author

Nope, I used the default config settings. Gonna try increasing the text LR ratio. Can you also share the config file you've been using for Japanese, just to compare parameters?


For new languages, you'll want to increase the text LR ratio to 1, as you're effectively re-teaching the model a new language (or specifically, a new sequence of phonemes to expect).

Does this also apply to training a voice sample with a heavy accent, or should I leave it at the default as long as it's some form of English?

Owner

Nothing special, just max the LR sliders. The finer LR schedule will make it bake faster but crank the LR down to a good rate by epoch 9:

  • LR: 0.0001
  • Mel LR ratio: 1.0
  • Text LR ratio: 1.0
  • MultiStepLR, default schedule ([2, 4, 9, 18, 25, 33, 50]); see the illustration below
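
To make the schedule above concrete, here's a small PyTorch illustration of how MultiStepLR decays the LR at those milestones; the decay factor (gamma) and the optimizer here are placeholders, not necessarily what the training backend uses:

```
import torch

# Toy illustration of the milestone schedule above; gamma=0.5 is an assumption.
params = [torch.zeros(1, requires_grad=True)]
opt = torch.optim.AdamW(params, lr=1e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[2, 4, 9, 18, 25, 33, 50], gamma=0.5)

for epoch in range(12):
    # ... a full pass over the dataset would happen here ...
    opt.step()
    sched.step()
    print(f"epoch {epoch + 1}: lr = {sched.get_last_lr()[0]:.2e}")
```

By epoch 9 the LR has already been cut at the 2, 4, and 9 milestones, which is the "crank the LR down to a good rate by epoch 9" behavior described above.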

The only issue, though, is that I've done some different batches over the past few days, and I haven't gotten to check my latest batch yet. Something like:

  • batch one didn't trim clips that exceeded 11.6s (dataset size of ~8k, for ~15 epochs)
  • batch two continued from batch one, but with long clips trimmed (dataset size of 15k, for another ~15 epochs), might have adjusted LR ratios, and hit the problem where resuming makes the loss jump down sharply immediately on resume
  • batch three finetuned from batch two on a specific voice and sounded decent (just wanted to test finetuning a finetuned language)
  • batch four is a clean run, but the mel ratio was set to 0.25 (since the other batches latched too heavily onto my dataset)

Just crank the three sliders to max, and set your epochs / batch size / gradient accumulation size as needed.

  • batch one didn't trim clips that exceeded 11.6s (dataset size of ~8k, for ~15 epochs)

Hold up, by "dataset size of ~8k" do you mean train.txt was ~8kb or ~8k clips?

Owner

Does this also apply to training a voice sample with a heavy accent, or should I leave it at the default as long as it's some form of English?

mmm, if it's strictly accented English, you shouldn't need to up the text LR ratio. As backwards as it sounds, "teaching" how phonemes should sound is the way to go about training an accent.

Now, if that dialect does show through in the text (example: oi guv), then yeah, you'll need to up the text LR, as it's effectively a different language (phoneme sequence).

Although this is just conjecture based on my current understanding of the AR model, which seems to change on a dime randomly every week.

Owner

Hold up, by "dataset size of ~8k" do you mean train.txt was ~8kb or ~8k clips?

8k lines, one clip per line, 8k clips. I don't have a total audio metric, but providing that metric is meaningless desu, since I believe each sequence is a fixed mel token length. Providing more or less audio cumulatively won't impact it as much.

I don't think a large dataset size is all that imperative for teaching it a new language/dialect, but I needed something large as my very first cursory test on one voice with a low dataset size had some missing features of Japanese, due to there not being anything for the model to "learn".


8k lines, one clip per line, 8k clips.

お兄ちゃん、大きすぎる~!

What settings are you using in "Prepare Dataset" so that you don't have to check and fix each clip manually? How big is your validation.txt?

Owner

What settings are you using in "Prepare Dataset" so that you don't have to check and fix each clip manually?

Trim silence, text cull length 4, audio cull length 1 second. Nothing major, just leveraging the implicit slicing for lines that run too long (and the segments are decent, at least).

How big is your validation.txt?

For the 8k one, not sure, as I don't have it saved; it was from before the implicit slicing of large lines.

The 15k one clocks in at 898 lines for validation.


Trim silence, text cull length 4, audio cull length 1 second.

Just to be explicit: "text cull length" is Validation Text Length Threshold, and "audio cull length" is Validation Audio Length Threshold, correct? Nothing for the offsets?

Owner

Oh, actually, you could switch out the tokenizer for niche languages. I was doing a glance-over at how tortoise [tokenizes text](https://git.ecker.tech/mrq/tortoise-tts/src/branch/main/tortoise/utils/tokenizer.py) (since I was theorizing if providing your own phonemes for training/inference would be beneficial at all), and it's just a bog standard HF tokenizer with a [config](https://git.ecker.tech/mrq/ai-voice-cloning/src/branch/master/models/tortoise/bpe_lowercase_asr_256.json).

I'm not too sure of the implications though. They're not true phonemes, rather virtual-ones (in the loosest sense). In theory, if you were able to magically source a better tuned tokenizer.json for a language, you just need to provide it during training and inferencing, but you'd have to make extra sure you increase the text LR ratio up to 1, as you're completely re-writing what tokens mean what, rather than "teaching" it a new token sequence (language).

That said, I'm not too sure how necessary this is, as I got decent results from the standard tokens with Japanese. It could very well be acceptable to go about it from the backwards direction of rewriting what each "phoneme" (text token) sounds like (map to which mel tokens).

I might just have the training configuration template point to tortoise's tokenizer.json (since they're the same, I didn't actually need to provide it in this repo), so a user can replace that one file under ./modules/tortoise-tts/data/tokenizers.json if they want, rather than have a niche option in the training configuration generator.

As for a tool to make one, that's unfortunately left as an exercise to the end user.

Although, it might be fairly simple to correct it by hand; replace a similar sound in English with the characters of what it corresponds to, and maybe add some sound merges in the merge array.
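
For reference, a quick way to see what the stock tokenizer actually does to a given line; a minimal sketch, assuming the linked bpe_lowercase_asr_256.json is a standard Hugging Face `tokenizers` file (the path and the sample text are illustrative):

```
from tokenizers import Tokenizer

# Load the stock vocab/merges and inspect how a line gets split into "phonemes".
tok = Tokenizer.from_file("models/tortoise/bpe_lowercase_asr_256.json")
enc = tok.encode("počasi že prihaja čas")  # arbitrary sample text
# Characters with no vocab entry either map to the unknown token or get dropped,
# depending on how the tokenizer JSON is configured.
print(enc.tokens)
print(enc.ids)
```

Running the same check against a replacement tokenizer.json is a cheap way to confirm the merges behave the way you expect before committing to a training run.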


"text cull length" is Validation Text Length Threshold, and "audio cull length" is Validation Audio Length Threshold, correct?

Correct. I can never remember what I actually call the settings.

Nothing for the offsets?

Not necessary, unless you're consistently getting audio trimmed either too much or not enough on either end, and you can blanket correct it by providing slice time offsets.


I'm not too sure of the implications though. They're not true phonemes, rather virtual-ones (in the loosest sense). In theory, if you were able to magically source a better tuned tokenizer.json for a language, you just need to provide it during training and inferencing, but you'd have to make extra sure you increase the text LR ratio up to 1, as you're completely re-writing what tokens mean what, rather than "teaching" it a new token sequence (language).

There are a number of IPA tokenizers available but the fact that none of the TTS solutions I'm familiar with use them makes me think they either wouldn't provide as big a quality improvement as I think they would or that there's some kind of performance hit involved with the additional codespace. Or maybe I'm the only weirdo who wants to do generation with that level of granularity~

Owner

I caved and added a way to override the tokenizer JSON under Settings, because I realized it actually does affect Japanese (at least, from seeing it merge the "phonemes"). The overridden tokenizer JSON also gets used in place for the training configuration.

I don't think it's necessary for me to retrain (again, for the unknownth time) my Japanese finetune, as I probably bruteforced my way through token merging.


There are a number of IPA tokenizers available but the fact that none of the TTS solutions I'm familiar with use them

Both of the VALL-E implementations use them, at least, the [shitty bloated one](https://github.com/lifeiteng/vall-e):

  • But Stanley didn't want to go back to the office.
  • bʌt_stænli_dɪdnt_wɔnt_tə_ɡoʊ_bæk_tə_ðɪ_ɑːfɪs.

The [clean one](https://github.com/enhuiz/vall-e) doesn't literally use IPA phonemes, but rather what seems to be an ASCII representation:

  • I'm kind of lost. I'm looking for Silent Hill.
  • AY1 _ AE1 M _ K AY1 N D _ AH1 V _ L AO1 S T _ AY1 _ AE1 M _ L UH1 K IH0 NG _ F AO1 R _ S AY1 L AH0 N T _ HH IH1 L

So just given that alone, I'm sure VALL-E will handle those sorts of things much better than TorToiSe.

  • batch one didn't trim clips that exceeded 11.6s (dataset size of ~8k, for ~15 epochs)

Only 15 epochs? Is this a typo? I've been doing 200-1500 for most of my training, and that's just for English voices lol

I know with 8k clips that's probably a lot of sets per epoch, it just seemed like a very low number! I thought the training needed to iterate the same files hundreds of times in order to learn anything

Owner

desu [the first finetune test](https://github.com/152334H/DL-Art-School#results) has a much smaller size (dataset size of 4.5k for 11 epochs). Granted, all of the hyperparameters play a role in how decent something is trained (batch size and gradient accumulation factor quite a bit, even single vs multi-GPU seems to affect things too), so there's no good simple answer on it outside of how nice of a loss curve you get.

It's conjecture, but in my experience I feel a language benefits more from a varied dataset than from many iterations, whereas finetuning for speech (a voice) is more about iterations than a varied dataset (although a varied dataset does help; I've gotten decent results even with a small dataset from just bruteforcing it to a very low loss rate).


desu the first finetune test has a much smaller size (dataset size of 4.5k for 11 epochs). Granted, all of the hyperparameters play a role in how decent something is trained (batch size and gradient accumulation factor quite a bit, even single vs multi-GPU seems to affect things too), so there's no good simple answer on it outside of how nice of a loss curve you get.

It's conjecture, but in my experience I feel a language benefits more from a varied dataset than from many iterations, whereas finetuning for speech (a voice) is more about iterations than a varied dataset (although a varied dataset does help; I've gotten decent results even with a small dataset from just bruteforcing it to a very low loss rate).

Ahh interesting. Yeh I've toyed with learning rates even lower than the 0.00001 default with very small datasets.

Still gradually figuring out what a good learning curve might look like!


I caved and added a way to override the tokenizer JSON under Settings, because I realized it actually does affect Japanese (at least, from seeing it merge the "phonemes"). The overrided tokenizer JSON also gets used in place for the training configuration.

At last, I can train it to speak Ubykh (or at least pronounce გვფრცქვნი)!

The clean one doesn't literally use IPA phonemes, but rather what seems to be an ASCII representation:

  • I'm kind of lost. I'm looking for Silent Hill.
  • AY1 _ AE1 M _ K AY1 N D _ AH1 V _ L AO1 S T _ AY1 _ AE1 M _ L UH1 K IH0 NG _ F AO1 R _ S AY1 L AH0 N T _ HH IH1 L

That's [ARPABET](https://en.wikipedia.org/wiki/ARPABET); it only covers the phonemes of American English.
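
For anyone curious, that style of ARPABET output is easy to reproduce with the g2p_en package (one common grapheme-to-phoneme tool for American English; not necessarily the exact one that implementation uses):

```
from g2p_en import G2p

g2p = G2p()
# Returns a list of ARPABET symbols with stress digits, plus spaces/punctuation.
print(g2p("I'm kind of lost. I'm looking for Silent Hill."))
```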

Author

At last, I can train it to speak Ubykh (or at least pronounce გვფრცქვნი)!

@psammites Can you also share your experience with us? Like how many clips do you have in your dataset, how many epochs you set for that dataset, did you change other params, and did you override the tokenizer?

@mrq I set params like you suggested:
LR: 0.0001
Mel LR ratio: 1.0
Text LR ratio: 1.0
MultiStepLR, default schedule ([2, 4, 9, 18, 25, 33, 50])

My dataset has 383 clips, about 47 min of audio, and even with that small dataset the quality is very good.

Last night I tried training 10 epochs, then 20, 50, 100; the quality is always good, no big differences, but the mispronunciation of č, ž still persists...

I'll try to override the tokenizer and run training again. I'll let you know when it's done!


At last, I can train it to speak Ubykh (or at least pronounce გვფრცქვნი)!

@psammites Can you also share your experience with us? Like how many clips do you have in your dataset, how many epochs you set for that dataset, did you change other params, and did you override the tokenizer?

I've been using comparatively tiny datasets (16, 32, 48 clips) because of the time involved in reviewing the .wav's and proofreading the transcriptions (no amount of mucking with the split offsets has produced acceptable automated results). So far I've limited my testing to English with strong regional and ESL accents:

  • [North Korean ESL Speaker](https://www.youtube.com/watch?v=C0kWjEYMAfc): 16 max length lines in batches of 16. Was crap at 2500 epochs, is now pretty good after 5000 epochs. Validation disabled.

  • [Northern Indian Native(?) Speaker](https://www.youtube.com/watch?v=MNbXF7r8V5Y): This was the first model I trained on an earlier version before segmentation or validation were implemented. Fairly good results after 500 iterations with ~48 lines (I don't remember exactly, I've deleted and reinstalled several times since then and kept the model but not the logs). I spent hours with ffmpeg filters trying to lose the traffic honking in the background and I don't think it made a difference.

  • [Southern US (Alabama) Native Speaker](https://www.youtube.com/watch?v=_YLegTD3YE4): I think 32 clips for this one, 500 epochs? Good results, I hardly had to touch the transcription for this one. The second model I trained, also before splitting and validation were added.

  • [Gulf Arabic ESL Speaker](https://www.youtube.com/watch?v=_zncB6hngZg): Almost done baking this one, 32 lines in batches of 16, 500 epochs, 32 lines validation. Just hit epoch 450 (iteration 900) after about 50 minutes of training.

  • [Vietnamese ESL(?) Speaker](https://www.youtube.com/watch?v=HXcMIDoAulQ): Terrible results no matter what I did, dataset size, etc. The only thing I didn't try was ditching validation. On the backburner for now because I suspect the default tokenizer isn't equipped to handle tonal languages or the resulting accents.

  • [Northern Chinese ESL Speaker](https://www.youtube.com/watch?v=KM9vvGReycU): Another one with horrible results no matter what I did. See tonal accent theory above.

(I've been using YouTube videos as training data because I can pull the subtitles down as JSON with yt-dlp and use jq to mung them into segment lists for chopping with ffmpeg. They're millisecond-granular, so superior to what whisper produces. The transcription quality is sometimes better too.)
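
A rough Python sketch of that munging step, for anyone who'd rather skip jq; the field names assume yt-dlp's json3 subtitle format (events / tStartMs / dDurationMs / segs), and the file names are made up:

```
import json

# Turn a yt-dlp json3 subtitle file into (start, duration, text) segments
# that can then be fed to ffmpeg for slicing.
with open("video.en.json3", encoding="utf-8") as f:
    subs = json.load(f)

segments = []
for event in subs.get("events", []):
    text = "".join(seg.get("utf8", "") for seg in event.get("segs", [])).strip()
    if not text or "dDurationMs" not in event:
        continue
    start = event["tStartMs"] / 1000.0
    duration = event["dDurationMs"] / 1000.0
    segments.append((start, duration, text))

for i, (start, duration, text) in enumerate(segments):
    # e.g. ffmpeg -ss {start} -t {duration} -i video.wav clip_{i:04d}.wav
    print(f"{start:.3f}\t{duration:.3f}\t{text}")
```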

I left all the learning rates at the defaults and used MultiStepLR. I started disabling validation because I got better results before it was introduced, and with it my models sounded worse and worse the more I trained them. (Also because I don't have the time to manually validate a second dataset.) No apparent quality difference between models trained on a single GPU and those trained on multiple GPUs.

My first non-English test is likely to be Malaysian because the phonemic inventory is a subset of the English one. If the results are acceptable I'll move on to standard Indonesian (baby steps!) and see if I can get tortoise-tts to roll its R's. Provided that all works I'll do some actual tokenizer modification and add ñ to see if we can get a distinct /ɲ/ when training on Spanish.

For your Slovenian model, have you tried constructing a stripped-down dataset with a high number of č/ž minimal pairs? ~~You might also try transcribing them as *ch* and *zh*, not that I can articulate a reason why other than "Unicode is sometimes weird".~~ I was on the right track but the wrong train. See edit below.

Edit: Gulf Arabic ESL accent model is accurate but a little metallic sounding on faster presets. Going to throw it back in the oven for another 500 and then bake an Egyptian Arabic ESL accent model with the same settings for comparison.

Post-nap edit: Replace [č] with [q], [ž] with [x], crank the Text LR Ratio to the max and let her rip. Should work without replacing the tokenizer or diffusion model.

Post-coffee edit: If you still need to replace the tokenizer you could yamlize [this](https://bpemb.h-its.org/sl/sl.wiki.bpe.vs1000.vocab) and swap it into bpe_lowercase_asr_256.json's vocab section, but you'd still have to find or train a matching diffusion model. (I didn't have any luck turning one up but then again I don't speak Slovenian.)

nk990 changed title from Mispronouncing certain letters for Slovenian language to Mispronouncing certain letters for slavic languages 2023-03-15 13:20:59 +00:00
Author

Post-nap, post-coffee edit: Replace [č] with [q], [ž] with [x], crank up the Text LR Ratio to the max and let her rip. Should work without replacing the tokenizer or diffusion model.

this can be a workaround for Slovenian, but if you have Croatian, where you have:

  • ć /tɕ/
  • č /tʃ/
  • d /d/
  • đ /dʑ/
  • dž /dʒ/
  • ž /ʒ/

and some other particular cases... how to make a workaround then?


and some other particular cases... how to make a workaround then?

That's about pushing the limits of what you can do without replacing the tokenizer:

[ć] /tɕ/  -> [w]
[č] /tʃ/  -> [ch]
[d] /d/   -> [q]
[đ] /dʑ/  -> [x]
[dž] /dʒ/ -> [dzh]
[ž] /ʒ/   -> [zh]

YMMV no warranty expressed or implied fasten your seatbelt cross your fingers wear a helmet and click "Train".
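
A minimal sketch of applying a substitution table like the one above to a dataset before training; the table is the one from this comment, while the paths and the path|transcript layout of train.txt are assumptions:

```
import re

# Longest keys go first in the pattern so "dž" wins over "d" and "ž".
SUBS = {"dž": "dzh", "ć": "w", "č": "ch", "đ": "x", "ž": "zh", "d": "q"}
pattern = re.compile("|".join(sorted(map(re.escape, SUBS), key=len, reverse=True)))

def remap(text: str) -> str:
    return pattern.sub(lambda m: SUBS[m.group(0)], text)

with open("training/MyVoice/train.txt", encoding="utf-8") as fin, \
     open("training/MyVoice/train_remapped.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        path, sep, transcript = line.partition("|")
        fout.write(path + sep + remap(transcript) if sep else remap(line))
```

The same remapping has to be applied to any text fed in at inference time, otherwise the model will never see those substitute spellings again.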

Edit: @mrq mentioned VALL-E above. There's an academic paper out on a multi-lingual version called "VALL-E X" but AFAIK they haven't released the code yet.

Owner

I started disabling validation because I got better results before it was introduced

Validation has no effect on training quality (it sometimes will eat up your total iteration count and cause training to terminate early, but that's more of a bug).

There's an academic paper out on a multi-lingual version called "VALL-E X" but AFAIK they haven't released the code yet.

VALL-E X is more for a model that outputs multi-lingual speech. VALL-E already (should) support training non-English models.


I started disabling validation because I got better results before it was introduced

Validation has no effect on training quality (it sometimes will eat up your total iteration count and cause training to terminate early, but that's more of a bug).

Even if your validation.txt contains bad transcriptions?

VALL-E already (should) support training non-English models.

[This doesn't look like anywhere near enough coverage](https://github.com/lifeiteng/vall-e/blob/main/egs/libritts/data/tokenized/unique_text_tokens.k2symbols) to do true multi-lingual speech. Unless they're getting it somewhere else?

Owner

Even if your validation.txt contains bad transcriptions?

Correct. It's just a way to get a de facto metric to see how well a model is when handling outside data. Nothing from the validation pass gets used for training.

This doesn't look like anywhere near enough coverage to do true multi-lingual speech. Unless they're getting it somewhere else?

That's generated from the LibriTTS dataset specifically. Any other dataset will have a different list of unique tokens.


This doesn't look like anywhere near enough coverage to do true multi-lingual speech. Unless they're getting it somewhere else?

That's generated from the LibriTTS dataset specifically. Any other dataset will have a different list of unique tokens.

As I understand it if I provide my own dataset then I have to provide my own inference model trained on it, which would remove the impetus to migrate to VALL-E. (I'm less than impressed by their demo page: they've cherrypicked so that "Speaker Prompt" and "Ground Truth" are always from the same accent and sex, which makes me think cross-sex and cross-dialect output are terrible. If they could do something more challenging like making a banana-bender sound like a newfie I think they'd be promoting it.)

Owner

As I understand it if I provide my own dataset then I have to provide my own inference model trained on it

No shit. You already need to provide your own model anyways, as there's no publicly released one, hence the statement on muh ethics at the bottom of the demo page.

I'm less than impressed by their demo page

desu TorToiSe's demo page was also very lackluster; they're moreso demonstrations with zero-shot inferencing (and even then, a little love and elbow grease makes TorToiSe's zero-shot inferencing with the base model actually decent without finetunes).

Anyways, VALL-E specifically is outside of the scope of discussion despite my not-so-hidden commits about integrating it. I shouldn't be discussing it until I get something trained, and I don't have the time to babysit training during the week.


Anyways, I mentioned the VALL-E implementations using phonemes instead of the rather archaic tokenization of common English elements (not TorToiSe's fault specifically; GPT transformers are notorious for tokenizing like that), so I could just modify the tokenization process entirely:

  • run the transcribed text through a phonemizer (the [good VALL-E implementation](https://github.com/enhuiz/vall-e/blob/main/vall_e/emb/g2p.py) uses [G2p](https://github.com/roedoejet/g2p))
  • output the phonemized text into the train.txt / validation.txt files
  • provide a vocab that instead will tokenize on IPAs

Although, you might need to supply per-language merging of phonemes, and I don't know specifically how well a model will perform without merged tokens (I did a quick test, but I didn't spend much time on testing it).

I'm sure I'll catch a wild hair and implement it, as it's something that sounds solid in principle. I just don't know when I'll get to it.
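
A rough sketch of what the first two bullets could look like with phonemizer; the language code, the paths, and the path|transcript layout are assumptions, and the real integration would live in the dataset preparation step:

```
from phonemizer import phonemize

# Phonemize the transcript column of a dataset file into IPA via the espeak backend.
with open("training/MyVoice/train.txt", encoding="utf-8") as fin, \
     open("training/MyVoice/train_ipa.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        path, sep, transcript = line.rstrip("\n").partition("|")
        if not sep:
            continue  # skip anything that isn't a path|transcript pair
        ipa = phonemize(transcript, language="sl", backend="espeak", strip=True)
        fout.write(f"{path}|{ipa}\n")
```

The third bullet (an IPA-aware vocab) would still be needed so the tokenizer doesn't shred the phonemized text back into English-ish character tokens.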


As I understand it if I provide my own dataset then I have to provide my own inference model trained on it

No shit. You already need to provide your own model anyways, as there's no publicly released one, hence the statement on muh ethics at the bottom of the demo page.

Yea, so if I'm going to have to train a model I might as well train one for the software that I already know can do what I want; Fuck training for a whole week just to make a middle-aged British dude sound like a slightly different middle-aged British dude inside a tin can. Plus they're built against an older version of CUDA Toolkit and I burnt way more than enough time wrangling with CUDA incompatibilities doing the WSL setup.

desu TorToiSe's demo page was also very lackluster; they're moreso demonstrations with zero-shot inferencing (and even then, a little love and elbow grease makes TorToiSe's zero-shot inferencing with the base model actually decent without finetunes).

You know what would make it even better? If tts.get_conditioning_latents() could use multiple GPUs...

I could probably just leverage phonemizer by itself, but it depends on espeak-ng (which was CBT to get working under Windows).

  1. Install Windows eSpeak NG .msi from [the repo](https://github.com/espeak-ng/espeak-ng/releases)
  2. pip install git+https://github.com/bootphon/phonemizer
  3. $env:PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll' (or wherever)

```
PS C:\Users\sneed> echo "Tandanya adalah lelucon halus..." | phonemize.exe -l id
tandaɲa adalah ləlutʃon halus
PS C:\Users\sneed> echo "Il segno è uno scherzo sottile..." | phonemize.exe -l it
il seɲɲo uno skertso sotːile
PS C:\Users\sneed> echo "Das Zeichen ist ein subtiler Witz..." | phonemize.exe -l de
das tsaɪçən ɪst aɪn zʊptiːlɜ vɪts
```
Owner

You know what would make it even better? If tts.get_conditioning_latents() could use multiple GPU's...

A nightmare to think about implementing, and beyond the scope of the matter.

$env:PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll'

I both used set and manually declared the env var in whatever Windows sand it didn't like that. I don't remember what got it to work, but it was unintuitive enough I'm not going to leverage it.


There's [allosaurus](https://github.com/xinjli/allosaurus), which I'll just leverage:

  • fairly language agnostic.
    • base g2p doesn't support English, limited language support, and g2p-en uses ARPABET, and I'd rather not reparse it into IPAs
  • parses the audio itself rather than text
    • could outright remove the need for Whisper to transcribe.
    • ironically, WhisperX does things phoneme-based, but fuck that.

My only issue, though, is that I would still need a text phonemizer for inferencing, defeating the whole entire purpose of using an audio-based one.
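
For reference, using allosaurus from Python (per its README) is about this much; the wav path is made up:

```
from allosaurus.app import read_recognizer

# Loads the default universal acoustic model and emits an IPA phone string for the clip.
model = read_recognizer()
print(model.recognize("voices/MyVoice/clip_0001.wav"))
```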


$env:PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll'

I both used set and manually declared the env var in whatever Windows sand it didn't like that. I don't remember what got it to work, but it was unintuitive enough I'm not going to leverage it.

If you did it via the Edit Environment Variables control panel, you need to close your PowerShell session and open a new one for the change to apply. Where I think people get tripped up is that the eSpeak NG documentation gives the impression that the thing to do is set ESPEAK_DATA_PATH to "C:\Program Files\eSpeak NG\espeak-ng-data", but that's for running espeak-ng.exe itself, which isn't what we want. Phonemizer just wants the path to the .dll that came in the package.

Edit: To be clear, for anyone else who might be having trouble installing on Windows: you need to put the full path, including the file name of the .dll, in PHONEMIZER_ESPEAK_LIBRARY. Ignore ESPEAK_DATA_PATH; Phonemizer doesn't care about it.
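
One way to sidestep the shell entirely is to set the variable from Python before phonemizer spins up its espeak backend (the DLL path below is the default install location; adjust it to wherever yours landed):

```
import os

# Full path to the .dll, not the espeak-ng-data directory.
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"

from phonemizer import phonemize

print(phonemize("Tandanya adalah lelucon halus...", language="id", backend="espeak"))
```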

There's allosaurus that I'll just leverage:

  • fairly language agnostic.
    • base g2p doesn't support English, limited language support, and g2p-en uses ARPABET, and I'd rather not reparse it into IPAs
  • parses the audio itself rather than text
    • could outright remove the need for Whisper to transcribe.
    • ironically, WhisperX does things phoneme-based, but fuck that.

My only issue, though, is that I would still need a text phonemizer for inferencing, defeating the whole entire purpose of using an audio-based one.

Another issue is that tortoise wants floating point sampled wav files but allosaurus chokes unless they're integer sampled:

```
sneed@FMRLYCHKS:/mnt/d/ai-voice-cloning/voices/JiHyunMinimal$ python -m allosaurus.run -i Jihyun_00019.wav
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/mconroy/.local/lib/python3.10/site-packages/allosaurus/run.py", line 71, in <module>
    phones = recognizer.recognize(args.input, args.lang, args.topk, args.emit, args.timestamp)
  File "/home/mconroy/.local/lib/python3.10/site-packages/allosaurus/app.py", line 68, in recognize
    audio = read_audio(filename)
  File "/home/mconroy/.local/lib/python3.10/site-packages/allosaurus/audio.py", line 17, in read_audio
    wf = wave.open(filename)
  File "/usr/lib/python3.10/wave.py", line 509, in open
    return Wave_read(f)
  File "/usr/lib/python3.10/wave.py", line 163, in __init__
    self.initfp(f)
  File "/usr/lib/python3.10/wave.py", line 143, in initfp
    self._read_fmt_chunk(chunk)
  File "/usr/lib/python3.10/wave.py", line 268, in _read_fmt_chunk
    raise Error('unknown format: %r' % (wFormatTag,))
wave.Error: unknown format: 3
sneed@FMRLYCHKS:/mnt/d/ai-voice-cloning/voices/JiHyunMinimal$ ffprobe Jihyun_00019.wav
ffprobe version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2007-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, wav, from 'Jihyun_00019.wav':
  Duration: 00:00:09.70, bitrate: 705 kb/s
  Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, 1 channels, flt, 705 kb/s
sneed@FMRLYCHKS:/mnt/d/ai-voice-cloning/voices/JiHyunMinimal$
```

Owner
blah

I did everything under the sun until I copied the DLL to the current working directory and it Just Worked™.
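(For reference, the env-var route discussed above can also be done from inside Python before Phonemizer loads the backend; a minimal sketch, assuming the default eSpeak NG install path:)

```
import os

# PHONEMIZER_ESPEAK_LIBRARY wants the full path to the DLL itself,
# not ESPEAK_DATA_PATH. The path below is just the default install location.
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"

from phonemizer import phonemize
print(phonemize("hello world", language="en-us", backend="espeak"))
```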

Anyways, the entire experience soured me enough that I'll refuse to leverage eSpeak.

![image](/attachments/99000328-8e9f-49ff-9e87-27ea1a781382)

Another issue is that tortoise wants floating point sampled wav files but allosaurus chokes unless they're integer sampled:

Already resolved with stuff like `torchaudio.save(f"{indir}/audio/{basename}", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)`, which I've been intending to do anyways to reduce file size for training (precision doesn't matter when they're stuck at 22.05 kHz).
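A minimal runnable sketch of that conversion (file names are placeholders):

```
import torchaudio

# Re-encode a float32 wav as 16-bit signed PCM so integer-only readers
# (like allosaurus' wave-based loader above) can open it. Paths are placeholders.
waveform, sample_rate = torchaudio.load("Jihyun_00019.wav")
torchaudio.save("Jihyun_00019_pcm16.wav", waveform, sample_rate,
                encoding="PCM_S", bits_per_sample=16)
```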


Nevermind, Allosaurus isn't accurate enough. Phonemizer with eSpeak magically works too, so I don't know what broke.


Japanese with Phonemizer on Windows is impossible, and even then I'd have to parse it as romaji on Linux, so out with Phonemizer + eSpeak.

Owner

I did a quick train on James Sunderland again, but with the training dataset text as IPA phonemes; it works, to say the least:

  • the default tokenizer will just convert IPA to its English equivalent, which sometimes works out.
  • without a tailored tokenizer vocab for IPAs, we're back to square one of non-English pronunciations being off.
    • even using one with all the phonemes outputted, the tokenizer process will still turn ɹ and ə into unknown tokens, and it's literally no different from just using the default tokenizer vocab without merges.
  • sometimes passing in IPAs to the text prompt will give me something decent, other times it will not work at all, and the other other times it'll just speak fast (almost like 11.AI's pacing)
  • normal English works fine, but it's more likely to have 11.AI-esque pacing problems.

I'm starting to feel it's a bit of a silly endeavor to go on.


Japanese with Phonemizer on Windows is impossible, and even then I'd have to parse it as romaji on Linux, so out with Phonemizer + eSpeak.

![Figure 1](https://files.catbox.moe/l53w4c.png)

You're likely running into two problems, the first is, well...
![Figure 2](https://files.catbox.moe/u80v06.png)

And the second is [this](https://github.com/bootphon/phonemizer/issues/146), thankfully, a one-line fix.

And if you don't want to mess with patching Phonemizer (or using another version), then eSpeak can do it with some coaxing:

![Figure 3](https://files.catbox.moe/cresgr.png)

Allegedly the UTF-8 handling has been fixed in the new cross-platform version of PowerShell, but I haven't tried it yet.

I'm starting to feel it's a bit of a silly endeavor to go on.

With the default tokenizer, yea. It just strips out all the accents. That's part of the reason @nk990 is running into issues: to it, there's no difference between [č] and [c] (or [ć], or [ĉ], or [ç], &c...)
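A quick illustration of what that stripping does (assuming the cleaner ASCII-fies text with something like unidecode, as the tokenizer discussion below suggests):

```
from unidecode import unidecode

# Accented Slavic letters all collapse onto their bare ASCII counterparts,
# so the tokenizer can no longer tell them apart.
print(unidecode("č ć ĉ ç ž š"))  # -> "c c c c z s"
```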

Author

With the default tokenizer, yea. It just strips out all the accents. That's part of the reason @nk990 is running into issues, to it there's no difference between [č] and [c] (or [ć], or [ĉ], or [ç], &c...)

@psammites In fact, it was very strange, though... I've tried almost everything, but with no success...

Owner

With the default tokenizer, yea. It just strips out all the accents.

That's because DLAS/tortoise's VoiceBpeTokenizer will preprocess all text:

```
text = convert_to_ascii(text)
text = lowercase(text)
text = expand_numbers(text)
text = expand_abbreviations(text)
text = collapse_whitespace(text)
text = text.replace('"', '')
```

Ironically, it has other levels of cleaning, but always defaults to the one defined for English. You can (now) disable this by setting `pre_tokenizer` to `null` or `false` in the vocab JSON. I had to do it since it converted all IPAs into ASCII and caused headaches.
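A minimal sketch of doing that programmatically (the path is just the tokenizer file mentioned later in this thread; the rest of the file's layout is assumed):

```
import json

path = "models/tokenizers/ipa.json"  # example path
with open(path, encoding="utf-8") as f:
    vocab = json.load(f)

vocab["pre_tokenizer"] = None  # serialized back out as null

with open(path, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
```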


Anywho, unless I'm neglecting something crucial, it's a bust. 500 epochs with both my aggressive-but-proven LR scheduler and a conservative initial LR to slowly train it produce nothing intelligible (the voice is still captured, at least). In hindsight, practically re-training what each text token represents isn't as easy as simply remapping what each text token sounds like.

I have some ideas on remedies/solutions:

  • do not axe all the pre-existing tokens in the vocab, just replace the last X number of merged tokens with the IPAs instead.
  • have the IPAs map logically to their ASCII counterparts (doable, since you can have multiple tokens map to the same value; see the sketch below).
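A toy illustration of that second idea (the ids are made up; a real vocab is just a token → id map):

```
# Several IPA symbols deliberately share the id of the closest ASCII token
# the model already knows. Ids below are purely illustrative.
vocab = {
    "r": 18, "ɹ": 18,   # ɹ resolves to the existing "r" token
    "a": 7,  "ə": 7,    # schwa falls back onto "a"
}
assert vocab["ɹ"] == vocab["r"]
```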

Although, redefining the token vocab might mean having to retrain an entire new model.


Although, redefining the token vocab might mean having to retrain an entire new model.

I might take a crack at it this weekend, weather permitting. Now that you've added the option to select the tokenizer and diffusion model, it's a possibility.

Owner

Tried again, added some missing IPAs and merges; sounds like James Sunderland if he was Fr*nch: https://vocaroo.com/1c9ylHLoNrwU ("You're not Mary." => `jʊɹ nɑːt mɛɹi.`). Seeing the mapping even be wrong makes me feel like I'm neglecting something. I could just very well need to train it on a larger English dataset (although if that was the case, I should at least be able to replicate my input dataset).
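For reference, that IPA prompt looks like what Phonemizer's eSpeak backend produces; a sketch (exact symbols may vary a bit between eSpeak versions):

```
from phonemizer import phonemize

# "You're not Mary." through the eSpeak backend, en-us.
ipa = phonemize("You're not Mary.", language="en-us", backend="espeak",
                preserve_punctuation=True, strip=True)
print(ipa)  # something close to: jʊɹ nɑːt mɛɹi.
```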

Hope you can get something usable out of it. If it doesn't work, oh well, I needed an IPA phonemizer anyways for VALL-E.


Tried again, added some missing IPAs and merges; sounds like James Sunderland if he was Fr*nch: https://vocaroo.com/1c9ylHLoNrwU ("You're not Mary." => jʊɹ nɑːt mɛɹi.).

That's with the ipa.json from 1a8c5de517 and the existing diffusion model? I wonder what http://ipa-reader.xyz/ is using.

Edit: It's passing SSML to Amazon's Polly TTS.

Owner

That's with the ipa.json from 1a8c5de517 and the existing diffusion model?

Correct.


I tried again, this time with aggressive settings (LR: `0.0001`, schedule: `[2,4,9]`, and not relevant: 44 lines, bs=44, ga=1), just to try and get it to overfit (and it most definitely has overfitted), and it worked-ish: https://vocaroo.com/1ceGyFYPxzjA (ignoring the weird quirk at the end).

![image](/attachments/0afeb573-7176-4787-8cd7-92e3d1c465bb)

Nice to see it is feasible then instead of being a lost cause, and nothing else really needs immediate replacement.

  • desu the diffusion model doesn't need to ever be replaced, it does a decent job at mapping mel tokens to assemble a mel spectrogram, and we're never going to re-define what mel tokens represent like we're redefining what text tokens represent (to my understanding at least).
  • I might then need to train it on an actual large English dataset (LJSpeech, to start), and then use that for future IPA-based finetunes, if it works well enough.
  • however, I believe the CLVP needs to be retrained, or the CVVP is then required to be used instead (again, if my understanding is correct), as the CLVP processes candidates based on the tokenized text input, rather than the mel tokens (and the text tokens are now redefined).

@nk990 Do you have an IPA-annotated Slovenian dataset? I added the missing symbols to models/tokenizers/ipa.json but SOFES is transcribed in SAMPA and the UCLA Phonetics Lab Slovenian corpus is insufficient (and potato-quality recordings).

Author

@nk990 Do you have an IPA-annotated Slovenian dataset? I added the missing symbols to models/tokenizers/ipa.json but SOFES is transcribed in SAMPA and the UCLA Phonetics Lab Slovenian corpus is insufficient (and potato-quality recordings).

@psammites I'm using the Mozilla Common Voice dataset and espeak (I think it's the most accurate for Slavic languages) for annotation, nothing special!

Author

@mrq I suggest a change. Instead of:

`phonemes = phonemizer( text, preserve_punctuation=True, strip=True )`

could you try this:

`phonemes = phonemize(text, language=lang, strip=True, preserve_punctuation=True, with_stress=True, backend='espeak')`

I'm using stress with VITS TTS and it works very well, at least for my language.
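For example, a runnable sketch of the suggested call ("sl" is assumed here as eSpeak's language code for Slovenian):

```
from phonemizer import phonemize

# The suggested call with stress marks kept; the language code is an assumption.
print(phonemize("čevelj", language="sl", strip=True,
                preserve_punctuation=True, with_stress=True, backend="espeak"))
```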

Also, it would be nice to have an additional param like backend so the user can choose between espeak, gruut, or others.

Owner

phonemes = phonemize(text,language=lang,strip=True,preserve_punctuation=True, with_stress=True, backend='espeak')

Added.

Also, it would be nice to have an additional param like backend so the user can choose between espeak, gruut, or others

I don't have it exposed in the web UI, but you can either pass in `--phonemizer-backend=` or set `phonemizer-backend` in `./config/exec.json`.


desu I should probably hold off on training anything substantial until I get a better IPA token vocab, since it's starting to become a real pain to train anything that uses it.


@nk990 Could you have a look at [Wikipedia:Slovene_phonology](https://en.wikipedia.org/wiki/Slovene_phonology) and tell me how accurate it is? In particular the /ʋ/ section:

  • Before a vowel, the pronunciation is labiodental, [ʋ].[5]
  • Before or after a vowel, the pronunciation is bilabial [u̯] and forms a diphthong.[5][13][14]
  • At the beginning of a syllable, before a consonant (for example in vsi 'all'), the pronunciation varies more widely by speaker and area. Many speakers convert /ʋ/ into a full vowel [u] in this position.[5][13] For those speakers that retain a consonantal pronunciation, it pre-labializes the following consonant.[5][13][15] Thus, vsi may be pronounced as disyllabic [uˈsî] or monosyllabic [ˈʷsî].
  • In some dialects /ʋ/ turned into /v/ instead of [u̯]/[w]/[ᵂ] and devoices as a normal obstruent (see consonant changes), so vsi would in those dialects be pronounced [ˈfsî].[16]
Author

Before a vowel, the pronunciation is labiodental, [ʋ].[5]
Before or after a vowel, the pronunciation is bilabial [u̯] and forms a diphthong.[5][13][14]
At the beginning of a syllable, before a consonant (for example in vsi 'all'), the pronunciation varies more widely by speaker and area. Many speakers convert /ʋ/ into a full vowel [u] in this position.[5][13] For those speakers that retain a consonantal pronunciation, it pre-labializes the following consonant.[5][13][15] Thus, vsi may be pronounced as disyllabic [uˈsî] or monosyllabic [ˈʷsî].
In some dialects /ʋ/ turned into /v/ instead of [u̯]/[w]/[ᵂ] and devoices as a normal obstruent (see consonant changes), so vsi would in those dialects be pronounced [ˈfsî].[16]

@psammites yes it's correct


@psammites yes it's correct

So before a vowel is it labiodental? Or is it bilabial and part of a diphthong?

`phonemize` and the SAMPA included in [SOFES](https://www.clarin.si/repository/xmlui/handle/11356/1125) (converted to IPA) don't agree:

| Spelling | Phonemize | SOFES |
| --------- | ---------- | --------- |
| čevelj | tʃeːʋɛʎ | tʃevəl |
| evgen | eːwɡən | eʊgən |
| želiva | ʒɛliːʋa | ʒɛliva |
| žemva | ʒeːmʋa | ʒemva |
| dvajseti | dʋajseːti | dvaɪsti |
| miroslav | miroːslaw | mirɔslaʊ |
| petkovšek | pɛtkoːwʃɛk | pɛtkɔʊʃɛk |
| slivnik | sliːwnik | sliʊnik |
| uvrstiti | uʋərstiːti | uvərstiti |
| včasih | utʃaːsih | wtʃasix |
| vaših | ʋaːʃih | vaʃix |
| vrhovnik | ʋərxoːwnik | vərxɔʊnik |
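For reference, the Phonemize column above can be reproduced with something like the following sketch ("sl" assumed as the eSpeak language code for Slovenian; output may vary slightly by eSpeak version):

```
from phonemizer import phonemize

words = ["čevelj", "evgen", "želiva", "dvajseti", "vrhovnik"]
# eSpeak-backed phonemization of a few of the table's entries.
for w in words:
    print(w, phonemize(w, language="sl", backend="espeak", strip=True))
```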
Author

@psammites yes it's correct

So before a vowel is it labiodental? Or is it bilabial and part of a diphthong?

phonemize and the SAMPA included in SOFES (converted to IPA) don't agree...

@psammites The major issue with the Slavic language group is its complexity, and I'm not enough of a linguistics expert to explain it to you.

So this is true:

Before a vowel, the pronunciation is labiodental, [ʋ].[5]
Before or after a vowel, the pronunciation is bilabial [u̯] and forms a diphthong.[5][13][14]

But there are also too many cases that deviate from the rules.
