Mispronouncing certain letters for Slavic languages #133
I did a finetune on Slovenian and it has decent quality. It's missing emotions, but I think that's normal because of the small dataset. What does bother me is the mispronunciation of certain letters: Č needs to be pronounced as tʃ, not k, and Ž as ʒ, not z.
Should I also finetune the vocoder, or is it maybe the text cleaner that's breaking things?
Did you happen to train with the default Text LR Ratio?
For new languages, you'll want to increase the text LR ratio to 1, as you're effectively re-teaching the model a new language (or specifically, a new sequence of phonemes to expect).
There's a possibility the VQVAE needs adjustment too, but I'm not too sure if that's necessary as I've had decent luck with just increasing the text LR ratio for Japanese.
Nope, I used the default config settings. Gonna try increasing the text LR ratio. Can you also share the config file you've been using for Japanese, just to compare parameters?
Does this also apply to training a voice sample with a heavy accent, or should I leave it at the default as long as it's some form of English?
Nothing special, just max the LR sliders. The finer LR schedule will make it bake faster but cranks the LR down to a good rate by epoch 9:
[2, 4, 9, 18, 25, 33, 50]
The only issue, though, is that I've done some different batches over the past few days, and I haven't gotten to check my latest batch, something like:
Just crank the three sliders to max, and set your epochs / batch size / gradient accumulation size as needed.
Hold up, by "dataset size of ~8k" do you mean train.txt was ~8kb or ~8k clips?
Mmm, if it's strictly accented English, you shouldn't need to up the text LR ratio. As backwards as it sounds, "teaching" it how the phonemes should sound is how training an accent should go.
Now, if that dialect does show through in the text (example: oi guv), then yeah, you'll need to up the text LR, as it's effectively a different language (phoneme sequence).
Although this is just conjecture based on my current understanding of the AR model, which seems to change on a dime randomly every week.
8k lines, one clip per line, 8k clips. I don't have a total audio metric, but providing that metric is meaningless desu, since I believe each sequence is a fixed mel token length. Providing more or less audio cumulatively won't impact it as much.
I don't think a large dataset size is all that imperative for teaching it a new language/dialect, but I needed something large as my very first cursory test on one voice with a low dataset size had some missing features of Japanese, due to there not being anything for the model to "learn".
お兄ちゃん、大きすぎる~!
What settings are you using in "Prepare Dataset" so that you don't have to check and fix each clip manually? How big is your validation.txt?
Trim silence, text cull length 4, audio cull length 1 second. Nothing major, just leveraging the implicit slicing for lines that are too long (and the segments are decent, at least).
For the 8k, I'm not sure, as I don't have that saved, since it was from before the implicit slicing of large lines.
The 15k one clocks at 898 lines for validation.
Just to be explicit: "text cull length" is Validation Text Length Threshold, and "audio cull length" is Validation Audio Length Threshold, correct? Nothing for the offsets?
Oh, actually, you could switch out the tokenizer for niche languages. I was doing a glance-over at how tortoise tokenizes text (since I was theorizing if providing your own phonemes for training/inference would be beneficial at all), and it's just a bog standard HF tokenizer with a config.
I'm not too sure of the implications though. They're not true phonemes, rather virtual ones (in the loosest sense). In theory, if you were able to magically source a better tuned tokenizer.json for a language, you just need to provide it during training and inferencing, but you'd have to make extra sure you increase the text LR ratio up to 1, as you're completely re-writing which tokens mean what, rather than "teaching" it a new token sequence (language).
That said, I'm not too sure how necessary this is, as I got decent results from the standard tokens with Japanese. It could very well be acceptable to go about it from the backwards direction of rewriting what each "phoneme" (text token) sounds like (map to which mel tokens).
I might just have the training configuration template point to tortoise's tokenizer.json (since they're the same, I didn't actually need to provide it in this repo), so a user can replace that one file under
./modules/tortoise-tts/data/tokenizers.json
if they want, rather than have a niche option in the training configuration generator.
As for a tool to make one, that's unfortunately left as an exercise for the end user.
Although, it might be fairly simple to correct it by hand; replace a similar sound in English with the characters of what it corresponds to, and maybe add some sound merges in the merge array.
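For illustration, here's roughly what that hand-correction could look like (a minimal sketch only; the path is the one mentioned above, the layout is assumed to be the standard HF BPE one with model.vocab / model.merges, and the q/x mapping is just a hypothetical example, not something validated):

import json

# Illustrative only: reuse the IDs of tokens you're willing to sacrifice (as
# suggested above) rather than growing the vocab, since brand-new IDs wouldn't
# line up with the model's existing text embedding without retraining.
TOKENIZER_PATH = "./modules/tortoise-tts/data/tokenizers.json"  # path as referenced above
REPLACEMENTS = {"q": "č", "x": "ž"}  # hypothetical mapping of sacrificed -> niche characters

with open(TOKENIZER_PATH, encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]  # assumed standard HF BPE layout (model.vocab / model.merges)

for old, new in REPLACEMENTS.items():
    if old in vocab and new not in vocab:
        vocab[new] = vocab.pop(old)  # "č" now maps to the ID that "q" had

# Sound merges for common digraphs would go into tok["model"]["merges"],
# with matching vocab entries for the merged results.

with open(TOKENIZER_PATH, "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)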
Correct. I can never remember what I actually call the settings.
Not necessary, unless you're consistently getting audio trimmed either too much or not enough on either end, and you can blanket-correct it by providing slice time offsets.
There are a number of IPA tokenizers available but the fact that none of the TTS solutions I'm familiar with use them makes me think they either wouldn't provide as big a quality improvement as I think they would or that there's some kind of performance hit involved with the additional codespace. Or maybe I'm the only weirdo who wants to do generation with that level of granularity~
I caved and added a way to override the tokenizer JSON under
Settings
, because I realized it actually does affect Japanese (at least, from seeing it merge the "phonemes"). The overridden tokenizer JSON also gets used in place for the training configuration.
I don't think it's necessary for me to retrain (again, for the unknownth time) my Japanese finetune, as I probably bruteforced my way through token merging.
Both of the VALL-E implementations use them, at least, the shitty bloated one:
bʌt_stænli_dɪdnt_wɔnt_tə_ɡoʊ_bæk_tə_ðɪ_ɑːfɪs.
The clean one doesn't literally use IPA phonemes, but rather what seems to be an ASCII representation:
AY1 _ AE1 M _ K AY1 N D _ AH1 V _ L AO1 S T _ AY1 _ AE1 M _ L UH1 K IH0 NG _ F AO1 R _ S AY1 L AH0 N T _ HH IH1 L
So just given that alone, I'm sure VALL-E will handle those sorts of things much better than TorToiSe.
Only 15 epochs? Is this a typo? I've been doing 200-1500 for most of my training, and that's just for English voices lol
I know with 8k clips that's probably a lot of sets per epoch, it just seemed like a very low number! I thought the training needed to iterate the same files hundreds of times in order to learn anything
desu the first finetune test has a much smaller size (dataset size of 4.5k for 11 epochs). Granted, all of the hyperparameters play a role in how decent something is trained (batch size and gradient accumulation factor quite a bit, even single vs multi-GPU seems to affect things too), so there's no good simple answer on it outside of how nice of a loss curve you get.
It's conjecture, but in my experience I feel a language benefits more from a varied dataset than from many iterations, whereas finetuning for speech (a voice) is more about iterations over a varied dataset (although a varied dataset does help; I've gotten decent results even with a small dataset by just bruteforcing it to a very low loss rate).
Ahh interesting. Yeh I've toyed with learning rates even lower than the 0.00001 default with very small datasets.
Still gradually figuring out what a good learning curve might look like!
At last, I can train it to speak Ubykh (or at least pronounce გვფრცქვნი)!
That's ARPABET, it only covers phonemes of American English.
@psammites Can you also share your experience with us? Like how many clips do you have in your dataset, how many epochs did you set for that dataset, did you change any other params, and did you override the tokenizer?
@mrq I set params like you suggested:
LR: 0.0001
Mel LR ratio: 1.0
Text LR ratio: 1.0
MultiStepLR, default schedule ([2, 4, 9, 18, 25, 33, 50])
My dataset has 383 clips, about 47 min of audio, and even with that small dataset the quality is very good.
Last night I tried training 10 epochs, then 20, 50, 100; the quality is always good, no big differences, but the mispronunciation of č, ž still persists...
I'll try to override the tokenizer and run training again, I'll let you know when it's done!
I've been using comparably tiny datasets (16, 32, 48 clips) because of the time involved in reviewing the .wav's and proofreading the transcriptions (no amount of mucking with the split offsets has produced acceptable automated results). So far I've limited my testing to English with strong regional and ESL accents:
North Korean ESL Speaker: 16 max length lines in batches of 16. Was crap at 2500 epochs, is now pretty good after 5000 epochs. Validation disabled.
Northern Indian Native(?) Speaker: This was the first model I trained on an earlier version before segmentation or validation were implemented. Fairly good results after 500 iterations with ~48 lines (I don't remember exactly, I've deleted and reinstalled several times since then and kept the model but not the logs). I spent hours with ffmpeg filters trying to lose the traffic honking in the background and I don't think it made a difference.
Southern US (Alabama) Native Speaker: I think 32 clips for this one, 500 epochs? Good results, I hardly had to touch the transcription for this one. The second model I trained, also before splitting and validation were added.
Gulf Arabic ESL Speaker: Almost done baking this one, 32 lines in batches of 16, 500 epochs, 32 lines validation. Just hit epoch 450 (iteration 900) after about 50 minutes of training.
Vietnamese ESL(?) Speaker: Terrible results no matter what I did, dataset size, etc. The only thing I didn't try was ditching validation. On the backburner for now because I suspect the default tokenizer isn't equipped to handle tonal languages or the resulting accents.
Northern Chinese ESL Speaker: Another one with horrible results no matter what I did. See tonal accent theory above.
(I've been using YouTube videos as training data because I can pull the subtitles down as json with yt-dlp and use jq to mung them into segment lists for chopping with ffmpeg. They're millisecond-granular, so superior to what whisper produces. The transcription quality is sometimes better too.)
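(A Python equivalent of that pipeline, as a rough sketch; the actual workflow uses jq, and the file names, the json3 field layout, and the train.txt "path|text" format here are all assumptions:)

import json
import subprocess

SUBS = "video.en.json3"   # hypothetical: subtitles pulled down with yt-dlp as json3
AUDIO = "video.wav"       # hypothetical: the extracted audio track

with open(SUBS, encoding="utf-8") as f:
    events = json.load(f)["events"]

lines = []
for i, ev in enumerate(e for e in events if e.get("segs")):
    start = ev["tStartMs"] / 1000.0
    dur = ev.get("dDurationMs", 0) / 1000.0
    text = "".join(seg.get("utf8", "") for seg in ev["segs"]).strip()
    if not text or dur <= 0:
        continue
    clip = f"clip_{i:04d}.wav"
    # millisecond-granular slicing, one clip per caption
    subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-t", str(dur), "-i", AUDIO, clip], check=True)
    lines.append(f"audio/{clip}|{text}")

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))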
I left all the learning rates at the defaults and used MultiStepLR. I started disabling validation because I got better results before it was introduced and with it my models sounded worse and worse the more I trained them. (Also because I don't have the time to manually validate a second dataset.) No apparent quality difference between models trained on a single GPU and those trained on multiple GPU's.
My first non-English test is likely to be Malaysian because the phonemic inventory is a subset of the English one. If the results are acceptable I'll move on to standard Indonesian (baby steps!) and see if I can get tortoise-tts to roll its R's. Provided that all works I'll do some actual tokenizer modification and add ñ to see if we can get a distinct /ɲ/ when training on Spanish.
For your Slovenian model, have you tried constructing a stripped down dataset with a high number of č/ž minimal pairs?
You might also try transcribing them as ch and zh, not that I can articulate a reason why other than "Unicode is sometimes weird". I was on the right track but the wrong train. See edit below.
Edit: Gulf Arabic ESL accent model is accurate but a little metallic sounding on faster presets. Going to throw it back in the oven for another 500 and then bake an Egyptian Arabic ESL accent model with the same settings for comparison.
Post-nap edit: Replace [č] with [q], [ž] with [x], crank the Text LR Ratio to the max and let her rip. Should work without replacing the tokenizer or diffusion model.
Post-coffee edit: If you still need to replace the tokenizer you could yamlize this and swap it into bpe_lowercase_asr_256.json's vocab section, but you'd still have to find or train a matching diffusion model. (I didn't have any luck turning one up but then again I don't speak Slovenian.)
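(A sketch of that substitution applied to the dataset text; the file name and "path|text" line format are assumptions, and the same mapping would of course have to be applied to prompt text at inference time:)

# Rewrite transcriptions so č/ž reuse characters the default tokenizer already knows.
MAPPING = str.maketrans({"č": "q", "Č": "q", "ž": "x", "Ž": "x"})

with open("train.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

remapped = []
for line in lines:
    path, text = line.split("|", 1)
    remapped.append(f"{path}|{text.translate(MAPPING)}")

with open("train_remapped.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(remapped))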
Changed title from "Mispronouncing certain letters for Slovenian language" to "Mispronouncing certain letters for slavic languages".
This can be a workaround for Slovenian, but what if you have Croatian, where you have:
and some other particular cases... how to make a workaround then?
That's about pushing the limits of what you can do without replacing the tokenizer:
YMMV no warranty expressed or implied fasten your seatbelt cross your fingers wear a helmet and click "Train".
Edit: @mrq mentioned VALL-E above. There's an academic paper out on a multi-lingual version called "VALL-E X" but AFAIK they haven't released the code yet.
Validation has no effect on training quality (it sometimes will eat up your total iteration count and cause training to terminate early, but that's more of a bug).
VALL-E X is more for a model that outputs multi-lingual speech. VALL-E already (should) support training non-English models.
Even if your validation.txt contains bad transcriptions?
This doesn't look like anywhere near enough coverage to do true multi-lingual speech. Unless they're getting it somewhere else?
Correct. It's just a way to get a de facto metric to see how well a model is at handling outside data. Nothing from the validation pass gets used for training.
That's generated from the LibriTTS dataset specifically. Any other dataset will have a different list of unique tokens.
As I understand it if I provide my own dataset then I have to provide my own inference model trained on it, which would remove the impetus to migrate to VALL-E. (I'm less than impressed by their demo page: they've cherrypicked so that "Speaker Prompt" and "Ground Truth" are always from the same accent and sex, which makes me think cross-sex and cross-dialect output are terrible. If they could do something more challenging like making a banana-bender sound like a newfie I think they'd be promoting it.)
No shit. You already need to provide your own model anyways, as there's no publicly released one, hence the statement on muh ethics at the bottom of the demo page.
desu TorToiSe's demo page was also very lackluster; they're moreso demonstrations with zero-shot inferencing (and even then, a little love and elbow grease makes TorToiSe's zero-shot inferencing with the base model actually decent without finetunes).
Anyways, VALL-E specifically is outside of the scope of discussion despite my not-so-hidden commits about integrating it. I shouldn't be discussing it until I get something trained, and I don't have the time to babysit training during the week.
Anyways. I mentioned the VALL-E implementations using phonemes instead of the rather archaic tokenization of common English elements (not TorToiSe's fault specifically, GPT transformers are notorious for tokenizing like that); I could just modify the tokenization process entirely:
Although, you might need to supply per-language merging of phonemes, and I don't know specifically how well a model will perform without merged tokens (I did a quick test, but I didn't spend much time on testing it).
I'm sure I'll catch a wild hair and implement it, as it's something that sounds solid in principle. I just don't know when I'll get to it.
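(As a rough sketch of what that would look like on the text side, using the phonemizer package that comes up below; the language codes and sample lines are just placeholders, and output depends on the installed eSpeak NG voices:)

from phonemizer import phonemize

def to_ipa(text: str, lang: str = "en-us") -> str:
    # Convert dataset/prompt text into IPA phonemes via the eSpeak backend.
    return phonemize(
        text,
        language=lang,
        backend="espeak",
        strip=True,
        preserve_punctuation=True,
        with_stress=True,
    )

print(to_ipa("But Stanley didn't want to go back to the office."))
print(to_ipa("Dober dan.", lang="sl"))  # Slovenian, assuming eSpeak NG's "sl" voice is available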
Yea, so if I'm going to have to train a model I might as well train one for the software that I already know can do what I want; Fuck training for a whole week just to make a middle-aged British dude sound like a slightly different middle-aged British dude inside a tin can. Plus they're built against an older version of CUDA Toolkit and I burnt way more than enough time wrangling with CUDA incompatibilities doing the WSL setup.
You know what would make it even better? If tts.get_conditioning_latents() could use multiple GPU's...
pip install git+https://github.com/bootphon/phonemizer
$env:PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll'
(or wherever)
A nightmare to think about implementing, and beyond the scope of the matter.
I both used set and manually declared the env var in whatever Windows provides, and it didn't like that. I don't remember what got it to work, but it was unintuitive enough that I'm not going to leverage it.
There's allosaurus that I'll just leverage instead:
My only issue, though, is that I would still need a text phonemizer for inferencing, defeating the entire purpose of using an audio-based one.
If you did it via the Edit Environment Variables control panel you need to close your PowerShell session and open a new one for the change to apply, but where I think people get tripped up is that the eSpeak NG documentation gives one the impression that the thing to do is set ESPEAK_DATA_PATH to "C:\Program Files\eSpeak NG\espeak-ng-data", but that's for running espeak-ng.exe itself, which isn't what we want. Phonemizer just wants the path to the .dll that came in the package.
Edit: To be clear, for anyone else who might be having trouble installing on Windows: you need to put the full path, including the file name of the DLL, in PHONEMIZER_ESPEAK_LIBRARY. Ignore ESPEAK_DATA_PATH; Phonemizer doesn't care about it.
Another issue is that tortoise wants floating-point sampled wav files but allosaurus chokes unless they're integer sampled:
I did everything under the sun until I copied the DLL to the current working directory and it Just Worked™.
Anyways, the entire experience soured me enough that I'll refuse to leverage eSpeak.
Already resolved with stuff like
torchaudio.save(f"{indir}/audio/{basename}", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
, which I've been intending to do anyways to reduce filesize for training (precision doesn't matter when they're stuck at 22.5KHz).
Nevermind, Allosaurus isn't accurate enough. Phonemizer with eSpeak magically works too, so I don't know what broke.
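(For reference, a self-contained version of that one-liner; the paths are placeholders:)

import torchaudio

# Re-save a clip as 16-bit integer PCM, as in the snippet above, so tools that
# choke on float-sampled wavs can read it (and to cut filesize for training).
waveform, sample_rate = torchaudio.load("audio/clip_0001.wav")
torchaudio.save(
    "audio/clip_0001_pcm16.wav",
    waveform,
    sample_rate,
    encoding="PCM_S",
    bits_per_sample=16,
)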
Japanese with Phonemizer on Windows is impossible, and even then I'd have to parse it as romaji on Linux, so out with Phonemizer + eSpeak.
I did a quick train on James Sunderland again, but with the training dataset text as IPA phonemes; it works, to say the least:
ɹ and ə show up as unknown tokens, and it's literally no difference from just using the default tokenizer vocab without merges.
I'm starting to feel it's a bit of a silly endeavor to go on.
You're likely running into two problems, the first is, well...

And the second is this, thankfully, a one-line fix.
And if you don't want to mess with patching Phonemizer (or using another version), then eSpeak can do it with some coaxing:
Allegedly the UTF8 handling has been fixed in the new cross-platform version of PowerShell but I haven't tried yet.
With the default tokenizer, yea. It just strips out all the accents. That's part of the reason @nk990 is running into issues: to it, there's no difference between [č] and [c] (or [ć], or [ĉ], or [ç], &c...)
@psammites in fact it was very strange, though... I've tried almost everything but with no success...
That's because DLAS/tortoise's VoiceBpeTokenizer will preprocess all text:
Ironically, it has other levels of cleaning, but always defaults to the one defined for English. You can (now) disable this by setting pre_tokenizer to null or false in the vocab JSON. I had to do it since it converted all the IPA into ASCII and caused headaches.
Anywho, unless I'm neglecting something crucial, it's a bust. 500 epochs with both my aggressive-but-proven LR scheduler and a conservative initial LR to slowly train it produce nothing intelligible (the voice is still captured, at least). In hindsight, practically re-training what each text token represents isn't as easy as simply remapping what each text token sounds like.
I have some ideas on remedies/solutions:
Although, redefining the token vocab might mean having to retrain an entire new model.
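(For anyone wanting to replicate the pre_tokenizer change mentioned above without hand-editing, it's a one-key tweak to the vocab JSON; a sketch, with the file path a placeholder:)

import json

VOCAB_PATH = "./models/tokenizers/ipa.json"  # placeholder path for the custom vocab JSON

with open(VOCAB_PATH, encoding="utf-8") as f:
    vocab_json = json.load(f)

# Disable the English text-cleaning pass so IPA characters aren't ASCII-folded.
vocab_json["pre_tokenizer"] = None

with open(VOCAB_PATH, "w", encoding="utf-8") as f:
    json.dump(vocab_json, f, ensure_ascii=False, indent=2)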
I might take a crack at it this weekend, weather permitting. Now that you've added the option to select the tokenizer and diffusion model, it's a possibility.
Tried again, added some missing IPAs and merges; sounds like James Sunderland if he was Fr*nch: https://vocaroo.com/1c9ylHLoNrwU ("You're not Mary." =>
jʊɹ nɑːt mɛɹi.
). Seeing even the mapping be wrong makes me feel like I'm neglecting something. I could very well just need to train it on a larger English dataset (although if that were the case, I should at least be able to replicate my input dataset).
Hope you can get something usable out of it. If it doesn't work, oh well, I needed an IPA phonemizer anyways for VALL-E.
That's with the ipa.json from 1a8c5de517 and the existing diffusion model? I wonder what http://ipa-reader.xyz/ is using.
Edit: It's passing SSML to Amazon's Polly TTS.
Correct.
I tried again, this time with aggressive settings (LR: 0.0001, schedule: [2,4,9], and not relevant: 44 lines, bs=44, ga=1), just to try and get it to overfit (and it most definitely has overfitted), and it worked-ish: https://vocaroo.com/1ceGyFYPxzjA (ignoring the weird quirk at the end).
Nice to see it is feasible then instead of being a lost cause, and nothing else really needs immediate replacement.
@nk990 Do you have an IPA-annotated Slovenian dataset? I added the missing symbols to models/tokenizers/ipa.json but SOFES is transcribed in SAMPA and the UCLA Phonetics Lab Slovenian corpus is insufficient (and potato quality recordings).
@psammites I'm using the Mozilla Common Voice dataset and espeak (I think it's the most accurate for Slavic languages) for annotation, nothing special!
@mrq I suggest a change. Instead of:
phonemes = phonemizer( text, preserve_punctuation=True, strip=True )
could you try this:
phonemes = phonemize(text, language=lang, strip=True, preserve_punctuation=True, with_stress=True, backend='espeak')
I'm using stress with VITS TTS and it works very well, at least for my language.
Also, it would be nice to have an additional param like backend so the user can choose between espeak, gruut, or others.
Added.
I don't have it exposed in the web UI, but you can either pass in --phonemizer-backend= or set phonemizer-backend in ./config/exec.json.
desu I should probably wait to try and train anything substantial until I get a better IPA token vocab, since it's starting to become a real pain to train anything substantial that uses it.
@nk990 Could you have a look at Wikipedia:Slovene_phonology and tell me how accurate it is? In particular the /ʋ/ section:
@psammites yes it's correct
So before a vowel is it labiodental? Or is it bilabial and part of a diphthong?
phonemize and the SAMPA included in SOFES (converted to IPA) don't agree:
@psammites The major issue with the Slavic language group is its complexity, and I'm not enough of a linguistics expert to explain it to you.
So this is true:
But there are also too many cases that deviate from the rules.