Getting total gibberish when finetuning on a new language #221
Reference: mrq/ai-voice-cloning#221
I have a 30-hour dataset of spoken Dutch, segmented into clips of 2-10 seconds. I generated a new tokenizer.json with a script, found on the old finetuning repo, that builds one from a large amount of text in the desired language.
I am also only using the basic cleaners, just like the Japanese model.
However, I am getting gibberish when training the model and I don't know why. Should I use the IPA tokenizer instead? Am I missing something else here?
Thanks in advance
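One quick sanity check for a setup like this is whether every character in the cleaned transcripts can actually be spelled with the tokens in the new vocabulary. The sketch below is only an illustration under assumptions: it assumes tokenizer.json keeps its vocabulary under `model.vocab`, that the basic cleaners just lowercase and collapse whitespace, and that spaces are covered by a bracketed special token. The function names are made up for this example.

```python
import json
import re


def basic_clean(text):
    # Assumed behaviour of the "basic cleaners": lowercase, collapse whitespace.
    return re.sub(r"\s+", " ", text.lower()).strip()


def uncovered_characters(tokenizer_json, transcripts):
    """Return the characters in the transcripts that no vocab token contains."""
    # Assumption: the vocabulary lives under model.vocab as {token: id}.
    vocab = json.loads(tokenizer_json)["model"]["vocab"]

    known = set()
    for token in vocab:
        # Skip bracketed special tokens such as [STOP] or [SPACE].
        if re.fullmatch(r"\[[A-Z]+\]", token):
            continue
        known.update(token)

    seen = set()
    for line in transcripts:
        seen.update(basic_clean(line))
    # Assumption: spaces are handled by a special token, so ignore them here.
    seen.discard(" ")

    return sorted(seen - known)
```

To try it, pass the contents of tokenizer.json as a string together with an iterable of transcript lines. An empty result means every character is covered; anything listed would most likely end up as an unknown token during training.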
How long have you trained the model for?
4 epochs. I have added my train.yaml and tokenizer.json below as text files. I am currently also transcribing a 300h dataset to see if that helps.
This is how I generated my tokenizer.json; it certainly seemed to work for this guy. It is also quite an interesting read on learning new languages.
Do you happen to know if the IPA tokenizer would be easier?
I don't think the IPA tokenizer would be required for Dutch. What does your loss graph look like?
OK, never mind, it is actually producing pretty good output now, correctly pronouncing most of the words. I retrained overnight on a 200-hour dataset of Dutch audiobooks. The voice cloning doesn't really work, and there's the occasional English accent, but that's probably because my learning rate was 1e-4 and it only trained for 2 epochs.
Anyway, it's still really encouraging and I'm glad the custom tokenizer works.
For anyone interested, I put the script for generating my Dutch tokenizer below. Just give it some large text files in your target language (I gave it 3 ebooks) and it will generate one for you. Maybe it would be good to put this in the wiki, since this has come up a couple of times.
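The attached script is not reproduced in this thread, so here is a rough, dependency-free sketch of what a tokenizer generator of this kind does conceptually: it learns byte-pair-encoding (BPE) merges from plain text. This is not the original script, and its output is not claimed to be a tokenizer.json the trainer will accept. The special tokens, the default vocabulary size of 255, and the names `train_bpe`, `merge_pair` and `basic_clean` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed special tokens; not taken from the thread.
SPECIAL_TOKENS = ["[STOP]", "[UNK]", "[SPACE]"]


def basic_clean(text):
    # Assumed "basic cleaners": lowercase and collapse whitespace.
    return re.sub(r"\s+", " ", text.lower()).strip()


def merge_pair(word, a, b):
    """Replace every adjacent (a, b) in a tuple of symbols with a + b."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe(texts, vocab_size=255):
    """Learn BPE merges from an iterable of strings (toy implementation)."""
    # Count each word, stored as a tuple of single characters.
    words = Counter()
    for text in texts:
        for word in basic_clean(text).split(" "):
            if word:
                words[tuple(word)] += 1

    # Start from the special tokens plus every character seen.
    alphabet = sorted({ch for word in words for ch in word})
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + alphabet)}
    merges = []

    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, count in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # nothing left to merge

        # Most frequent pair wins; ties are broken by the pair itself.
        (a, b), _ = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append((a, b))
        vocab.setdefault(a + b, len(vocab))

        # Apply the merge to every word before the next round.
        merged = Counter()
        for word, count in words.items():
            merged[merge_pair(word, a, b)] += count
        words = merged

    return vocab, merges
```

To run it on a few ebooks, read each text file into a string (UTF-8) and pass the list of strings to `train_bpe`. The small vocabulary keeps tokens close to characters and short letter groups, which is the general idea behind a per-language tokenizer like the one described above.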
Can you explain how to use it, please?