Getting total gibberish when finetuning on a new language #221

Closed
opened 2023-04-29 14:41:37 +00:00 by arrivederci · 5 comments

I have a 30-hour dataset of spoken Dutch, segmented into clips of 2 to 10 seconds. I generated a new tokenizer.json with a script that builds one from a large amount of text in the desired language, which I found on the old finetuning [repo](https://github.com/152334H/DL-Art-School/discussions/51).

I am also using only the basic cleaners, just like the Japanese model.

However, I am getting gibberish when training the model and I don't know why. Should I use the IPA tokenizer instead? Am I missing something else here?

Thanks in advance
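For reference, here is a sanity check worth running (a sketch, assuming the tokenizer.json was built with the Hugging Face `tokenizers` library; the filenames are made up): if the vocabulary doesn't cover the cleaned transcripts, the tokenizer collapses them into the unknown token, which produces exactly this kind of gibberish.

```python
# Toy demonstration: train a tiny BPE vocab on one alphabet, then encode
# text containing characters the vocab has never seen. Filenames and the
# sample corpus are placeholders.
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

Path("sample.txt").write_text("dit is een voorbeeldzin\n" * 50, encoding="utf-8")

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train(
    files=["sample.txt"],
    trainer=BpeTrainer(vocab_size=64, special_tokens=["[UNK]"]),
)

# Text covered by the training alphabet encodes normally; characters the
# vocab never saw fall back to [UNK].
print(tok.encode("dit is een zin").tokens)
print(tok.encode("žluťoučký kůň").tokens)
```

If your real transcripts come back as mostly `[UNK]`, the tokenizer was trained on text that doesn't match what the cleaners emit.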


How long have you trained the model for?

Author

4 epochs. I have added my train.yaml and tokenizer.json below as text files. I'm currently also transcribing a 300-hour dataset to see if that helps.

[This](https://github.com/152334H/DL-Art-School/discussions/51) is how I generated my tokenizer.json; it certainly seemed to work for that user. It's also quite an interesting read on teaching the model new languages.

Do you happen to know if the IPA tokenizer would be easier?


I don't think the IPA tokenizer would be required for Dutch. What does your loss graph look like?

Author

OK, never mind, it is actually producing pretty good output now, correctly pronouncing most of the words. I retrained on a 200-hour dataset of Dutch audiobooks overnight. The voice cloning doesn't really work, and there's the occasional English accent, but that's probably because my learning rate was 1e-4 and it only trained for 2 epochs.

Anyway, it's still really encouraging, and I'm glad the custom tokenizer works.

For anyone interested, I put the script for generating my Dutch tokenizer below. Just give it some large text files in your target language (I gave it 3 ebooks) and it will generate one for you. Maybe it would be good to put this in the wiki, since this has come up a couple of times.
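Since the attachment itself isn't reproduced in this thread, here is a minimal sketch of what such a generation script looks like, assuming the Hugging Face `tokenizers` library; the filenames and vocab size are placeholders, not values from my setup:

```python
# Minimal sketch of a tokenizer-generation script (Hugging Face `tokenizers`).
# Filenames and vocab_size are placeholders -- point `files` at a few large
# plain-text files (e.g. ebooks) in the target language.
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# A tiny sample corpus is written here so the sketch runs as-is; replace it
# with your real text files.
Path("dutch_sample.txt").write_text(
    "dit is een voorbeeldzin in het nederlands\n" * 100, encoding="utf-8"
)

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=256,  # placeholder; match whatever vocab size your model expects
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)
tokenizer.train(files=["dutch_sample.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```

The `[STOP]`/`[UNK]`/`[SPACE]` special tokens mirror the layout of the stock TorToiSe tokenizer; check your model's config before trusting that exact list.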


> OK, never mind, it is actually producing pretty good output now, correctly pronouncing most of the words. I retrained on a 200-hour dataset of Dutch audiobooks overnight. The voice cloning doesn't really work, and there's the occasional English accent, but that's probably because my learning rate was 1e-4 and it only trained for 2 epochs.
>
> Anyway, it's still really encouraging, and I'm glad the custom tokenizer works.
>
> For anyone interested, I put the script for generating my Dutch tokenizer below. Just give it some large text files in your target language (I gave it 3 ebooks) and it will generate one for you. Maybe it would be good to put this in the wiki, since this has come up a couple of times.

Can you tell me how to use it, please?

Reference: mrq/ai-voice-cloning#221