Getting total gibberish when finetuning on a new language #221

Closed
opened 2023-04-29 14:41:37 +00:00 by arrivederci · 5 comments

I have a 30-hour dataset of spoken Dutch, segmented into clips of 2 to 10 seconds. I generated a new tokenizer.json with a script that builds one from a large amount of text in the desired language, which I found on the old finetuning [repo](https://github.com/152334H/DL-Art-School/discussions/51).

I am also using only the basic cleaners, just like the Japanese model.

However, I am getting gibberish when training the model and I don't know why. Should I use the IPA tokenizer instead? Am I missing something else here?

Thanks in advance
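For reference, here is a sanity check worth running (a sketch, assuming the tokenizer.json was built with the Hugging Face `tokenizers` library; the filenames are made up): if the vocabulary doesn't cover the cleaned transcripts, the tokenizer collapses them into the unknown token, which produces exactly this kind of gibberish.

```python
# Toy demonstration: train a tiny BPE vocab on one alphabet, then encode
# text containing characters the vocab has never seen. Filenames and the
# sample corpus are placeholders.
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

Path("sample.txt").write_text("dit is een voorbeeldzin\n" * 50, encoding="utf-8")

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train(
    files=["sample.txt"],
    trainer=BpeTrainer(vocab_size=64, special_tokens=["[UNK]"]),
)

# Text covered by the training alphabet encodes normally; characters the
# vocab never saw fall back to [UNK].
print(tok.encode("dit is een zin").tokens)
print(tok.encode("žluťoučký kůň").tokens)
```

If your real transcripts come back as mostly `[UNK]`, the tokenizer was trained on text that doesn't match what the cleaners emit.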


How long have you trained the model for?

Author

4 epochs. I have added my train.yaml and tokenizer.json below as text files. I'm currently also transcribing a 300-hour dataset to see if that helps.

[This](https://github.com/152334H/DL-Art-School/discussions/51) is how I generated my tokenizer.json; it certainly seemed to work for that user. It's also quite an interesting read on teaching the model new languages.

Do you happen to know if the IPA tokenizer would be easier?


I don't think the IPA tokenizer would be required for Dutch. What does your loss graph look like?

Author

OK, never mind, it is actually producing pretty good output now, correctly pronouncing most of the words. I retrained on a 200-hour dataset of Dutch audiobooks overnight. The voice cloning doesn't really work, and there's the occasional English accent, but that's probably because my learning rate was 1e-4 and it only trained for 2 epochs.

Anyway, it's still really encouraging, and I'm glad the custom tokenizer works.

For anyone interested, I put the script for generating my Dutch tokenizer below. Just give it some large text files in your target language (I gave it 3 ebooks) and it will generate one for you. Maybe it would be good to put this in the wiki, since this has come up a couple of times.
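Since the attachment itself isn't reproduced in this thread, here is a minimal sketch of what such a generation script looks like, assuming the Hugging Face `tokenizers` library; the filenames and vocab size are placeholders, not values from my setup:

```python
# Minimal sketch of a tokenizer-generation script (Hugging Face `tokenizers`).
# Filenames and vocab_size are placeholders -- point `files` at a few large
# plain-text files (e.g. ebooks) in the target language.
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# A tiny sample corpus is written here so the sketch runs as-is; replace it
# with your real text files.
Path("dutch_sample.txt").write_text(
    "dit is een voorbeeldzin in het nederlands\n" * 100, encoding="utf-8"
)

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=256,  # placeholder; match whatever vocab size your model expects
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)
tokenizer.train(files=["dutch_sample.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```

The `[STOP]`/`[UNK]`/`[SPACE]` special tokens mirror the layout of the stock TorToiSe tokenizer; check your model's config before trusting that exact list.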


> OK, never mind, it is actually producing pretty good output now, correctly pronouncing most of the words. I retrained on a 200-hour dataset of Dutch audiobooks overnight. The voice cloning doesn't really work, and there's the occasional English accent, but that's probably because my learning rate was 1e-4 and it only trained for 2 epochs.
>
> Anyway, it's still really encouraging, and I'm glad the custom tokenizer works.
>
> For anyone interested, I put the script for generating my Dutch tokenizer below. Just give it some large text files in your target language (I gave it 3 ebooks) and it will generate one for you. Maybe it would be good to put this in the wiki, since this has come up a couple of times.

Can you tell me how to use it, please?

Reference: mrq/ai-voice-cloning#221