Recommendations for creating tokenizer.json for a specific language #187

Closed
opened 2023-04-02 03:29:08 +00:00 by pheonis · 3 comments

Hello,

I was training a model on Hindi with a dataset of 1,000 voice files. After 80 epochs, I compared the output of the 15-epoch and 80-epoch models and found no difference in quality: both failed to pronounce some Hindi words correctly, and the synthesized voices sound almost the same. So I'm thinking of stopping the training, since the model isn't improving and the run isn't getting anywhere.

Now I want to train the dataset with a custom tokenizer.json for Hindi, but I couldn't find any documentation on how to create a custom tokenizer.json for a specific language.

Understanding some of the tokenizer discussion here is also difficult for non-coders like me.

In a thread on [github](https://github.com/152334H/DL-Art-School/discussions/51), someone described generating a tokenizer.json from an ebook that contains all the letters.

Does anyone have any ideas on this? If you are trying to create a tokenizer for your own language, please share your thoughts.
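
For anyone who finds this later: the tokenizer.json these models load is a Hugging Face `tokenizers` file, so the "train it on an ebook" approach boils down to training a small BPE model on a plain-text Hindi corpus. Here is a minimal sketch; the corpus filename is a placeholder, and the vocab size and special tokens are assumptions you would need to match against whatever your model's config actually expects:

```python
# Minimal BPE tokenizer training sketch using the Hugging Face `tokenizers`
# library. "hindi_corpus.txt", the vocab size, and the special tokens are
# assumptions; check them against your model's config before training.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation first

trainer = BpeTrainer(
    vocab_size=256,  # kept small, roughly in line with the stock English tokenizer
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)
tokenizer.train(files=["hindi_corpus.txt"], trainer=trainer)
tokenizer.save("hindi_tokenizer.json")  # point the training config at this file
```

The important part is that the training text covers every character and conjunct you expect to synthesize; BPE can only merge sequences it has actually seen.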

Are you trying to create a tokenizer that handles Devanagari, or are you romanizing words *is kadar* (i.e., "to this extent")?
Author

> Are you trying to create a tokenizer that handles Devanagari, or are you romanizing words *is kadar*?

Yes, I want the tokenizer to handle Devanagari.
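
If you go the corpus route, it's worth checking first that the text actually covers the whole Devanagari block (independent vowels, matras, virama, digits). A rough sketch, with the corpus filename again just a placeholder:

```python
# List every Devanagari codepoint (U+0900..U+097F) that occurs in the corpus,
# so missing letters or signs are easy to spot. Sketch only.
import unicodedata

with open("hindi_corpus.txt", encoding="utf-8") as f:
    seen = sorted({ch for ch in f.read() if "\u0900" <= ch <= "\u097F"})

for ch in seen:
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch, 'UNNAMED')}")
```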

You can check out `models/tokenizers/japanese.json` for an example of how to do it, but because Japanese rules for syllable construction are far more limited you've got your work cut out for you if you want to handle all the edge cases like र्त्स्न्य.
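
To make that concrete: र्त्स्न्य renders as a single cluster but is actually nine codepoints (consonants joined by viramas), and that codepoint sequence is what the tokenizer sees. You can inspect it with nothing but the standard library:

```python
# Show the codepoints behind the conjunct: it is consonant + virama pairs,
# not a single symbol.
import unicodedata

for ch in "र्त्स्न्य":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# DEVANAGARI LETTER RA, DEVANAGARI SIGN VIRAMA, DEVANAGARI LETTER TA, ...
```

A BPE trained on enough Hindi text should learn the frequent consonant+virama clusters as merges; rare conjuncts will simply fall back to their individual codepoints.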