Recommendations for creating tokenizer.json for a specific language #187

Closed
opened 2023-04-02 03:29:08 +00:00 by pheonis · 3 comments

Hello,

I was training a model on Hindi with a dataset of 1,000 voice files. After 80 epochs, I compared the output of the 15-epoch and 80-epoch models and found no difference in quality: both failed to pronounce some Hindi words correctly, and the synthesized voices sound almost the same. So I'm thinking of stopping the training, since the model isn't improving and the run isn't getting anywhere.

Now I want to train the dataset with a custom tokenizer.json for Hindi, but I couldn't find any documentation on how to create a custom tokenizer.json for a specific language.

Understanding some of the tokenizer discussion here is also difficult for non-coders like me.

In a thread on [github](https://github.com/152334H/DL-Art-School/discussions/51), someone described generating a tokenizer.json from an ebook that contains all the letters.

Does anyone have any ideas on this? If you are trying to create a tokenizer for your own language, please share your thoughts.
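
For anyone who finds this later: the tokenizer.json these models load is a Hugging Face `tokenizers` file, so the "train it on an ebook" approach boils down to training a small BPE model on a plain-text Hindi corpus. Here is a minimal sketch; the corpus filename is a placeholder, and the vocab size and special tokens are assumptions you would need to match against whatever your model's config actually expects:

```python
# Minimal BPE tokenizer training sketch using the Hugging Face `tokenizers`
# library. "hindi_corpus.txt", the vocab size, and the special tokens are
# assumptions; check them against your model's config before training.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation first

trainer = BpeTrainer(
    vocab_size=256,  # kept small, roughly in line with the stock English tokenizer
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)
tokenizer.train(files=["hindi_corpus.txt"], trainer=trainer)
tokenizer.save("hindi_tokenizer.json")  # point the training config at this file
```

The important part is that the training text covers every character and conjunct you expect to synthesize; BPE can only merge sequences it has actually seen.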

Are you trying to create a tokenizer that handles Devanagari, or are you romanizing words *is kadar* (i.e., "to this extent")?
Author

> Are you trying to create a tokenizer that handles Devanagari, or are you romanizing words *is kadar*?

Yes, I want the tokenizer to handle Devanagari.
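
If you go the corpus route, it's worth checking first that the text actually covers the whole Devanagari block (independent vowels, matras, virama, digits). A rough sketch, with the corpus filename again just a placeholder:

```python
# List every Devanagari codepoint (U+0900..U+097F) that occurs in the corpus,
# so missing letters or signs are easy to spot. Sketch only.
import unicodedata

with open("hindi_corpus.txt", encoding="utf-8") as f:
    seen = sorted({ch for ch in f.read() if "\u0900" <= ch <= "\u097F"})

for ch in seen:
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch, 'UNNAMED')}")
```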

You can check out `models/tokenizers/japanese.json` for an example of how to do it, but because Japanese rules for syllable construction are far more limited you've got your work cut out for you if you want to handle all the edge cases like र्त्स्न्य.
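
To make that concrete: र्त्स्न्य renders as a single cluster but is actually nine codepoints (consonants joined by viramas), and that codepoint sequence is what the tokenizer sees. You can inspect it with nothing but the standard library:

```python
# Show the codepoints behind the conjunct: it is consonant + virama pairs,
# not a single symbol.
import unicodedata

for ch in "र्त्स्न्य":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# DEVANAGARI LETTER RA, DEVANAGARI SIGN VIRAMA, DEVANAGARI LETTER TA, ...
```

A BPE trained on enough Hindi text should learn the frequent consonant+virama clusters as merges; rare conjuncts will simply fall back to their individual codepoints.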