Recommendations for creating tokenizer.json for a specific language #187
Reference: mrq/ai-voice-cloning#187
Hello,

I was training a model for Hindi with a dataset of 1,000 voice files. After 80 epochs I compared the output of the 15-epoch and 80-epoch models and found no difference in quality: both failed to pronounce some Hindi words correctly, and the synthesized voices sound almost the same. So I'm thinking of stopping the training, since the quality is not improving and it isn't getting anywhere.

Now I want to train the dataset with a custom tokenizer.json for Hindi, but I didn't find any documentation on how to create a custom tokenizer.json for a specific language. Understanding some of the discussion about the tokenizer here is also difficult for non-coders like me.

In a thread on GitHub, someone said he generated a tokenizer.json from an ebook that contains all the letters (a sketch of that approach follows this post).

Does anyone have any ideas on this? If you are trying to create a tokenizer for your own language, please share your thoughts.
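For illustration only (not from the thread): a minimal sketch of the "train on an ebook" approach using the HuggingFace tokenizers library, which is what produces files in the tokenizer.json format. The corpus path, vocabulary size, and special-token list below are assumptions, not values taken from this repo.

```python
# Sketch: train a small BPE tokenizer on a Hindi text corpus.
# Assumes the `tokenizers` package is installed and that
# `hindi_corpus.txt` (hypothetical path) contains UTF-8 Devanagari
# text, e.g. the text of an ebook covering all the letters.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation first

trainer = BpeTrainer(
    vocab_size=512,  # guessed small vocab; TTS tokenizers tend to be small
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],  # assumed to mirror the stock tokenizer
)
tokenizer.train(files=["hindi_corpus.txt"], trainer=trainer)
tokenizer.save("hindi_tokenizer.json")  # candidate tokenizer.json replacement
```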
Are you trying to create a tokenizer that handles Devanagari, or are you romanizing words, like "is kadar"?
Yes, I want the tokenizer to handle Devanagari.
You can check out models/tokenizers/japanese.json for an example of how to do it, but because Japanese rules for syllable construction are far more limited, you've got your work cut out for you if you want to handle all the edge cases like र्त्स्न्य.
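Again as an editorial aside, not from the thread: whichever route you take, you can sanity-check a candidate tokenizer by encoding a hard conjunct cluster like the one mentioned above and looking for [UNK] fallbacks. The file path here refers to the hypothetical output of the earlier sketch.

```python
# Sketch: sanity-check a candidate tokenizer against Devanagari input.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("hindi_tokenizer.json")  # hypothetical path
enc = tok.encode("र्त्स्न्य")  # conjunct cluster cited in the thread
print(enc.tokens)  # any "[UNK]" here means the vocab misses a character
```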