Custom tokenizer #269
Not an issue, but I've been trying to finetune Tortoise for a different language (Russian), and I'm wondering if I should use a different tokenizer.json instead of the default one.
I'm an absolute newb at this whole thing.
I heard you can generate a tokenizer file from text, but I have no idea how to go about this.
Can someone please point me in the right direction?
There's an example for Japanese in ai-voice-cloning/models/tokenizers/japanese.json, and I think there are tokenizer tutorials up on huggingface.
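For reference, the huggingface `tokenizers` package can build a tokenizer.json from plain text. A minimal sketch, assuming a BPE setup like the stock file; the vocab size, special tokens, and file names below are placeholders, so check the default tokenizer.json for what the model actually expects before training with the result:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Plain BPE model split on whitespace; unk_token catches anything unseen.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# vocab_size and special_tokens are guesses -- copy the values from the
# default tokenizer.json so the finetuned model's vocabulary lines up.
trainer = trainers.BpeTrainer(
    vocab_size=256,
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)

# "transcripts.txt" is a placeholder: one line of dataset text per line.
tokenizer.train(["transcripts.txt"], trainer)
tokenizer.save("russian.json")
```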
I can't imagine you needing a new tokenizer definition for Russian. If you want to be extra sure, I believe it's Utilities > Tokenizer, and you can type in whatever sentence you want and see how it romanizes + tokenizes with the default one. If it doesn't mangle things too hard, it should be fine. The only qualm I would have is that it would try and do "merges" that might not be right for Russian. I don't think the merge issue is too big of a problem, since worst case it'll just get bruteforced out and "relearned" during finetuning.

I only needed one for Japanese because the romanization it would try and do was god awful and mucked up the pronunciation for a lot of kanji, so (at the time) I saw it was just better to provide my own definition and disable the romanizer (normalizer).
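If anyone wants to do roughly that check outside the web UI, the same tokenizer.json can be loaded directly with the huggingface `tokenizers` package. A rough sketch; the path is a guess for a default install, and this skips any cleaning/romanization step the UI runs first, so the Utilities > Tokenizer tab may show slightly different output:

```python
from tokenizers import Tokenizer

# Path is an assumption -- point it at wherever your tokenizer.json lives.
tok = Tokenizer.from_file("modules/tortoise-tts/tortoise/data/tokenizer.json")

encoded = tok.encode("Съешь же ещё этих мягких французских булок")
print(encoded.tokens)  # how the sentence gets split into tokens
print(encoded.ids)     # the token ids the model would see
```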
Thanks, I will research this.
Good to know! So far it seems to train well regardless of which tokenizer I use, but there are some syllables it has more trouble with, and I was wondering if the tokenizer could be the issue. It's probably just because those words are less common in the datasets.
I've been doing it in a slightly weird way: since I have to use Colab, I've been training the model in portions, feeding it a new, smaller dataset every day and using the previously saved model as the base each time.
I've been wondering whether that's effective, but it seems to work, improving slowly after each training session.