Custom tokenizer #269

Open
opened 2023-06-17 10:41:50 +00:00 by protomato · 3 comments

Not an issue, but I've been trying to finetune Tortoise for a different language (Russian) and I'm wondering if I should use a different tokenizer.json instead of the default one.
I'm an absolute newb at this whole thing.
I heard you can generate a tokenizer file from text, but I have no idea how to go about this.
Can someone please point me in the right direction?

There's an example for Japanese in `ai-voice-cloning/models/tokenizers/japanese.json` and I think there are tokenizer tutorials up on huggingface.
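For reference, here's a minimal sketch of generating a tokenizer file from plain text with the Hugging Face `tokenizers` library. The corpus path, vocab size, and special tokens are illustrative assumptions, not the repo's exact settings; compare against the default `tokenizer.json` (or `japanese.json`) before using the result.

```python
# Sketch: train a small BPE tokenizer from plain-text transcripts and save it as JSON.
# File names, vocab_size, and special_tokens are assumptions; match them to the
# default Tortoise tokenizer.json before finetuning against the result.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=256,  # the stock Tortoise tokenizer is small; adjust as needed
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],  # assumed; copy from the default tokenizer.json
)

# One or more UTF-8 text files of transcripts in the target language (hypothetical path)
tokenizer.train(files=["russian_corpus.txt"], trainer=trainer)
tokenizer.save("russian.json")
```

The special-token set in particular has to line up with what the model expects, so copying that part from the existing tokenizer definitions is the safer route.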
Owner

mmmmmmm

For Russian, I can't imagine you needing to use a new tokenizer definition. If you want to be extra sure, I believe it's `Utilities > Tokenizer`, and you can type in whatever sentence you want and see how it romanizes + tokenizes with the default one. If it doesn't mangle things too hard, it should be fine. The only qualm I would have is that it would try to do "merges" that might not be right in Russian. I don't think the merge issue is too big of a problem, since worst case it'll just get brute-forced out and "relearned" during finetuning.

I only needed one for Japanese because the romanization it would try to do was god-awful and mucked up the pronunciation for a lot of kanji, so (at the time) I figured it was just better to provide my own definition and disable the romanizer (normalizer).
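If you want to check it outside the UI, here's a minimal sketch of loading a tokenizer definition and seeing how it splits a Russian sentence, using the Hugging Face `tokenizers` library. The path below is an assumption, and this skips whatever cleaning/romanization the UI applies before tokenizing.

```python
# Sketch: inspect how a given tokenizer.json splits Russian text.
# The path below is an assumption; point it at the default tokenizer or your own.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("modules/tortoise-tts/tortoise/data/tokenizer.json")  # assumed path
enc = tok.encode("Привет, как дела?")  # "Hi, how are you?"
print(enc.tokens)  # the pieces/merges the tokenizer produced
print(enc.ids)     # anything unknown to the vocab should show up as [UNK]
```

If most of the sentence comes back as [UNK] or gets split character by character, that's a sign the default definition isn't covering the script and a custom one is worth trying.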

Author

> There's an example for Japanese in `ai-voice-cloning/models/tokenizers/japanese.json` and I think there are tokenizer tutorials up on huggingface.

Thanks, I will research.

> mmmmmmm

Good to know! So far, it seems to train well regardless of which tokenizer I use, but there are some syllables it has more trouble with, and I was wondering if the tokenizer could be an issue. It's probably just that those words are less common in the dataset.
I've been doing it in a slightly weird way: since I have to use Colab, I've been training the model in portions, feeding it a different small dataset every day and using the previously saved model as the base each time.
I've been wondering whether that's effective, but it seems to work; it improves slowly after each training session.
