Japanese Tokenizer Issues #346

Open
opened 2023-08-25 08:59:05 +00:00 by Jarod · 1 comment
Contributor

I've been using this for a fair while now, but I cannot for the life of me get the packaged Japanese tokenizer to work. I run into errors when I try to train with it, and I can provide the stack trace if anyone is interested.

Can anyone else verify that the Japanese tokenizer works for training?

Author
Contributor

Update:

I renumbered the key-value pairs starting from 0 to get rid of any duplicates, which resulted in some type of CUDA issue. I didn't bother looking through the error or following its suggestion, and instead reduced the total number of keys in the tokenizer's vocab list.
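For anyone trying the same renumbering, here is a minimal sketch of the idea. It assumes the vocab is a plain token-to-ID mapping (the exact layout of the packaged tokenizer file is not shown in this thread), and `renumber_vocab` and the sample tokens are hypothetical names used only for illustration.

```python
def renumber_vocab(vocab):
    """Return a copy of `vocab` with contiguous IDs 0..N-1.

    Tokens keep their relative order by old ID; tokens that shared
    a duplicate ID are ordered by token text so the result is stable.
    """
    ordered = sorted(vocab.items(), key=lambda kv: (kv[1], kv[0]))
    return {token: new_id for new_id, (token, _old_id) in enumerate(ordered)}


# Hypothetical sample: two tokens share ID 5 and there are gaps in the numbering.
vocab = {"[STOP]": 0, "[UNK]": 1, "あ": 5, "い": 5, "う": 9}
fixed = renumber_vocab(vocab)

# Every ID is now unique and the range has no gaps.
assert sorted(fixed.values()) == list(range(len(fixed)))
```

This only fixes the numbering. It does not say anything about why training failed afterwards.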

Reducing the vocab to around 220 key-value pairs enabled it to train; at around 470 it could not (maybe I'll look at it later, not sure if it's OOM). I'm currently training with only the kana.
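A rough sketch of how a vocab could be cut down to kana only, assuming the same token-to-ID mapping as above. The special token names in `SPECIAL_TOKENS` are placeholders, not necessarily the ones the packaged tokenizer uses, and "kana" here means any token made up entirely of characters from the Hiragana and Katakana Unicode blocks (U+3040 to U+30FF).

```python
# Placeholder special tokens; swap in whatever the real tokenizer defines.
SPECIAL_TOKENS = ("[STOP]", "[UNK]", "[SPACE]")


def is_kana(token):
    """True if every character is in the Hiragana or Katakana Unicode block."""
    return len(token) > 0 and all(0x3040 <= ord(ch) <= 0x30FF for ch in token)


def keep_kana(vocab, special=SPECIAL_TOKENS):
    """Keep special tokens and kana tokens, then renumber from 0 in old-ID order."""
    by_old_id = sorted(vocab.items(), key=lambda kv: kv[1])
    kept = [tok for tok, _ in by_old_id if tok in special or is_kana(tok)]
    return {tok: new_id for new_id, tok in enumerate(kept)}


# Hypothetical sample mixing romaji, kana, and kanji entries.
vocab = {"[STOP]": 0, "[UNK]": 1, "a": 2, "か": 3, "漢": 4, "カ": 5, "ka": 6}
kana_only = keep_kana(vocab)
print(len(kana_only), "entries kept")
```

Printing the entry count makes it easy to compare against the sizes that did and did not train.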

Tbh, I'm not sure exactly how the model will handle direct Japanese, which makes me wonder if that's why there was romaji in the original tokenizer... but I don't know. I guess I'll see after the model trains. At least I know where to build from, given the errors.

Reference: mrq/ai-voice-cloning#346