Japanese Tokenizer Issues #346
Reference: mrq/ai-voice-cloning#346
I've been using this for a fair while now, but I cannot for the life of me get the packaged Japanese tokenizer to work. I run into errors when trying to train with it, and I can provide the stack trace if anyone is interested.

Can anyone else verify that the Japanese tokenizer works for training?
Update:
I re-numbered the key-value pairs starting from 0 to remove any duplicates, which resulted in some kind of CUDA error. I didn't bother digging into it or following the suggestion in the error message; instead, I reduced the total number of keys in the tokenizer's vocab list.
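For anyone trying to reproduce the re-numbering step, here is a minimal sketch. It assumes a HuggingFace-style tokenizer JSON whose vocab is a plain token-to-id mapping; the function name and structure are my own, not from the repo:

```python
def renumber_vocab(vocab: dict) -> dict:
    """Re-assign ids 0..N-1 in ascending id order, dropping any token
    whose id duplicates one already seen."""
    seen_ids = set()
    kept_tokens = []
    # Walk tokens in ascending id order so the relative order is preserved.
    for token, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        if idx in seen_ids:
            continue  # skip tokens sharing an already-used id
        seen_ids.add(idx)
        kept_tokens.append(token)
    # Hand out fresh contiguous ids starting from 0.
    return {token: new_id for new_id, token in enumerate(kept_tokens)}


# Example: "b" and "b2" collide on id 2, so "b2" is dropped.
print(renumber_vocab({"a": 5, "b": 2, "b2": 2, "c": 9}))
# → {'b': 0, 'a': 1, 'c': 2}
```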
Reducing the vocab to around 220 key-value pairs allowed training to start; at around 470 pairs it could not (maybe I'll look into that later; I'm not sure if it was an OOM). I'm currently training with only the kana.
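The kana-only reduction can be sketched like this. The Unicode ranges for hiragana and katakana are standard, but the list of special tokens is a placeholder assumption — adjust it to whatever the actual tokenizer uses:

```python
def kana_only(vocab: dict, specials=("<pad>", "<unk>", "<s>", "</s>")) -> dict:
    """Keep only kana tokens (plus special tokens) and re-number from 0."""

    def is_kana(tok: str) -> bool:
        # U+3040–U+309F is the hiragana block, U+30A0–U+30FF is katakana.
        return len(tok) > 0 and all(
            "\u3040" <= ch <= "\u309f" or "\u30a0" <= ch <= "\u30ff"
            for ch in tok
        )

    kept = [tok for tok in vocab if tok in specials or is_kana(tok)]
    return {tok: new_id for new_id, tok in enumerate(kept)}


# Latin "a" is filtered out; kana and special tokens survive.
print(kana_only({"あ": 3, "a": 1, "カ": 7, "<unk>": 0}))
# → {'あ': 0, 'カ': 1, '<unk>': 2}
```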
Tbh, I'm not sure exactly how the model will handle raw Japanese input, which makes me wonder if that's why the original tokenizer used romaji... but I don't know. I guess I'll see after the model trains. At least I now know where to start working from, given the errors.