Japanese Tokenizer Issues #346


Jarod · 2023-08-25T08:59:05Z

Jarod commented

2023-08-25 08:59:05 +00:00

I've been using this for a fair while now, but cannot for the life of me get the packaged Japanese tokenizer to work. I run into errors when trying to train with it, and can provide the stack trace if anyone is interested.

Can anyone else verify that the Japanese tokenizer works for training?


Jarod commented

2023-08-25 09:57:49 +00:00

Update:

I re-numbered the key-value pairs in the tokenizer's vocab starting from 0 to get rid of duplicate IDs, which resulted in some kind of CUDA error. I didn't bother digging into it or following the suggestion in the error message; instead I reduced the total number of keys in the tokenizer's vocab list.

Cutting the vocab down to around 220 key-value pairs let it train; at around 470 it would not (maybe I'll look at it later, not sure if it's OOM). I'm currently training with only the kana.
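For anyone trying to reproduce this, here is a minimal sketch of the renumbering step plus a sanity check on the ID range. It assumes the vocab is a plain token-to-ID mapping. The helper names and the `MAX_TEXT_TOKENS` value are hypothetical placeholders, not taken from this project. Whether an ID limit is what actually caused the CUDA error above is unconfirmed; it is only one possible explanation worth ruling out alongside OOM.

```python
from typing import Dict, List

# Hypothetical placeholder: replace with the text-token count from your model config.
MAX_TEXT_TOKENS = 256


def find_duplicate_ids(vocab: Dict[str, int]) -> Dict[int, List[str]]:
    """Group tokens by ID and return only the IDs shared by more than one token."""
    by_id: Dict[int, List[str]] = {}
    for token, idx in vocab.items():
        by_id.setdefault(idx, []).append(token)
    return {idx: sorted(tokens) for idx, tokens in by_id.items() if len(tokens) > 1}


def renumber_vocab(vocab: Dict[str, int]) -> Dict[str, int]:
    """Reassign IDs to 0..N-1, keeping the original ID order.

    Ties between tokens that shared an ID are broken by the token string,
    so the result is deterministic.
    """
    ordered = sorted(vocab.items(), key=lambda kv: (kv[1], kv[0]))
    return {token: new_id for new_id, (token, _old_id) in enumerate(ordered)}


def ids_out_of_range(vocab: Dict[str, int], max_tokens: int) -> List[str]:
    """Return tokens whose ID is negative or not below max_tokens."""
    return [tok for tok, idx in vocab.items() if idx < 0 or idx >= max_tokens]


# Toy vocab with one duplicated ID (2) and a gap (3 and 4 unused).
vocab = {"[STOP]": 0, "[UNK]": 1, "あ": 2, "い": 2, "う": 5}

print(find_duplicate_ids(vocab))   # {2: ['あ', 'い']}
fixed = renumber_vocab(vocab)
print(fixed)                       # {'[STOP]': 0, '[UNK]': 1, 'あ': 2, 'い': 3, 'う': 4}
print(ids_out_of_range(fixed, MAX_TEXT_TOKENS))  # []
```

If `ids_out_of_range` returns anything for the real vocab, that would be worth checking against the model's configured token count before suspecting memory.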

Tbh, I'm not sure exactly how the model will handle raw Japanese text, which makes me wonder whether that's why there was romaji in the original tokenizer... but wakkanaina (no idea). I guess I'll see after the model trains. At least the errors give me a starting point to build from.

