Large dataset finetuning #283

Open
opened 2023-06-26 01:55:37 +00:00 by tyb0v0 · 2 comments

For a large dataset (like 10k lines), do we need to split it into multiple parts and train on them separately? Whenever I finetune on it, errors are always raised. Could anyone explain the process? By the way, I am finetuning a Chinese model, but the result with 200 samples was really bad (the output was not even Chinese). I am not sure how much data I need, or whether I did something wrong when finetuning in another language.

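If splitting really is needed, the question above can be sketched in a few lines. This is a hypothetical illustration, not part of the repo: the line format (`audio|transcript`) and the chunk size of 2000 are assumptions.

```python
# Hypothetical sketch: split a large training list (one sample per line)
# into smaller chunks so each finetuning run stays manageable.
# The "path|transcript" line format and chunk size are assumptions.

def split_dataset(lines, chunk_size=2000):
    """Yield successive chunks of at most chunk_size lines."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]

lines = [f"audio_{i}.wav|some transcript" for i in range(10_000)]
chunks = list(split_dataset(lines, chunk_size=2000))
print(len(chunks))      # 5 chunks
print(len(chunks[0]))   # 2000 lines each
```

Each chunk could then be written out as its own training list, though whether chunked training actually helps depends on what is raising the errors in the first place.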

Taking a wild guess that the problem is trying to process UTF-encoded bopomofo with the default tokenizer. You might need to look at the example in `models/tokenizers/japanese.json` and write something similar for Chinese.

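As a rough sketch of the idea, a character-level vocabulary can be built directly from Chinese transcripts and dumped to JSON. This is an assumption-heavy illustration: the actual schema expected by the files under `models/tokenizers/` may differ, so mirror the structure of `japanese.json` from the repo rather than this output.

```python
# Hypothetical sketch: collect every unique character from Chinese
# transcripts into a token->id mapping and serialize it to JSON.
# The special tokens and JSON layout here are assumptions, not the
# repo's actual tokenizer format.
import json

def build_char_vocab(transcripts, specials=("<pad>", "<s>", "</s>")):
    # Chinese has no word boundaries, so character-level tokens are a
    # simple starting point.
    chars = sorted({ch for text in transcripts for ch in text})
    return {tok: i for i, tok in enumerate(list(specials) + chars)}

samples = ["你好世界", "语音克隆"]
vocab = build_char_vocab(samples)
blob = json.dumps(vocab, ensure_ascii=False, indent=2)
print(len(vocab))  # 3 special tokens + 8 unique characters = 11
```

In practice the vocabulary would come from the full training transcript list, so that no character maps to an unknown token during finetuning.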

Honestly, I believe large datasets are way overkill: past a certain point, the extra data is either unnecessary or introduces errors. All the model is trying to do is learn various aspects of speech pronunciation and style, and the statistical relationships between those factors. In fact, what is bearing out in other kinds of models is that data quality matters more than data quantity, so it's better to identify the best, most consistent data, just enough to cover the natural variations of the speaker for the intended use case.

Reference: mrq/ai-voice-cloning#283