Large dataset finetuning #283

Open
opened 2023-06-26 01:55:37 +00:00 by tyb0v0 · 2 comments

For a large dataset (like 10k lines), do we need to split it into multiple parts and train on them separately? Whenever I finetune on it, errors are always raised. Could anyone explain the process? By the way, I am finetuning a Chinese model, but the result with 200 samples was really bad (the output was not even Chinese). I am not sure how much data I need, or whether I did something wrong when finetuning in another language.

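If splitting really is needed, the question above can be sketched in a few lines. This is a hypothetical illustration, not part of the repo: the line format (`audio|transcript`) and the chunk size of 2000 are assumptions.

```python
# Hypothetical sketch: split a large training list (one sample per line)
# into smaller chunks so each finetuning run stays manageable.
# The "path|transcript" line format and chunk size are assumptions.

def split_dataset(lines, chunk_size=2000):
    """Yield successive chunks of at most chunk_size lines."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]

lines = [f"audio_{i}.wav|some transcript" for i in range(10_000)]
chunks = list(split_dataset(lines, chunk_size=2000))
print(len(chunks))      # 5 chunks
print(len(chunks[0]))   # 2000 lines each
```

Each chunk could then be written out as its own training list, though whether chunked training actually helps depends on what is raising the errors in the first place.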

Taking a wild guess that the problem is trying to process UTF-encoded bopomofo with the default tokenizer. You might need to look at the example in `models/tokenizers/japanese.json` and write something similar for Chinese.

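As a rough sketch of the idea, a character-level vocabulary can be built directly from Chinese transcripts and dumped to JSON. This is an assumption-heavy illustration: the actual schema expected by the files under `models/tokenizers/` may differ, so mirror the structure of `japanese.json` from the repo rather than this output.

```python
# Hypothetical sketch: collect every unique character from Chinese
# transcripts into a token->id mapping and serialize it to JSON.
# The special tokens and JSON layout here are assumptions, not the
# repo's actual tokenizer format.
import json

def build_char_vocab(transcripts, specials=("<pad>", "<s>", "</s>")):
    # Chinese has no word boundaries, so character-level tokens are a
    # simple starting point.
    chars = sorted({ch for text in transcripts for ch in text})
    return {tok: i for i, tok in enumerate(list(specials) + chars)}

samples = ["你好世界", "语音克隆"]
vocab = build_char_vocab(samples)
blob = json.dumps(vocab, ensure_ascii=False, indent=2)
print(len(vocab))  # 3 special tokens + 8 unique characters = 11
```

In practice the vocabulary would come from the full training transcript list, so that no character maps to an unknown token during finetuning.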

Honestly, I believe large datasets are way overkill: past a certain point, the extra data is either unnecessary or introduces errors. All the model is trying to do is learn various aspects of speech pronunciation and style, and the statistical relationships between those factors. In fact, what is bearing out in other kinds of models is that data quality matters more than data quantity, so it's better to identify the best, most consistent data, just enough to cover the natural variations of the speaker for the intended use case.

Reference: mrq/ai-voice-cloning#283