Large dataset finetuning #283
For a large dataset (like 10k lines), do we need to split it into multiple parts and train on them separately? Whenever I finetune on the whole set, errors are always raised. Could anyone explain the process? By the way, I am finetuning a Chinese model, but the result with 200 samples was really bad (the output was not even Chinese). I am not sure how much data I need, or whether I did something wrong when finetuning for another language.
Taking a wild guess that the problem is trying to process UTF-encoded bopomofo with the default tokenizer. You might need to look at the example in models/tokenizers/japanese.json and write something similar for Chinese.

Honestly, I believe large datasets are overkill; past some point, the extra data is either unnecessary or introduces errors. All the model is trying to do is learn the various aspects of speech pronunciation and style, and the statistical relationships between those factors. In fact, what is bearing out in other kinds of models is that data quality matters more than data quantity, so it's better to identify the best, most consistent data: the minimum needed to cover the natural variation of the speaker for the intended use case.
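For what it's worth, here is a minimal sketch of one way to produce such a tokenizer file: romanize the transcripts with pypinyin and train a small BPE vocab with the HuggingFace tokenizers library, saving the result next to japanese.json. The file paths, vocab size, and special-token list are illustrative assumptions, not the repo's actual requirements; check what japanese.json declares and what the training scripts expect before relying on it.

```python
# Sketch: build a tokenizer JSON for Chinese transcripts, in the same spirit
# as models/tokenizers/japanese.json. Paths, vocab size, and special tokens
# below are assumptions -- mirror whatever japanese.json actually declares.
from pypinyin import lazy_pinyin                      # one possible romanization route
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def romanize(line: str) -> str:
    """Convert Hanzi to space-separated pinyin so the vocab stays Latin."""
    return " ".join(lazy_pinyin(line))

# Transcript text, one utterance per line (path is illustrative).
with open("training/my_voice/transcript.txt", encoding="utf-8") as f:
    lines = [romanize(l.strip()) for l in f if l.strip()]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=256,                                   # guess; keep it small
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],    # copy the real list from japanese.json
)
tokenizer.train_from_iterator(lines, trainer)
tokenizer.save("models/tokenizers/chinese.json")
```

If you go this route, the same romanization has to be applied to the transcripts you feed into training and to the text you pass at inference time, otherwise the vocab and the input won't match.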