Step by step data prep and training/finetuning guide? #244
Reference: mrq/ai-voice-cloning#244
Is there a step-by-step guide to preparing datasets and training/finetuning? The wiki describes the functions of the various buttons in gradio with some caveats, but it doesn't really give a hint beyond that.
Specifically, what inputs are you supposed to use at each step, and what are the outputs of each step? What steps do you need to take? For example, do you put raw WAV files into the voices directory and hit Transcribe and Process, or should they already be processed or arranged in some way?
In addition to learning how to prepare data from scratch and train, I'd also like to know how to use a pre-prepared dataset. I have a dataset already formatted for training in Coqui that has WAV files and a text file with matching entries. Can this be used directly with this software?
I might have a "guide" (list) somewhere in an Issues reply (although I don't think I do; it's not something I can just give a simple guide on and expect it to cover all grounds), but the tabs are pretty much ordered in the way you go about preparing a dataset: Prepare > Generate > Run.
Precisely. It's structured to already make use of a voice folder and doesn't require leaving the web UI itself (for training in TorToiSe, at the very least).
Yes-ish. If https://stt.readthedocs.io/en/latest/TRAINING_INTRO.html#training-data is right, then, if I remember correctly, you can leave the audio as-is (although I recommend NOT doing this; 16K is much too low compared to the 22K that gets used for training).
As for the text transcript, the web UI and training script (DLAS) expect
`./training/VOICENAME/train.txt`
to be an LJSpeech-dataset formatted text file. If that little link is right, you'll need to convert the CSV into that format and drop the filesize entry.
The only caveat with using your own provided transcribed dataset is that each voice file in the dataset must be over 0.6 seconds and under 11.6 seconds, as this is a hard limitation set in the training script itself (DLAS).
So when I attempt to train, I get a CUDA out-of-memory error:

```
[Training] [2023-05-20T20:59:09.668073] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.79 GiB total capacity; 2.69 GiB already allocated; 4.50 MiB free; 2.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
I adjusted the batch size down to 4, as suggested when validating the training file, and even down to 2 with a gradient accumulation size of 1, but it still gives the same error. I shrank my dataset down to 5 MB and it still gives the same error. I have a 3070 mobile with 8 GB. I've seen others with cards identical to mine, so it should work, right?
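The traceback itself suggests trying `max_split_size_mb` when reserved memory far exceeds allocated memory. One way to set it before launching the web UI; note the value of 128 here is only an illustrative starting point, not a recommendation from this thread:

```shell
# Ask PyTorch's caching allocator to split large blocks, reducing fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```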
How long are the clips in your data set?
I've taken a subset of clips that were previously segmented by the software to go through the training process again. They are on average 3 s, with some 4 and 5 s clips. The total combined size is about 6 MB. The settings were adjusted to what was recommended by the software, with a batch size of 4 and a gradient accumulation size of 2, with BitsAndBytes enabled. I've also previously tried even lower batch sizes to no avail.
Can you run `ffprobe` on the clips and post the output?

Hmm. The only thing that sticks out to me is that the audio is mono. I don't see any reason why that should be a problem, but all my samples are stereo. Can you try to reproduce the fault with stereo audio?
I've found it best, when running into CUDA memory allocation errors, to just restart everything. In fact, I run into that issue mostly when trying to do multiple tasks within the same session (i.e. train + generate, etc.), so I just restart everything between big tasks.
The clips start out as stereo but are converted to mono. The error happens regardless.
The out-of-memory error happens whether I restart or not. This is the memory usage with ai-voice-cloning started but before training. A large chunk of memory is already being used or set aside; I'm not sure which.
Enable **Do Not Load TTS On Startup** and restart.

This worked. The program seems to set aside memory for certain tasks, which you might not want it to do, and not release it. This might also be the reason why voice clip generation often fails after the first instance.