Step by step data prep and training/finetuning guide? #244
Reference: mrq/ai-voice-cloning#244
Is there a step-by-step guide to preparing datasets and training/finetuning? The wiki describes the functions of the various buttons in gradio with some caveats, but it doesn't really give a hint beyond that.
Specifically, what inputs are you supposed to use at each step, and what are the outputs of each step? What steps do you need to take? For example, do you put raw WAV files into the voices directory and hit Transcribe and Process, or should they already be processed or arranged in some way?
In addition to learning how to prepare data from scratch and train, I'd also like to know how to use a pre-prepared dataset. I have a dataset already formatted for training in Coqui that has WAV files and a text file with matching entries. Can this be used directly with this software?
I might have a "guide" (list) somewhere in an Issues reply (although I don't think I do; it's not something I can just give a simple guide on and expect it to cover all grounds), but the tabs are pretty much ordered in the way you go about preparing a dataset: Prepare > Generate > Run.
Precisely. It's structured to already make use of a voice folder and doesn't require leaving the web UI itself (for training in TorToiSe, at the very least).
Yes-ish. If https://stt.readthedocs.io/en/latest/TRAINING_INTRO.html#training-data is right, then, if I remember correctly, you can leave the audio as-is (although I recommend NOT doing this; 16K is much too low compared to the 22K that gets used for training).
As for the text transcript, the web UI and training script (DLAS) expect
`./training/VOICENAME/train.txt`
to be an LJSpeech-dataset formatted text file. If that little link is right, you'll need to convert the CSV into that format and drop the filesize entry.
The only caveat with using your own provided transcribed dataset is that each voice file in the dataset must be over 0.6 seconds and under 11.6 seconds, as this is a hard limitation set in the training script itself (DLAS).
So when I attempt to train, I get a CUDA out-of-memory error:

```
[Training] [2023-05-20T20:59:09.668073] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.79 GiB total capacity; 2.69 GiB already allocated; 4.50 MiB free; 2.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
I adjusted the batch size down to 4, as suggested when validating the training file, and even down to 2 with a gradient accumulation size of 1, but it still gives the same error. I shrank my dataset down to 5 MB and it still gives the same error. I have a 3070 mobile with 8 GB. I've seen others with cards identical to mine, so it should work, right?
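The traceback itself suggests trying `max_split_size_mb` when reserved memory far exceeds allocated memory. One way to set it before launching the web UI; note the value of 128 here is only an illustrative starting point, not a recommendation from this thread:

```shell
# Ask PyTorch's caching allocator to split large blocks, reducing fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```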
How long are the clips in your data set?
I've taken a subset of clips that were previously segmented by the software to go through the training process again. They are on average 3 s, with some 4 and 5 s clips. The total combined size is about 6 MB. The settings were adjusted to what was recommended by the software, with a batch size of 4 and a gradient accumulation size of 2, with BitsAndBytes enabled. I've also previously tried even lower batch sizes to no avail.
Can you run `ffprobe` on the clips and post the output?

Hmm. The only thing that sticks out to me is that the audio is mono. I don't see any reason why that should be a problem, but all my samples are stereo. Can you try to reproduce the fault with stereo audio?
I've found it best, when running into CUDA memory allocation errors, to just restart everything. In fact, I run into that issue mostly when trying to do multiple tasks within the same session (i.e. train + generate, etc.), so I just restart everything between big tasks.
The clips start out as stereo but are converted to mono. The error happens regardless.
The out-of-memory error happens whether I restart or not. This is the memory usage with ai-voice-cloning started but before training. A large chunk of memory is already being used or set aside; I'm not sure which.
Enable **Do Not Load TTS On Startup** and restart.

This worked. The program seems to set aside memory for certain tasks, which you might not want it to do, and not release it. This might also be the reason why voice clip generation often fails after the first instance.