Just some questions from a newbie... #69

Closed
opened 2023-03-06 03:11:52 +00:00 by nirurin · 23 comments

Was just wanting to sanity-check that I was doing things correctly!

So say I have picked a voice, added my samples, transcribed it, and it has spat out the list of wav snippets and train.txt into the 'training' folder.

I then go to 'generate config', refresh - and select that voice from the training folder.

If I then start the training, it will (eventually) output ... some other files (I haven't tested this part yet, it mentions a yaml file but I don't know if it outputs others too).

  1. Once the training is complete, do I keep the original voice files in the voice folder, and the 'training' output stays elsewhere and just improves upon the output?

  2. Does computing latents have any benefit on output quality?

  3. Should I compute latents before or after training? Or at all?

  4. If I do compute latents, do I then keep the original voice files, or are they replaced by latents?

  5. My 'voice' files are currently a few long clips (which are chopped up by the app into lots of snippets for the training). Would it be beneficial to then copy those files and replace the long wavs in the voice folder with them? (still keep them in the training folder as well). I guess this depends on if Tortoise works better with short voice files vs long ones...

Owner

If I then start the training, it will (eventually) output ... some other files

Of (some) importance to the average user, it'll spit out:

  • in ./trainings/{name}-finetune/models/, the final model that you load later
  • in ./trainings/{name}-finetune/training_state/, resumable checkpoints

I say "some", because the UI will automatically pass a list of models found in the training folder under Settings > Autoregressive Model. You don't really need to touch any of the files that get outputted unless you're doing cleanup.

Once the training is complete, do I keep the original voice files in the voice folder, and the 'training' output stays elsewhere

Yes, you'll still need your original voice samples in the ./voices/ folder to generate the latents (a brief "snapshot" that defines the traits of a voice) against the finetuned model, just like you would when generating on the base model.

Does computing latents have any benefit on output quality?

Yes, there's a "sweet spot" on how the combined voice samples gets chunked and the latents get computed. It's still something I need to explore more into to try and find a common technique for it, but you should be safe with using whatever number gets auto-suggested when picking a voice in the Generate tab. It definitely does affect how well a voice's traits gets "captured".

Should I compute latents before or after training? Or at all?

They'll get automatically regenerated if a new model is used, as they're cached per-model. You won't have to worry about regenerating them manually; there was a quality loss when a voice's latents precomputed against the base model were reused with a finetune.
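
Conceptually, "cached per-model" just means the saved latents are keyed on which model they were computed against, so loading a finetune misses the cache and forces a recompute. A sketch of that idea (the file naming here is illustrative, not necessarily how the web UI names things):

```python
# Illustrative per-model latent caching; the actual cache layout may differ.
from pathlib import Path

def latents_cache_path(voice_dir: Path, model_path: Path) -> Path:
    # keying the cache on the model file's name means switching models invalidates it
    return voice_dir / f"cond_latents_{model_path.stem}.pth"

def get_or_compute_latents(voice_dir: Path, model_path: Path, compute_fn):
    cache = latents_cache_path(voice_dir, model_path)
    if cache.exists():
        return cache                              # reuse latents made against this exact model
    latents = compute_fn(voice_dir, model_path)   # recompute against the new/finetuned model
    # torch.save(latents, cache) would persist them here
    return latents
```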

My 'voice' files are currently a few long clips (which are chopped up by the app into lots of snippets for the training). Would it be beneficial to then copy those files and replace the long wavs in the voice folder with them? (still keep them in the training folder as well). I guess this depends on if Tortoise works better with short voice files vs long ones...

In terms of using voices for generation (under the ./voices/ folder with the Generate tab), there's practically* no difference in how they're stored. You can have them in one file, or two, or use what gets spat out from preparing the dataset, though I wouldn't really do the last one, as I've found whisper is a little too liberal in how it trims.

* technically, the "ideal" would be to combine voices of similar delivery, and try and make all the voice samples equal length to avoid them getting chunked mid-phoneme, and voice chunk by how many files you have, but it's too much effort for something of probably-negligible gain.


In short, the web UI can most definitely be used without having to touch any files outside of what you put into the ./voices/ folder. It'll cover picking out your newly finetuned models, letting you select one to load as the model for generation, and keeping conditioning latents tied to a model. No additional adjustments are required for decent results.

I should probably give a simplified flowchart somewhere, but I feel even when I do a wiki cleanup, it's still pretty messy.

Author

Thank you for the quick response!

I think part of my problem was just making sure that, once I've done the training (and there are now training files in the folder for a particular voice), I can just leave them there. When I choose a voice to generate speech, I assume the web UI will automatically look to see if there is a set of finetune files for it, and use them if they exist in the training folder?

Owner

I assume the webUI will automatically look to see if there is a set of finetune files for it, and use them if they exist in the training folder?

You'll need to go into Settings and select the final model output in the Autoregressive Model dropdown (for example, it'll be something like ./models/{name}-finetune/models/#_gpt.pth).

  • if TTS is already loaded, it'll print a message in the command prompt/terminal window when it starts loading the new model, and another when it finishes.
  • if TTS isn't loaded yet, it'll use the model for subsequent TTS loads.
Author

Ahh I see, so each voice will have its own model, so I select a voice and then the appropriate model for that voice. Makes sense.

I'll report back once I've done a test run, thank you!

Author

Something I have noticed with the transcription files, is that quite often it will do something like...

Audio1 = The quick brown fox jumped
Audio2 = over the lazy dog

Transcription1 = The quick brown fox jumped over
Transcription2 = the lazy dog

As in, it'll put the right words, but the overlap will be in the wrong file. I'm correcting the ones I notice as I check through (not sure how big a deal it actually will be in the long run).

Is there a method or option of improving the whisper accuracy?

Owner

Use the larger models; you'll get more accuracy at the cost of throughput speed and more VRAM consumption. Or use the whisperx backend that got added not too long ago, which uses wav2vec2 alignment for better trimming.

You'll need to install the dependency for it, but it should be as simple as running update-force.bat, or:

```
.\venv\Scripts\activate.bat
pip3 install git+https://github.com/m-bain/whisperX
deactivate
```
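
For reference, "use the larger models" boils down to which whisper checkpoint gets loaded for transcription. The web UI drives this for you, but here's a minimal sketch with the stock openai-whisper package (the model name and file path are just placeholders):

```python
# Minimal openai-whisper usage; bigger checkpoints give better accuracy at the
# cost of VRAM and speed. Model name and path here are examples only.
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe("voices/myvoice/sample.wav")
for segment in result["segments"]:
    print(f"{segment['start']:.2f}-{segment['end']:.2f}: {segment['text']}")
```
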
Author

Tried to run a training session, set up the YAML etc., but got the following:

[Training] [2023-03-06T21:40:45.180518] from einops import rearrange, repeat, pack, unpack
[Training] [2023-03-06T21:40:45.183521] ImportError: cannot import name 'pack' from 'einops' (C:\Users\nirin\Desktop\AIVoice\ai-voice-cloning\venv\lib\site-packages\einops\__init__.py)

Owner

Yeah, there was an issue (#73, #74) with haphazardly adding in whisperx as a backend, where it breaks a package DLAS requires. The setup and update-force scripts have been fixed to have DLAS install last (need to double check, but that was how I fixed it on my machines).

Run:

```
.\venv\Scripts\activate
pip3 install -r .\dlas\requirements.txt
deactivate
```

or

```
.\venv\Scripts\activate
pip3 install -U einops --no-deps
deactivate
```

or run update-force.bat.

Author

Yeah, there was an issue (#73, #74) with haphazardly adding in whisperx as a backend, where it'll break a package DLAS requires. The setup and update-force scripts have been fixed to have DLAS install last (need to double check, but it was how I fixed it on my machines)

Run:

```
.\venv\Scripts\activate
pip3 install -r .\dlas\requirements.txt
deactivate
```

or

```
.\venv\Scripts\activate
pip3 install -U einops --no-deps
deactivate
```

or run update-force.bat.

Aha, thank you, I just ran update-force.bat as I found you'd mentioned this in another thread a couple hours ago :)

I have a dataset that whisper split into 155 files, so I set it for 1000 epochs of 155 (gradient batch 5). Does that seem like an appropriate setup in your opinion?

Seems to be using 18 GB out of my 24 GB of VRAM.

Owner

Does that seem like an appropriate setup in your opinion?

mmm, yeah, with the default LR of 1e-5 it should be at the same point (twice the size at half the requested epochs should have the same iteration count) as a model I baked overnight (68 lines, bs=68, gradaccum=34, lr=1e-5, 2000 epochs): (image attached)

I don't believe the gradient accumulation size has any bearing on the training itself; the batch size has more influence over it. While gradient accumulation does reduce your VRAM use, it seems to also bump up your system RAM (at least in my testing, where I'm constantly triggering OOM killers and sometimes outright crashing my system).
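
If it helps to sanity-check the numbers: total optimizer steps come out to roughly epochs × ceil(dataset lines ÷ batch size), and gradient accumulation only splits each batch into smaller micro-batches to fit in VRAM. A quick back-of-the-envelope sketch (the values are illustrative, not a recommendation):

```python
# Back-of-the-envelope iteration math.
import math

def total_iterations(epochs: int, dataset_lines: int, batch_size: int) -> int:
    """One iteration = one optimizer step over a full batch."""
    return epochs * math.ceil(dataset_lines / batch_size)

print(total_iterations(1000, 155, 155))  # 1000 steps: batch size == dataset size, 1 step/epoch
print(total_iterations(500, 310, 155))   # also 1000 steps: twice the data at half the epochs
```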

Author

Seems like it's only going to take about 2 hours, so I'm a little concerned I have something set 'too easy' lol. Though I mostly used the default options

Author

So the training has currently taken up about 400 GB of space on the drive (luckily I have the space), but I'm wondering if this is how much will need to be stored long-term for this training to work? Or will it end up with a smaller set of files when it's completed?

Author

Hmm, the training seems to have gotten to 100.1% and... shows no sign of actually coming to a stop? I had expected it to send a console message or something to say that it was completed?

Owner

So the training has currently taken up about 400gb of space on the drive (luckily I have the space), but I'm wondering if this is how much will need to be stored long-term for this training to work? Or will it end up with a smaller set of files when it's completed?

Those are just previous save points you can revert to (states) or compare against over time (models). You're safe to delete all but the last one.
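
If you do want to reclaim the space, a cleanup could be as simple as deleting everything but the newest file in each of those folders. A rough sketch (paths and file patterns are illustrative; double-check before deleting anything):

```python
# Keep only the most recent checkpoint in a folder, delete the older ones.
# Folder paths and patterns below are examples, not the exact layout.
from pathlib import Path

def prune_all_but_latest(folder: str, pattern: str) -> None:
    files = sorted(Path(folder).glob(pattern), key=lambda p: p.stat().st_mtime)
    for old in files[:-1]:   # everything except the newest
        old.unlink()

# prune_all_but_latest("./trainings/myvoice-finetune/training_state/", "*.state")
# prune_all_but_latest("./trainings/myvoice-finetune/models/", "*.pth")
```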

Hmm. the training seems to have gotten to 100.1% and ... shows no sign of actually coming to a stop? I had expected it to send a console message or something to say that it was completed?

There's a weird regression that happened to me last night. I'm not too sure what causes it, but the training process will finish and release its resources, yet not close stdout, so the web UI will still hang and wait for further prints. There are also some other oddities where it will train a little more after what should be the final save, but I don't know.

I'll have to explore either just terminating the process when it reaches the last save point, or modifying DLAS to print a "training complete" message and having the web UI look for that as the cue to stop training.
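
The second option would look something like watching the trainer's stdout for a sentinel line. A rough sketch (the sentinel string and command are placeholders, not what DLAS prints today):

```python
# Read the training subprocess's output line-by-line and stop once a sentinel appears,
# instead of waiting on a stdout that never closes.
import subprocess

def run_training(cmd: list, sentinel: str = "training complete") -> None:
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, bufsize=1)
    try:
        for line in proc.stdout:
            print(line, end="")           # mirror the trainer's output
            if sentinel in line.lower():
                proc.terminate()          # don't hang on a process that won't close stdout
                break
    finally:
        proc.wait()
```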

Author

Ahh ok I see, I probably set the save states too frequently anyway so I ended up with loads of them.

However the final result is.... unfortunately pretty bad! Much worse than just using the voice without any training at all lol.

http://sndup.net/jrrb

It didn't even generate the whole sentence (the sentence it was meant to make has a second half it didn't even attempt).

Owner

Right, the final loss value's too high; you'll need to train it a lot more. Although I guess how fast it finetunes is also a matter of how nice the dataset is, even with similar metrics.

Author

Right, the final loss value's too high, you'll need to train it a lot more, although I guess it's a matter of how nice the dataset is with how fast it'll finetune too, even with similar metrics.

Ahh ok, I did think it was a bit quick. Is there a rough value I should aim for? I see your GPT is 0.5 (I assume GPT is the value that matters, as my 'text' value started lower than yours, so I assume that one's not particularly relevant?)

I'll set the training to carry on for the night :)

Author

Sorry for all the questions!

When I reload the finetuned dataset and hit 'resume/import', it reloads all the correct settings, but the dropdown for 'source model' changes to 'null', which seems incorrect?

Owner

Is there a rough value I should aim for?

To reiterate from the wiki:

However, there doesn't seem to be a "one-size-fits-all" value for what loss rate you should aim at. I've had some finetunes benefit a ton more from sub 0.01 loss rates, while others absolutely fried after 0.5 (although, it entirely depends on how low of a learning rate you have, rather than haphazardly quick-training it).

It depends.

but the dropdown for 'source model' changes to 'null' which seems incorrect?

It's fine. The source model gets ignored if a resume state is supplied to the configuration generator. It'll end up as null if it loads from a YAML that already was a resume, since no source model gets specified.
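
In other words, the logic being described is roughly this (a sketch, not the actual config generator's code; the field names are illustrative stand-ins for whatever the YAML actually uses):

```python
# Sketch of "source model is ignored when a resume state is supplied".
def pick_model_fields(source_model, resume_state):
    if resume_state:
        # resuming: the checkpoint carries the weights, so no source model is written,
        # which is why it shows up as null when re-importing that YAML
        return {"resume_state": resume_state, "source_model": None}
    return {"source_model": source_model, "resume_state": None}
```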

Author

Is there a rough value I should aim for?

To reiterate from the wiki:

However, there doesn't seem to be a "one-size-fits-all" value for what loss rate you should aim at. I've had some finetunes benefit a ton more from sub 0.01 loss rates, while others absolutely fried after 0.5 (although, it entirely depends on how low of a learning rate you have, rather than haphazardly quick-training it).

It depends.

but the dropdown for 'source model' changes to 'null' which seems incorrect?

It's fine. The source model gets ignored if a resume state is supplied to the configuration generator. It'll end up as null if it loads from a YAML that already was a resume, since no source model gets specified.

Ok, thank you yet again, you've been so amazing.

I just wanted to double check things, before I spend the night training against a null model that doesn't actually do anything haha.

Author

http://sndup.net/xsyh

This is the current output at around 4000 epochs, with the current graph in the image below. I'm not sure if I'm doing something wrong, or if I simply haven't done enough training yet for this particular dataset.

There are certainly some words in there, but it's very garbled.

Owner

This might be related to #70, where the other DLAS repo had a compatibility fix for models that are going to be used for TorToiSe (I will admit I should have kept an eye on it for anything I might have missed; I don't think I've looked at it since picking the YAML in the beginning). I haven't had a chance to check, as I had my two free machines experimenting with CosineAnnealingLRs, but I might let a 6800XT train again while I leave my 2060 for other tests.

I'd update and regenerate the configuration and train from scratch, as I'm pretty sure the compat. flags won't retroactively fix previous finetunes.

Owner

Yeah, that actually seemed to have fixed shitty output for some voices. (image attached)

https://vocaroo.com/101fw74rhfgz

Where last week it was consistently the worst output for Mary/Maria from Silent Hill 2.

Sort of sucks, since I need to re-train what I did keep around (but not so bad, since this took maybe 10 minutes to bake).

I most definitely suggest starting from scratch and remaking the training configuration.
