Requesting tips to make inference as fast as possible #225

Open
opened 2023-05-01 10:10:44 +00:00 by FrioGlakka · 6 comments

I've found this repository to be suitably fast for my use case (automatically generating 30-second sounds practically non-stop), but it would be perfect if it could generate them just a little bit faster.

I've already turned the settings down as low as they can go while still producing acceptable results. But I'm wondering if there are some things I could do that aren't obvious to me (I'm just a user here and don't know how all this works).

For example, would manually changing the sample batch size give me an advantage or disadvantage in terms of speed? I understand it's mainly for VRAM usage, but does less VRAM usage also mean slower inference?

I just really need to squeeze out those last few seconds, and I'm a bit too inexperienced to know how to approach that, or whether I can at all.

For example, I've read that the fast-tortoise repo uses a different diffusion sampler, which speeds things up. Can we apply that knowledge to this repo? I'm a bit confused about this because that repo's readme says to just use this repo instead.
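Not an answer from the repo itself, but a toy cost model can help reason about the two knobs in question, assuming inference time scales roughly linearly with the autoregressive sample count and the diffusion iteration count. The per-unit constants below are made-up illustrations, not measurements:

```python
def estimated_seconds(num_samples: int, diffusion_iters: int,
                      sec_per_sample: float = 0.9,
                      sec_per_iter: float = 0.05) -> float:
    """Toy linear cost model for one generation: autoregressive
    sampling contributes a per-sample cost, the diffusion decoder a
    per-iteration cost. Both constants are illustrative guesses."""
    return num_samples * sec_per_sample + diffusion_iters * sec_per_iter

# Cutting samples and iterations shrinks both terms:
baseline = estimated_seconds(16, 128)
fast = estimated_seconds(2, 64)
```

In practice the scaling won't be perfectly linear (batching, fixed model-load overhead), but it suggests why cutting sample count and iterations saves far more time than adjusting the sample batch size, which, as suspected above, mainly trades VRAM for throughput rather than changing total work.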


Fine-tune a model, ~50-200 epochs.
If you have a large dataset, rename its audio folder so it doesn't get seen by the UI. Select 10-50 audio samples from the dataset's audio folder and put them in the voices folder corresponding to the voice name you just trained.
Averaging latents over the entire dataset takes longer and seems to perform worse than selectively sampling a handful of audio files and applying them against the fine-tuned model.
Refresh voices
Calculate latents
Set samples to 2 and iterations between 64 and 256
Click experimental and try condition-free

The sampler is a huge bottleneck, and fine-tuning lets you sample from a smaller domain to get the same quality outcome... sometimes... mostly... if you're lucky.

Recalculate the latents if the voice sounds slightly off. Note that the model will pick up any noise in the samples, and with a small sampling batch it can't filter that out, so noise will be emphasized.
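The select-and-copy step above can be sketched as a small script. The folder layout, names, and sample count here are assumptions based on this thread, not the actual UI paths:

```python
import random
import shutil
from pathlib import Path

def pick_voice_samples(dataset_audio: str, voices_dir: str,
                       voice_name: str, n: int = 25, seed: int = 0) -> int:
    """Copy a random handful of dataset clips into the voices folder,
    as suggested above (10-50 clips instead of the whole dataset).
    Returns the number of files copied."""
    wavs = sorted(Path(dataset_audio).glob("*.wav"))
    chosen = random.Random(seed).sample(wavs, min(n, len(wavs)))
    dest = Path(voices_dir) / voice_name
    dest.mkdir(parents=True, exist_ok=True)
    for wav in chosen:
        shutil.copy2(wav, dest / wav.name)
    return len(chosen)
```

Hand-picking the cleanest clips instead of sampling randomly would likely work even better, given the note above about noise being emphasized.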


> Fine-tune a model, ~50-200 epochs.

When training an American or British voice with a 10-minute audio file (approx. 200 lines):
How many epochs do you recommend?
What total loss and mel loss do you target?
What learning rate do you prefer?
From my experience, I have trained models at a 0.0001 LR to a mel loss of 0.2 to 0.5. Mostly the generated audio from the fine-tuned models is good, but the problem comes when I try to convert longer text. Then I get repeats at the ends of sentences, sometimes garbles and artifacts, and sometimes some sentences are completely ignored. All of these issues disappear when I use the default autoregressive model.


> If you have a large dataset, rename its audio folder so it doesn't get seen by the UI. Select 10-50 audio samples from the dataset's audio folder and put them in the voices folder corresponding to the voice name you just trained.
> Averaging latents over the entire dataset takes longer and seems to perform worse than selectively sampling a handful of audio files and applying them against the fine-tuned model.
> Refresh voices
> Calculate latents

This doesn't seem to work. The UI is still looking for the audio folder inside the training directory. I got this error when I tried to compute the latents after putting audio samples into the corresponding folder under the voices folder:

`Something went wrong Failed to open the input "./training/hayls-v2/audio/0.wav" (No such file or directory).`
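For what it's worth, that error is consistent with a lookup order where the dataset's audio folder under `./training` shadows the clips under `./voices`. A hypothetical sketch of that rule, inferred from this thread rather than from the actual source:

```python
from pathlib import Path

def resolve_voice_audio(root: str, voice: str) -> Path:
    """Mimic the lookup order the error suggests: a dataset audio
    folder under ./training shadows hand-picked clips under ./voices.
    (Inferred behaviour, not the webui's real code.)"""
    training_audio = Path(root) / "training" / voice / "audio"
    if training_audio.is_dir():
        # This branch is why renaming/hiding the folder matters.
        return training_audio
    return Path(root) / "voices" / voice
```

If this is roughly right, the fix is to make the `./training/<voice>/audio` path stop existing (rename it), so resolution falls through to the voices folder.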

Author

> The sampler is a huge bottleneck, and fine-tuning lets you sample from a smaller domain to get the same quality outcome... sometimes... mostly... if you're lucky.

Thanks for the lengthy response.

I'm completely new to Tortoise and don't have much knowledge of how it all works. I'm currently using the model that came with this repo, and I have 15+ voice folders.

I've used a fine-tuned model from someone else before, but it was fine-tuned to clone one specific voice. I would like to keep using all 15 of my voices. Can I fine-tune the model with different speaker voices and then use those same voices for inference with the fine-tuned model? I ask mainly because I've only ever seen fine-tuned models made for one specific voice.


> > If you have a large dataset, rename its audio folder so it doesn't get seen by the UI. Select 10-50 audio samples from the dataset's audio folder and put them in the voices folder corresponding to the voice name you just trained.
> > Averaging latents over the entire dataset takes longer and seems to perform worse than selectively sampling a handful of audio files and applying them against the fine-tuned model.
> > Refresh voices
> > Calculate latents
>
> This doesn't seem to work. The UI is still looking for the audio folder inside the training directory. I got this error when I tried to compute the latents after putting audio samples into the corresponding folder under the voices folder:
>
> `Something went wrong Failed to open the input "./training/hayls-v2/audio/0.wav" (No such file or directory).`

Move the metadata files out and/or restart the webui
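A sketch of "move the metadata files out", assuming the yaml and txt files mentioned later in the thread are the ones in question; the `_hidden` folder name and the exact patterns are made up for illustration:

```python
import shutil
from pathlib import Path

def hide_dataset(training_root: str, voice: str) -> None:
    """Stash the dataset audio folder and metadata files so the UI
    stops resolving ./training/<voice>/audio. The *.yaml / *.txt
    patterns are guesses based on this thread; adjust as needed."""
    voice_dir = Path(training_root) / voice
    hidden = voice_dir / "_hidden"
    hidden.mkdir(exist_ok=True)
    audio = voice_dir / "audio"
    if audio.is_dir():
        audio.rename(hidden / "audio")
    for pattern in ("*.yaml", "*.txt"):
        for meta in list(voice_dir.glob(pattern)):
            shutil.move(str(meta), str(hidden / meta.name))
```

Moving everything into a sibling folder (rather than deleting) keeps the dataset intact, so it can be restored for retraining later.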


> > > If you have a large dataset, rename its audio folder so it doesn't get seen by the UI. Select 10-50 audio samples from the dataset's audio folder and put them in the voices folder corresponding to the voice name you just trained.
> > > Averaging latents over the entire dataset takes longer and seems to perform worse than selectively sampling a handful of audio files and applying them against the fine-tuned model.
> > > Refresh voices
> > > Calculate latents
> >
> > This doesn't seem to work. The UI is still looking for the audio folder inside the training directory. I got this error when I tried to compute the latents after putting audio samples into the corresponding folder under the voices folder:
> >
> > `Something went wrong Failed to open the input "./training/hayls-v2/audio/0.wav" (No such file or directory).`
>
> Move the metadata files out and/or restart the webui

What are the metadata files, specifically? I've hidden the audio folder and moved the various yaml and txt files out of the way, but it's still looking for the training audio folder.

Reference: mrq/ai-voice-cloning#225