Requesting tips to make inference as fast as possible #225
Reference: mrq/ai-voice-cloning#225
I've found this repository to be suitably fast for my use case (automatically generating 30-second sounds practically non-stop), but it would be perfect if it could generate them just a little bit faster.
I've already set the settings as low as they can go while still producing acceptable results, but I'm wondering if there are things I could do that aren't obvious to me (I'm just a user here and don't know how all of this works).
For example, would manually changing the sample batch size give me an advantage or disadvantage in terms of speed? I get that it's mainly for VRAM usage, but does less VRAM usage also mean slower inference?
I just really need to squeeze out those last few seconds, and I'm a bit inexperienced as to how I can approach that, or whether I can at all.
For example, I've read that the fast-tortoise repo uses a different diffusion sampler which speeds things up. Can we apply that knowledge to this repo? I'm a bit confused about this, because that repo's README says to just use this repo instead.
Fine-tune a model, ~50-200 epochs.
If you have a large dataset, rename the dataset's audio folder so it doesn't get seen by the UI. Select 10-50 audio samples from the dataset audio folder and put them in the voices folder corresponding to the voice name you just trained.
Averaging latents over the entire dataset takes longer and seems to perform worse than selectively sampling a handful of audio files and applying them against the fine-tuned model.
Refresh voices
Calculate latents
Set samples to 2 and iterations between 64 and 256
Click Experimental and try conditioning-free
The sampler is a huge bottleneck, and fine-tuning lets you sample from a smaller domain to get the same quality outcome... sometimes... mostly... if you're lucky.
Recalculate latents if the voice sounds slightly off. Note that the model will pick up any noise in the samples, and with a small sampling batch it can't filter that out, so noise will be emphasized.
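The sample-selection step above (pick 10-50 clips from the renamed dataset audio folder and drop them into the matching voices folder) can be sketched in Python. The paths in the comment are hypothetical placeholders, not the actual layout of any particular install:

```python
import random
import shutil
from pathlib import Path

def pick_voice_clips(dataset_audio, voice_dir, n=20, seed=None):
    """Copy a random handful of clips from the (renamed) dataset audio
    folder into the voices folder, per the steps above."""
    dataset_audio, voice_dir = Path(dataset_audio), Path(voice_dir)
    voice_dir.mkdir(parents=True, exist_ok=True)
    clips = sorted(dataset_audio.glob("*.wav"))
    # 10-50 clips is the range suggested above; n=20 is an arbitrary default.
    subset = random.Random(seed).sample(clips, min(n, len(clips)))
    copied = []
    for clip in subset:
        dest = voice_dir / clip.name
        shutil.copy2(clip, dest)
        copied.append(dest)
    return copied

# Hypothetical paths -- adjust to your own install layout:
# pick_voice_clips("./training/myvoice/audio_hidden", "./voices/myvoice", n=20)
```

After copying, refresh voices and recalculate latents in the UI as described above.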
When training an American or British voice with a 10-minute audio file (~200 lines):
How many epochs do you recommend?
What total loss and mel loss do you target?
What learning rate do you prefer?
From my experience, I have trained models at 0.0001 LR with a mel loss value of 0.2 to 0.5. Mostly the generated audio outputs from the fine-tuned models are good, but the problem comes when I try to convert longer text. What happens then is I get repeats at the ends of sentences, sometimes garbles and artifacts, and sometimes some sentences are completely ignored. All of these issues disappear when I use the default autoregressive model.
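A common workaround for the long-text repeats and garbles described above (a general practice with TorToiSe-style models, not something specific to this thread) is to split the input into sentence-sized chunks and synthesize each chunk separately. A minimal sketch, assuming a naive regex sentence split; the 200-character limit is an arbitrary example value:

```python
import re

def split_into_chunks(text, max_chars=200):
    """Split text on sentence boundaries, then pack sentences into
    chunks no longer than max_chars, so each TTS call stays short.
    A single sentence longer than max_chars is kept as one chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be fed to the model in turn and the resulting audio concatenated, which tends to avoid the end-of-sentence repetition on long inputs.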
This doesn't seem to work. The UI is still looking for the audio folder inside the training directory. I got this error when trying to compute the latents after putting the audio samples into the corresponding folder under the voices folder:
Something went wrong Failed to open the input "./training/hayls-v2/audio/0.wav" (No such file or directory).
Thanks for the lengthy response.
So I'm completely new to Tortoise and don't have much knowledge of it all. I'm currently using the model that came with this repo, and I have 15+ voice folders.
I've used a fine-tuned model from someone else before, but it was fine-tuned to clone one specific voice. I would like to keep using all 15 of my voices. Can I fine-tune the model with multiple speakers' voices, and then use those same voices for inference with the fine-tuned model? I ask mainly because I've only ever seen fine-tuned models made for one specific voice.
Move the metadata files out and/or restart the webui
What are the metadata files specifically? I've hidden the audio folder and moved the various YAML and TXT files out of the way, but it's still looking for the training audio folder.