Results, Retrospectives, and Recommendations #253
I've been working with this repo for the past few weeks and have successfully cloned my target voice. I largely followed @nanonomad's advice from his YouTube video and did most of the training and generation between Google Colab and Paperspace.
Last year, I cloned my target voice using BenAAndrew's Voice Cloning App, but the results were quite poor, probably in large part due to poor dataset preparation prior to cloning. Similarly with this repo, my first attempt yielded better, but still inadequate, results for my intended use case (audiobook narration). While my dataset audio was professional quality and single-speaker, I did not prepare it appropriately to take full advantage of it.
My first model used ~300 audio files indiscriminately cut every 7 seconds and uploaded as stereo 44.1 kHz WAVs, trained following @nanonomad's recommendations:
First model:
Google Colab environment
~300 arbitrary 7 s clips, stereo 44.1 kHz
prepared with Whisper base
200 epochs
default learning rate and ratios
learning rate scheme: cosine annealing = 4
batch size 64
gradient accumulation 32
save frequency 50
voice latents generated from the entire training set
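For a rough sense of scale, those settings work out to on the order of a thousand optimizer steps. Here is a back-of-the-envelope sketch in Python, assuming one optimizer step per batch of 64 clips; the web UI's exact accounting for gradient accumulation may differ:

```python
import math

# Rough step accounting for the first run (assumption: one optimizer step
# per batch of `batch_size` clips; the trainer's real bookkeeping may differ).
num_clips = 300
batch_size = 64
epochs = 200

steps_per_epoch = math.ceil(num_clips / batch_size)  # ~5
total_steps = steps_per_epoch * epochs               # ~1000

print(f"~{steps_per_epoch} steps/epoch, ~{total_steps} steps total")
```

That ballpark is handy for comparing against the step counts discussed further down the thread.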
The model quality was better than my previous results, but the speaker's accent was highly variable, as were the age and gruffness of the voice, between random seeds for the same prompt. Also, the speech itself carried the signature unpleasant resonance you hear in most AI voices. It's not the robot voice, but a sharp, painful quality caused by frequency resonance.
For my second attempt I started over and preprocessed the dataset by splitting the left channel out into a mono track and resampling it to 22,050 Hz. This alone saved a lot of time later when moving and processing the data in the notebook. Additionally, I sliced all the data myself into as many 10 s chunks as possible. Another recommendation I would make here is to premaster the audio to whatever degree you can, because even subtle noises will be picked up and regenerated later after training. Finally, in preparing the dataset within the web UI, I opted for whisper-medium for just that extra bit of improvement.
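For anyone who wants to script that preprocessing, here is a minimal sketch using pydub (my own choice of library, not something the repo requires) with made-up folder names. Note it slices at fixed intervals, whereas slicing by hand on phrase boundaries is what I actually did and recommend:

```python
import os
from pydub import AudioSegment  # requires ffmpeg on the PATH

SRC_DIR = "raw_wavs"       # hypothetical folder of stereo 44.1 kHz WAVs
OUT_DIR = "dataset_wavs"   # hypothetical output folder of mono 22.05 kHz chunks
CHUNK_MS = 10_000          # at most 10 s per clip

os.makedirs(OUT_DIR, exist_ok=True)

for name in os.listdir(SRC_DIR):
    if not name.lower().endswith(".wav"):
        continue
    audio = AudioSegment.from_wav(os.path.join(SRC_DIR, name))
    left = audio.split_to_mono()[0]      # keep only the left channel
    left = left.set_frame_rate(22050)    # resample to 22,050 Hz
    for i in range(0, len(left), CHUNK_MS):
        chunk = left[i:i + CHUNK_MS]     # fixed-length slice (hand-slicing is better)
        if len(chunk) < 4_000:           # drop fragments shorter than ~4 s
            continue
        out_name = f"{name[:-4]}_{i // CHUNK_MS:03d}.wav"
        chunk.export(os.path.join(OUT_DIR, out_name), format="wav")
```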
Second model:
Paperspace environment
~225 clips, 4-10 s, 22.05 kHz, left channel split out to mono
prepared with Whisper medium
200 epochs
default learning rate and ratios
learning rate scheme: cos. annealing=4
batch size 64
gradient accumulation 32
save frequency 15
voice latents generated from half of the training data
My results were closer this time. However, the model still wasn't consistent in tone or character overall, and did not possess the exact quality to be a dead-on clone.
My thought was that, because it was only ~35 minutes of dataset audio, I needed to train it either for more epochs or with a larger dataset. In my trials with both approaches, each produced noticeably worse output. More epochs on the second dataset only skewed it towards sounding robotic, while the larger dataset did not have equally good audio to add, so that model was also of lesser quality.
By luck, though, I made an error when generating audio samples: I forgot to change the voice folder after switching the autoregressive model back to the second one. In that folder I had roughly ~20 clips of some of the best audio, which, in combination with the best-trained model, has produced extremely high quality outputs, both in matching the target speaker's vocal tones and prosody and in having fewer artifacts as well. I attribute the lifelike pronunciation to the general speed, tone, and diction coming from the trained model. Too much audio, even of the same quality, seems to rapidly degrade the quality of the voice latents.
Finally, I haven't dug as much into the actual settings for generation. Again, following @nanonomad's example, I shoot for 2 samples when generating, and in my particular case I max out at 512 iterations because I want the cleanest audio output that I won't have to do much work on. I'm using this voice to narrate books, and as few artifacts as possible really helps. Because the results have been so good from the model, I find I can comfortably generate 1-2 minute lengths of speech (10-20 minutes of generation time) with very little need for detail rework, other than bringing it in line with audiobook standards. Further, I'd recommend identifying your best outputs and reusing the seed(s) to limit your bad generations; that works for me. Also, depending on the average length of your training clips, you should aim for each individual generated line to be around that long, since consistent training clip length means the model performs best at similar lengths. And, oddly enough, the model seems to lose quality when exceeding ~15-20 lines overall, even when they are each generated as separate prompts. There must be carryover between them, because more lines in the prompt, up to a certain point, will actually create a very convincing, emotional sound in the generation.
I think that covers everything I can think of to share. I'm now working on a different speaker and optimizing my prompt generation so I can more or less one-shot large blocks of text.
Key takeaways:
Prep your dataset as much as possible
Better to train on less of the best than more of the worse
Use your training samples to target the rough speed and speaking style
Prepare your latent samples with the actual voice sound in mind
Depending on your model, large lines of text can allow the voice to really develop character
Once you have acceptable outputs, reuse the best seeds
Look into remastering your audio before and after to get rid of bad artifacts
Speculations and ideas:
There's probably a ceiling of marginal quality improvement as you raise generation iterations. For me, I see big jumps in quality from 64->128->256 iterations. Running at 2 samples and 512 iterations generally takes 5x-10x the speech length; I need to see if there is a happy medium.
Take a well-trained, consistent model and use its output to train a faster TTS model or platform (RVC, so-vits-svc-fork, Coqui TTS's suite of models).
And/or retrain a model on its own synthetic corpus
Train a model on only sung audio and see what happens
Simple Audio remastering tools:
vocalremover.org (great for first pass)
Adobe Enhance Speech (great but tends to distort pitch)
Castofly (better than adobe imo)
UVR (best for removing music)
Spectrum matching to make outputs sound more consistent
It sort of depends on what you need to clean up, but there's a tool for every situation. In general, the paid services are all lacking compared to what's either commercially free or already open source. I haven't tried it yet, but it might be possible to salvage some bad gens just by EQing them properly, saving time and compute cost. Most of the magic is training on super high quality clips that are around 7-10 s long; training exclusively on those might be better than including shorter clips. My outputs are usually one-shot, good to go after artifact cleanup, then EQ and volume balancing.
Yes, as usual, I have found the quality of the training data is very important. A test model I made showed no considerable differences in output quality whether it was using 12 samples or 50, or 40 iterations or 400. The best results I've had have been from training data with similar context, and therefore similar tones and emphasis. If the speaker was talking about a number of different things in different ways, that seems to create inconsistency in the outputs, which makes sense: highly variable and fluctuating in = highly variable and fluctuating out.
One thing I haven't tried (yet) is combining models and latents from different speakers. Can I get the prosody of one speaker in the voice of another?
Edit - Some other notes:
This may vary depending on your dataset and choice of samples for latents... I'm getting better outputs when limiting the total lines generated in one go to 10-15 lines, with no more than 150 characters per line, ideally 100-120 characters. I do believe that even though each line is a separate generation, the entire prompt influences how each line is said. Too many or too few lines and characters, relative to your training dataset samples, will create worse quality output. Too many and there's too much data to be predicting on; too few and there's not enough data to make intelligent inferences. For my dataset, at least, the sweet spot is as listed.
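If you want to automate staying inside those limits instead of counting characters by hand, something like this rough sketch works. The character and line caps are the numbers above; the sentence-splitting approach and the input file name are my own assumptions:

```python
import re
from textwrap import wrap

MAX_CHARS_PER_LINE = 120   # sweet spot for my dataset; tune for yours
MAX_LINES_PER_PROMPT = 15  # total lines per single generation

def split_into_prompts(text):
    """Split prose into generation-sized lines, then batch them into prompts."""
    # Split on sentence-ending punctuation, keeping the punctuation.
    sentences = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", text.strip())
    lines = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        # Long sentences get wrapped at word boundaries to the per-line limit.
        lines.extend(wrap(sentence, MAX_CHARS_PER_LINE))
    # Group the lines into prompts of at most MAX_LINES_PER_PROMPT lines each.
    return [lines[i:i + MAX_LINES_PER_PROMPT]
            for i in range(0, len(lines), MAX_LINES_PER_PROMPT)]

prompts = split_into_prompts(open("chapter.txt").read())  # hypothetical input file
for n, prompt in enumerate(prompts, 1):
    print(f"--- prompt {n} ({len(prompt)} lines) ---")
    print("\n".join(prompt))
```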
It would be cool if we could provide context lines of speech that aren't generated but are nonetheless tokenized to influence the output of the target line. I'm not sure if wrapping them in brackets would convey it the same way. Maybe if there was a [skip] keyword of some kind.
I have a few questions for you or whoever else may be lurking.
I see you recommend accuracy over a larger dataset but do you still have a recommended dataset size?
Do you have a recommended size/recommended percentage for the validation set? Can you just make a validation set by setting aside some data from what you were going to use for the training? Or does the validation data need to be 'special' in some way?
Do you recommend a length of 4-10 seconds for each clip?
Honestly, I wing it using the recommended settings more or less and get good results. I've found no considerable difference between 1000 steps and 2500, so it seems to be an efficient training method. I usually use around 5-10 minutes of clean audio as 22,050 Hz WAV files, never validate as it just seemed to complicate things, and let Whisper do its thing and auto-slice the segments. I make sure the audio chosen is of a level nature, e.g. not picking 3 minutes of rants or disjointed clips. However, if you wanted to make some ranty models that could be the way to go, but I like the audiobook/lecture long-form style.
For generating I use 10 s clips from my dataset and, as I mentioned before, I continue to find no major difference with Use Original Latents Method (AR) and Use Original Latents Method (Diffusion) between using 4 samples or 50, and 40 steps or 100. This may be due to DDIM being a solid sampler; I'm a fan of it in Stable Diffusion, where it usually produces decent results around 40-50 steps, so maybe that's what's happening here. It also seems the model does most of the heavy lifting and emotion, and the finer tuning settings make the biggest difference once you are in the efficient range of iterations. My model could also be overfit, I suppose, though I've had similar results from 1000 steps as from 2500.
I've been quite happy with 6 samples and 40 iterations, DDIM, temperature high (0.7-1), Top P low (0.5-0.1), CVVP 0. For some reason the P sampler was only doing half the sentence... until now, when I reduced cond-free K to a low value and it sounds quite good.
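For anyone running outside the web UI, here's a rough sketch of how those settings map onto the upstream tortoise-tts API. The parameter names below come from neonbjb's tortoise-tts and the web UI may label or handle them differently, so treat this as an approximation rather than the repo's exact behavior (the voice name is hypothetical):

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# "myvoice" is a hypothetical voice folder holding the reference clips/latents.
voice_samples, conditioning_latents = load_voice("myvoice")

audio = tts.tts(
    "The quick brown fox jumps over the lazy dog.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    num_autoregressive_samples=6,  # "samples" above
    diffusion_iterations=40,       # "iterations" above
    temperature=0.8,               # high temperature (0.7-1)
    top_p=0.3,                     # low Top P
    cvvp_amount=0.0,               # CVVP off
    cond_free_k=1.0,               # lowered cond-free K
)

torchaudio.save("out.wav", audio.squeeze(0).cpu(), 24000)
```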
It usually takes around 30 seconds on a T4 with 15 GB VRAM in free-tier Colab for roughly 20 words and 6 seconds of audio. I'm sure that will get faster with more optimizations.
Ok, we have a new winner, coming in with nice quality at 23 seconds; see the image attached.
Edit: Just tried a new passage, 232 words (spoken over 1 min 20 s) in 283.5 seconds with those settings.
So that's roughly 1 second of generated audio per 3.5 seconds of processing (283.5 s / 80 s of speech).
What loss_mel_ce and loss_text_ce do you shoot for? I assume they are measurements of the loss when comparing generated clips to the mel spectrograms and text of the validation sets?
It's all going to depend on what the needs of the audio are. As mentioned previously, my use case is audiobooks, which is a far cry from everyday AI raps and meme content on YouTube (where it's actually "better" to be worse, more uncanny-valley territory). So the quality and length of audio you need will dictate the standards for your parameters. Unfortunately, there aren't any general guidelines across models in terms of loss rates and inference parameters, simply because much of it builds off the quality of the training set and the difficulty of learning the speaker's prosody, intonation, and inflections. Overall, it's better to keep testing the checkpoints along the way, until the output no longer changes or begins to get worse. I will disagree with gforce on the auto-slicing bit; I feel it's better to hand-slice, particularly to get the most accurate prosody by not splitting phrases halfway. All of this said, YMMV, and it's better to just get started and give the most care to the dataset upfront.
Edit: One note I forgot to mention about auto-slicing: another key benefit of hand-slicing is being able to keep your training data at a consistent length. Again, in my case, because I am creating audiobooks, keeping the vast majority of my clips at longer lengths helps with inferencing longer sentences with minimal hallucinations. Again, decide based on whatever you're trying to do with the audio.
Upon further testing, I've found much of the poorer audio quality is due to phasing in the output, an artifact caused in this case by the natural variance in voice frequencies across audio clips being smushed together when trained in aggregate. Removing it cleans up the output enormously, to the point of being indistinguishable from the target speech. I'm not sure how this could be controlled for in the training data, but it should be a relatively easy and lightweight fix if applied to the output within TorToiSe itself.
First, many thanks for starting this wonderful thread. Your comments align well with my own experiments and observations so it's great to have another data point. You've done an amazing job of laying it out clearly and concisely.
I do still run into issues with output quality rather often and I work around it by generating a half dozen samples and picking the best one, but it's very time consuming.
Can you expand a bit on what you mean regarding phasing in the output? I'm curious to know if this is a similar issue. Could you perhaps upload some samples of output that exhibits the phasing artifacts?
When you say that "Removing that cleans up the output enormously," is that something that can be done by post processing in a program like Adobe Audition or Audacity?
Well, I don't have any examples on hand to share, but truthfully an untrained ear won't hear the nuances anyway. If you consume AI vocal content like I do, then you've probably picked up on the typical monotone and robotic/metallic/autotuned/annoying elements of typical output. Aside from the monotone (which is just a bad underlying prosody model), the causes of the robotic sound are varied, but their effects sound very similar.
Phasing - there is a slight reverb character to the speech, like an electronic echo. This is most likely from training on out-of-sync stereo channels mixed down into mono, creating two offset copies of the training speech that get reflected in the final output. There are postprocessing techniques for removing this or aligning the phases, but it's better to train on isolated-channel audio instead (for a variety of other reasons as well); a quick offset check is sketched after this list.
Unpleasant frequencies - human speech typically sits within the range of 100 Hz to 10,000 Hz, with the comfortable spectrum generally in the 150-7000 Hz region. Removing frequencies outside the comfort zone provides unconscious comfort to the listener (the band-pass in the sketch after this list is one way to do it).
Pitch - I notice that my outputs are very treble-heavy and lack the 'body' of the lower ranges. I'm not sure if this is a matter of the dataset or of TorToiSe, but regardless it needs to be addressed for any voice that isn't an anime waifu.
Harmonics/Resonances - The trickiest part to correct is unpleasant harmonics caused by frequency resonances within the voice itself. These resonances tend to reinforce one another (in a painful, shrill way), and addressing them means dampening the power in one peak or the other. The problem with that, however, is sacrificing either the clarity or the body of the sound, and in some cases it's so bad that a regen of the target output is needed. It's not really anything that can be controlled at the dataset level, and it requires careful consideration to address. Unfortunately, this simply seems to be an aspect of the nature of machine learning as it currently stands.
Human error - any post-processing audio work may itself introduce some of these or other problems, and having a target reference audio will help with decision making in that process. I anticipate the future will bring improved generative models that address some of these problems before they make it to your download folder, or at the very least automated dynamic remastering tools specifically for AI-generated vocals.
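Here is the sketch mentioned above: an offset check for the phasing item and a 150-7000 Hz band-pass for the comfort zone. The frequency band comes from the list above; the file names, filter order, and choice of numpy/scipy/soundfile are my own assumptions, not anything the repo ships with:

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, correlate, sosfiltfilt

# --- Phasing check: estimate the offset between stereo channels -------------
# A non-zero offset means a mono mixdown will smear the speech; train on a
# single isolated channel instead.
stereo, sr = sf.read("training_clip.wav")     # hypothetical stereo source clip
if stereo.ndim == 2 and stereo.shape[1] == 2:
    left, right = stereo[:, 0], stereo[:, 1]
    xcorr = correlate(left, right, mode="full", method="fft")
    offset = int(np.argmax(xcorr)) - (len(right) - 1)
    print(f"channel offset: {offset} samples ({offset / sr * 1000:.2f} ms)")

# --- Comfort-zone band-pass on a generated output ----------------------------
gen, sr = sf.read("generated_output.wav")     # hypothetical model output
sos = butter(4, [150, 7000], btype="bandpass", fs=sr, output="sos")
cleaned = sosfiltfilt(sos, gen, axis=0)       # zero-phase filtering, no added delay
sf.write("generated_output_eq.wav", cleaned, sr)
```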
Finally, the days of it being acceptable to make YouTube raps from a series of Frankensteined clips are basically over. Similarly, the ever-dissatisfied Joe Public will demand more and more perfection to compensate for their lack of imagination. Fuck em.
Oh yeah, making more of an effort is a good idea; for me it was simply laziness and wanting to rush to the result, as well as some frustration with trying to get the training started in the first place. It does explain why some of the ends of sentences don't sound like they ended naturally, as if the speaker had been distracted and stopped talking. I'll have a look at manual slicing next time, now that I've got a rough idea of what training parameters to use.
What a journey this has been so far! I can now hear bad recording practices in every piece of audio I come across, particularly on YouTube... There is a new web app, clonedub.com (using what I assume is Google's Translatotron 3), for voice-to-voice multilingual translation while retaining the original speaker's voice... essentially a universal translator from Star Trek, though it's not real-time (yet).
How do you go about manually changing the whisper.json to re-slice without messing with the metrics that the Whisper JSON contains? Will that not cause some sort of conflict, or is that extra data just for evaluation purposes?
I guess I obviously have to include the tokens from the segment I want to merge it with.
Eg:
"temperature": 0.0,
"avg_logprob": -0.28843376011524385,
"compression_ratio": 1.5317460317460319,
"no_speech_prob": 0.2593795359134674
I've also found, looking through my training samples, that there are many that are very short, probably because I didn't validate. It's surprising the model turned out so well considering.
Idk, haven't edited the whisper.json
I started editing it, but it was so tedious that I decided to redo the sample set again in Audacity and actually validate this time to cull the very short segments.
A neat trick I ran into is running the generated audio through RVC with its own Harvest-trained model; it helps smooth out the audio and make it more consistent.
Have you played around with this? https://github.com/ming024/FastSpeech2
I haven't... Is there some reason to? The repo hasn't been updated in 2 years. https://cmchien.ttic.edu/FastSpeech2/ The speech samples sound super metallic.
There are neat tricks. Sometimes a dataset/model, for whatever reason, can't pronounce a word correctly, or seems stuck on a particular "interpretation" of the prompt, especially with unusual or complex multi-syllable words and names.
For instance, if a prompt like "Giorgio Armani" wasn't being pronounced correctly, creating a line like this:
Giorgio Armani; Giorgio Armani. Giorgio Armani, Giorgio Armani:
or swapping the order of the punctuation can nudge the prompt enough that it miraculously pronounces it correctly.
Another tidbit: if a sentence itself is problematic for some reason, e.g. "While we sat in the car park, the ravens arrived and ate the bread bits we left on the windscreen.", it can be modified, broken up, and glued back together afterwards:
While we sat in the car park: the ravens arrived;
and ate the bread bits,
we left on the windscreen.
Something like that... basically trial and error, and being willing to subtly rewrite the sentence. Also, having a dedicated seed is pretty much mandatory (from experience).
Having worked with this repo a LOT at this point, it is possible to have reliable set-it-and-forget-it settings, provided there's post-remastering that more or less evens out the variation between clips... Long generations are basically a bunch of glued-together clips on a single seed and settings, but there will still be variance, so be sure to have some process in place to smooth it all out afterwards.
Also, recently, having trained 3 new models on different speakers, I've noticed there's a strong bias within the repo itself towards up-pitching the model. Whether that has to do with something more fundamental in the math, shrugs, but it's something to consider. This bias is much worse if the dataset audio has dynamic vocals. While training on sung vocals hasn't been tried, even deep-voiced male speakers delivering emphatic speeches will often come out sounding effeminate or outright female.