Can't train a single good model #160
Reference: mrq/ai-voice-cloning#160
I've tried training multiple models with different voices for each, but I can never get it to actually produce a good result. The graphs always look like the attached one (never reaching low loss numbers), and using the model always just results in silence or nonsense. The attached graph is from a current model I'm training which is from 45 minutes of normal talking. Is something wrong with my training settings or am I just getting really unlucky with my dataset?
How closely does the transcription in train.txt match the content of the audio clips?
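If you want more than eyeballing a few lines, a quick script along these lines can flag obvious problems (this assumes the usual `relative/path.wav|transcription` layout per line, and the dataset path is just an example; adjust for how your dataset is actually laid out):

```python
# Rough spot-check for a dataset's train.txt (assumes "relative/path.wav|transcription"
# per line; the dataset directory below is only an example).
from pathlib import Path

dataset_dir = Path("./training/myvoice")  # example path, adjust to your dataset
problems = 0
for lineno, line in enumerate((dataset_dir / "train.txt").read_text(encoding="utf-8").splitlines(), 1):
    if not line.strip():
        continue
    audio_rel, _, text = line.partition("|")
    if not (dataset_dir / audio_rel).exists():
        print(f"line {lineno}: missing audio file {audio_rel}")
        problems += 1
    if not text.strip():
        print(f"line {lineno}: empty transcription")
        problems += 1
print(f"{problems} problem(s) found")
```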
From a quick glance and comparison of a few lines it looks like it matches almost perfectly
Have you tried training a model with a single voice, for comparison?
By multiple different voices I mean I've made multiple models, each trained on a single voice. I've tried about 6 models with a different person for each, and none of them have come out well at all. My wording in the initial question was a bit shit, tbf.
How big is your dataset size and how different is it from "standard" English speech?
The dataset from the images is about 45 minutes of regular volume talking in an American accent. It is just normal English speech.
Weird, that sounds just about ideal. Are there any complications like reverb or background music?
I'm having issues too. I trained a model with a single-voice dataset normalized to between 1-11 seconds per clip, using a recent version of the repo, and got a terrible voice that was way too deep.
I tried redoing it with commit 0231550287 from about 2 weeks ago, and the output was much better, close to the dataset voice. The training ran much faster too.

I'm not sure why the training's become so significantly worse with the newer commits. I wonder if it's related to #103.
Did redoing it include re-preparing the dataset using the old version? I've had terrible luck with the audio slicing in the newer versions.
Nope. I reused the exact same audio files and train.txt transcriptions.

This repo itself doesn't contain any training code, just code that interfaces with the training scripts in DLAS. The only fundamental difference with using an older version of the web UI is the default value it gives for voice latent chunk sizes. Which goes back to the main thing I keep telling you all: play around with the damn voice latent chunk size slider. The defaults will never, ever be a catch-all size. You will always find a better value if you take the time and play around with it to find the value that produces the best results.
Shouldn't be. Finetuned models have inherently been flawed before that. Sure, some of my tests sounded fine after bruteforcing longer training, but those were for voices that already sounded fine in zero-shot with the base AR model. It's documented that people were having issues before that regardless, and I would not suggest haphazardly reverting to older commits, because then they'd be back at square one with bad models.

Anyways, I'm assuming you did not play around with the voice latent chunk slider, especially since it's been a recurring issue these past two months that people keep neglecting.
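For intuition on why that value matters so much: conceptually the reference audio gets carved into that many chunks and the per-chunk conditioning embeddings get averaged, so the count controls how much gets smeared together. A rough sketch of the idea only (not the repo's or TorToiSe's actual code; `encode` is a stand-in for the model's conditioning encoder):

```python
# Conceptual sketch: how a chunk count turns reference audio into one averaged
# conditioning latent. `encode` is a stand-in for whatever embeds a chunk;
# it is NOT the repo's real API.
import torch

def average_latent(clips: list[torch.Tensor], chunk_count: int, encode) -> torch.Tensor:
    audio = torch.cat(clips, dim=-1)                 # all reference clips end to end
    chunk_len = audio.shape[-1] // chunk_count       # higher count -> shorter chunks
    chunks = [audio[..., i * chunk_len:(i + 1) * chunk_len] for i in range(chunk_count)]
    return torch.stack([encode(c) for c in chunks]).mean(dim=0)

# Toy usage with a dummy encoder, just to show the mechanics.
dummy_encode = lambda chunk: chunk.mean(dim=-1)         # placeholder, not a real encoder
clips = [torch.randn(1, 22050 * 4) for _ in range(8)]   # eight fake 4-second clips @ 22.05 kHz
latent = average_latent(clips, chunk_count=4, encode=dummy_encode)
```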
On the Wiki you wrote:
When using a prepared dataset, does the value in the Voice Chunks field still matter? (Or are you referring to Auto-Calculate Voice Chunk Duration (in seconds) in Settings? Neither are sliders.)

It used to be a slider; I forgot I made it a number input, because sliders have to have an arbitrary cap and number inputs don't.
Regardless of semantics, the same principle I've preached here and on the wiki applies: play around with it.
I suppose it's on me for not embiggening the emphasis enough to play with the voice chunk values, or for being too busy to keep up with documentation.
That (the Auto-Calculate Voice Chunk Duration setting) is just a shortcut value for what gets suggested if a training dataset has not already been prepared.
Edit: After this experiment I rechecked Embed Output Metadata, and now it's embedding every cond_latents file in ./voices/<voice> into the .wav, which I'm pretty sure it wasn't doing before I unchecked it.

...because you need to click (Re)compute Voice Latents when you want to regenerate them. I don't have ways to regenerate latents automatically when there's a change in chunk size, hence the button.
Because latents only get recomputed automatically when the Voice field changes.

Must be something with DLAS; I actually used the same latents file with the old and new model when testing.
🤦
Anyway, with regenerating the latents between each:
512 chunks: https://vocaroo.com/1nhNPGGaw7Cv
256 chunks: https://vocaroo.com/17jhdbhpjHA3
128 chunks: https://vocaroo.com/11CqV5kFNgJa
The file hashes are different but if you can spot the difference by listening to them you've got better ears than I.
Strange. I suppose I'll have my 2060 bake up a finetune throughout the day for regression tests; my 2x6800XTs will be occupied for a long while.
Too large. Start small and increase upwards.
With a large dataset, small values OOM.
Use a small subset then.
The other main problem, I imagine, is using too large of a dataset for latents and expecting things to be peachy keen when you're just muddying up shit when it's all averaged out. This is where the original TorToiSe thrives as it's only using the first 4 seconds of each sound file.
Although, I'm sure somewhere I've mentioned you should just use an audio clip that's as close to what you're generating to best capture the latents, but at this point my documentation doesn't seem to matter.
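For the record, that "only the first 4 seconds" behaviour amounts to roughly this (a sketch, not TorToiSe's actual code; torchaudio and the 22.05 kHz rate here are just assumptions for illustration):

```python
# Sketch of trimming each reference clip to its first ~4 seconds before it
# feeds the conditioning, as described above. Not TorToiSe's actual code.
import torchaudio

def first_seconds(path: str, seconds: float = 4.0, target_sr: int = 22050):
    wav, sr = torchaudio.load(path)
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    return wav[..., : int(seconds * target_sr)]  # keep only the head of the clip
```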
With a small subset (8 clips of ~4 seconds each):
1 chunk: https://vocaroo.com/15lY8pR1WRhb
2 chunks: https://vocaroo.com/19R30vtl8gjn
4 chunks: https://vocaroo.com/1g23prFUhQjG
8 chunks: https://vocaroo.com/17GWbY7IuIlL
16 chunks: https://vocaroo.com/1lBJiZQuDAh5
32 chunks: https://vocaroo.com/1akWsttveC6C
64 chunks: https://vocaroo.com/16YEcbVCm6EL
¯\_(ツ)_/¯
It's not like it sounds bad... Compared to the original it's fairly close (although the model could probably use a couple hundred more epochs to capture finer details of the accent), but I think that the qualitative difference made by varying the chunk count is being oversold: just eyeballing the spectrograms and fpcalc (chromaprint) signatures it looks like changing the seed makes far more of a difference to the output than chunk count. To quantify exactly how much I'll need to do some xor'ing and establish a baseline though.
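The kind of comparison I have in mind, roughly (the helper names are my own; the only thing relied on is that fpcalc -raw prints the fingerprint as comma-separated integers):

```python
# Rough sketch of the fingerprint comparison: pull raw chromaprint fingerprints
# with fpcalc and count differing bits between two renders.
import subprocess

def raw_fingerprint(path: str) -> list[int]:
    out = subprocess.run(["fpcalc", "-raw", path],
                         capture_output=True, text=True, check=True).stdout
    line = next(l for l in out.splitlines() if l.startswith("FINGERPRINT="))
    return [int(v) for v in line.split("=", 1)[1].split(",")]

def bit_distance(path_a: str, path_b: str) -> int:
    fa, fb = raw_fingerprint(path_a), raw_fingerprint(path_b)
    return sum(bin(a ^ b).count("1") for a, b in zip(fa, fb))

# e.g. compare two chunk counts against two seeds to see which moves the output more
```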
Edit: After wider testing I've found that chunk count might have a far larger impact if the dataset you're using is one big file versus lots of smaller files. I had 632 clips of under 12 seconds each because I preprocessed the dataset for that model (there was more than one speaker so I used ffmpeg to segment it following the timestamps in the transcript). Testing on another model with a monolithic dataset showed greater variability.
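For anyone curious, the slicing was along these lines (hedged: the timestamp source, naming, and paths are just examples, not the repo's convention; ffmpeg is driven from Python here to keep it in one place):

```python
# Rough example of cutting per-speaker segments out of one big file with ffmpeg,
# driven by (start, end, speaker) timestamps taken from a transcript.
import subprocess

def cut_segment(src: str, start: float, end: float, out_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", src,
        "-ss", f"{start:.3f}",   # segment start in seconds
        "-to", f"{end:.3f}",     # segment end in seconds
        "-c:a", "pcm_s16le",     # re-encode to plain PCM wav
        out_path,
    ], check=True)

segments = [(12.40, 19.90, "alice"), (21.05, 27.30, "bob")]  # example timestamps
for i, (start, end, speaker) in enumerate(segments):
    cut_segment("interview.wav", start, end, f"clips/{speaker}_{i:04d}.wav")
```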
After you've trained a model am I correct in saying that the voice chunks should be set to 0 when you're using that model?
AIUI when set to 0 it'll automatically choose a chunk count based on the value set for Auto-Calculate Voice Chunk Duration on the Settings tab, unless there's already a matching cond_latents_<model_id>.pth in the folder for the voice you're using.
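In pseudocode, my understanding of that fallback (function and variable names are made up for illustration, not the web UI's actual internals; only the behaviour described above is what I'm claiming):

```python
# Sketch of the chunk-count fallback: 0 = auto-calculate from duration,
# and cached cond_latents for the selected model win over recomputing.
from pathlib import Path
from typing import Optional

def pick_chunk_count(requested: int, total_duration_s: float,
                     auto_chunk_duration_s: float,
                     voice_dir: Path, model_id: str) -> Optional[int]:
    """Return a chunk count to compute latents with, or None to reuse cached ones."""
    cached = voice_dir / f"cond_latents_{model_id}.pth"
    if cached.exists():
        return None                                   # existing latents get reused
    if requested > 0:
        return requested                              # explicit value wins
    # 0 means auto: derive the count from the configured chunk duration
    return max(1, round(total_duration_s / auto_chunk_duration_s))
```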
psammites, this checks the boxes for what I'm also trying to duplicate, but with a different English accent (Eastern European). You really generated that with 8 clips of 4 seconds each? I've got 25 clips of somewhat longer length. My original attempt at training seemed to have yielded nothing; there wasn't even an accent.