Can't train a single good model #160

Open
opened 2023-03-20 21:38:45 +07:00 by dhume · 24 comments

I've tried training multiple models with different voices for each, but I can never get it to actually produce a good result. The graphs always look like the attached one (never reaching low loss numbers), and using the model always just results in silence or nonsense. The attached graph is from a current model I'm training which is from 45 minutes of normal talking. Is something wrong with my training settings or am I just getting really unlucky with my dataset?

How closely does the transcription in train.txt match the content of the audio clips?

From a quick glance and comparison of a few lines it looks like it matches almost perfectly

Have you tried training a model with a single voice, for comparison?

By multiple different voices I mean I've tried making multiple models with single voices. So I've tried like 6 models with a different person on each model and none have come out well at all. My wording in the initial question was a bit shit tbf.

How big is your dataset, and how different is it from "standard" English speech?

The dataset from the images is about 45 minutes of regular volume talking in an American accent. It is just normal English speech.

Weird, that sounds just about ideal. Are there any complications like reverb or background music?

I'm having issues too. I trained a model on a single-voice dataset normalized to between 1 and 11 seconds per clip, using a recent version of the repo, and got a terrible voice that was way too deep.

I tried redoing it with commit 0231550287 from about 2 weeks ago, and the output was much better; close to the dataset voice. The training ran much faster too.

I'm not sure why the training's become so significantly worse with the newer commits. I wonder if it's related to #103

> I tried redoing it with commit 0231550287 from about 2 weeks ago, and the output was much better; close to the dataset voice. The training ran much faster too.

Did redoing it include re-preparing the dataset using the old version? I've had terrible luck with the audio slicing in the newer versions.

> Did redoing it include re-preparing the dataset using the old version? I've had terrible luck with the audio slicing in the newer versions.

Nope. I reused the exact same audio files and train.txt transcriptions.

> I tried redoing it with commit 0231550287 from about 2 weeks ago, and the output was much better; close to the dataset voice. The training ran much faster too.

This repo itself doesn't contain any training code, just code that interfaces with the training scripts in DLAS. The only fundamental difference with using an older version of the web UI is the default value it gives for voice latent chunk sizes. Which goes back to the main thing I keep telling you all: play around with the damn voice latent chunk size slider. The defaults will never, ever be a catch-all size. You will always find a better value if you take the time and play around with it to find the value that produces the best results.

> I wonder if it's related to #103

Shouldn't be. Finetuned models have inherently been flawed before that. Sure, some of my tests sounded fine after bruteforcing longer training, but those were for voices that already sounded fine in zero-shot with the base AR model. It's documented that people were having issues before that regardless, and I would not suggest people haphazardly revert to older commits, because then they'd be back at square one with bad models.


Anyways. I'm assuming you did not play around with the voice latent chunk slider, especially since it's been a recurring issue over the past two months that people keep neglecting.

> Which goes back to the main thing I keep telling you all: play around with the damn voice latent chunk size slider. The defaults will never, ever be a catch-all size.

On the Wiki you wrote:

> if you've created an LJSpeech dataset (Under `Training` > `Prepare Dataset`), this will automatically set to 0, hinting for the routine to use the dataset audio and padding them to a common size, for a little more accurate capturing of the latents.

When using a prepared dataset, does the value in the `Voice Chunks` field still matter? (Or are you referring to `Auto-Calculate Voice Chunk Duration (in seconds)` in Settings? Neither is a slider.)

It used to be a slider; I forgot I made it a number input because sliders have to have an arbitrary cap and number inputs don't:
![image](/attachments/7d6f1588-a99e-46dd-9d39-7df7290c91a0)

Regardless of semantics, the same principle I've preached applies: [play](https://git.ecker.tech/mrq/ai-voice-cloning/issues/69#issuecomment-580) [around](https://git.ecker.tech/mrq/ai-voice-cloning/issues/41#issuecomment-450) [with it](https://git.ecker.tech/mrq/ai-voice-cloning/issues/113#issuecomment-821), and on the wiki:

> Playing around with this will most definitely affect the output of your cloning, as some datasets will work better with different values.

I suppose it's on me for not embiggening the emphasis enough to play with the voice chunk values, or for being too busy to keep up with documentation.

> (Or are you referring to Auto-Calculate Voice Chunk Duration (in seconds)

That's just a shortcut value for what gets suggested if a training dataset has not already been prepared.

> Regardless of semantics, the same principle I've preached applies: play around with it

```
sneed@FMRLYCHKS:~/ai-voice-cloning/results/HyeonSeo$ ll
total 849648
drwxrwxrwx 1 sneed sneed     4096 Mar 22 15:47 ./
drwxrwxrwx 1 sneed sneed     4096 Mar 22 04:58 ../
-rwxrwxrwx 1 sneed sneed 72081290 Mar 22 15:39 HyeonSeo_00000_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:39 HyeonSeo_00000_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081291 Mar 22 15:40 HyeonSeo_00001_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:40 HyeonSeo_00001_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081290 Mar 22 15:40 HyeonSeo_00002_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:40 HyeonSeo_00002_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081291 Mar 22 15:41 HyeonSeo_00003_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:41 HyeonSeo_00003_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081291 Mar 22 15:42 HyeonSeo_00004_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:42 HyeonSeo_00004_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081291 Mar 22 15:42 HyeonSeo_00005_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:42 HyeonSeo_00005_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081292 Mar 22 15:43 HyeonSeo_00006_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:43 HyeonSeo_00006_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081292 Mar 22 15:44 HyeonSeo_00007_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:44 HyeonSeo_00007_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081293 Mar 22 15:44 HyeonSeo_00008_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:44 HyeonSeo_00008_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081293 Mar 22 15:44 HyeonSeo_00009_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:44 HyeonSeo_00009_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081293 Mar 22 15:46 HyeonSeo_00010_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:46 HyeonSeo_00010_fixed.wav*
-rwxrwxrwx 1 sneed sneed 72081294 Mar 22 15:47 HyeonSeo_00011_fixed.json*
-rwxrwxrwx 1 sneed sneed   421524 Mar 22 15:47 HyeonSeo_00011_fixed.wav*
sneed@FMRLYCHKS:~/ai-voice-cloning/results/HyeonSeo$ grep 'chunks' *.json
HyeonSeo_00000_fixed.json:      "voice_latents_chunks": 0,
HyeonSeo_00001_fixed.json:      "voice_latents_chunks": 1,
HyeonSeo_00002_fixed.json:      "voice_latents_chunks": 2,
HyeonSeo_00003_fixed.json:      "voice_latents_chunks": 4,
HyeonSeo_00004_fixed.json:      "voice_latents_chunks": 8,
HyeonSeo_00005_fixed.json:      "voice_latents_chunks": 16,
HyeonSeo_00006_fixed.json:      "voice_latents_chunks": 32,
HyeonSeo_00007_fixed.json:      "voice_latents_chunks": 64,
HyeonSeo_00008_fixed.json:      "voice_latents_chunks": 128,
HyeonSeo_00009_fixed.json:      "voice_latents_chunks": 256,
HyeonSeo_00010_fixed.json:      "voice_latents_chunks": 512,
HyeonSeo_00011_fixed.json:      "voice_latents_chunks": 1024,
sneed@FMRLYCHKS:~/ai-voice-cloning/results/HyeonSeo$ sha256sum *.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00000_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00001_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00002_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00003_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00004_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00005_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00006_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00007_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00008_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00009_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00010_fixed.wav
81e2ae789cae4edfcfdb5b70b92d3c5bd372a8cdf9ffda8db0b51818ac9bf506  HyeonSeo_00011_fixed.wav
```

Edit: After this experiment I re-checked `Embed Output Metadata`, and now it's embedding every cond_latents file in `./voices/<voice>` into the .wav, which I'm pretty sure it wasn't doing before I unchecked it.

...because you need to click `(Re)compute Voice Latents` when you want to regenerate them.

I don't have ways to regenerate latents automatically when there's a change in chunk size, hence the button.

>b-but then why is it in the generation settings JSON???

Because:

  • it's an input passed to inference for when it needs to generate latents anyways
  • it's an input passed to inference to get dumped to disk as settings to load on next startup (despite it getting overwritten anyways when the Voice field changes)
  • all settings passed to the inference function get saved, because they're passed as kwargs to make my damned life easier when adding more arguments (the pattern is sketched below).
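
A minimal, hypothetical illustration of that kwargs pattern (function and file names are mine, not the repo's): whatever keyword arguments reach the inference call get dumped next to the output, which is why `voice_latents_chunks` shows up in the settings JSON even when no latents were recomputed.

```python
# Hypothetical sketch of the "dump all kwargs" pattern described above;
# this is not the actual ai-voice-cloning code.
import json

def generate(text, voice, voice_latents_chunks=0, seed=None, **kwargs):
    settings = {"text": text, "voice": voice,
                "voice_latents_chunks": voice_latents_chunks,
                "seed": seed, **kwargs}
    # ... inference would happen here ...
    # Every argument gets written next to the output, whether or not it
    # actually influenced this particular generation.
    with open("results/output_00000.json", "w") as f:
        json.dump(settings, f, indent=2)
```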

> This repo itself doesn't contain any training code, just code that interfaces with the training scripts in DLAS. The only fundamental difference with using an older version of the web UI is the default value it gives for voice latent chunk sizes.
>
> Anyways. I'm assuming you did not play around with the voice latent chunk slider, especially since it's been a recurring issue over the past two months that people keep neglecting.

Must be something with DLAS; I actually used the same latents file with the old and new model when testing.

> ...because you need to click `(Re)compute Voice Latents` when you want to regenerate them.

<face palm emoji>

Anyway, with regenerating the latents between each:

512 chunks: https://vocaroo.com/1nhNPGGaw7Cv
256 chunks: https://vocaroo.com/17jhdbhpjHA3
128 chunks: https://vocaroo.com/11CqV5kFNgJa

The file hashes are different but if you can spot the difference by listening to them you've got better ears than I.

> Must be something with DLAS; I actually used the same latents file with the old and new model when testing.

Strange. I suppose I'll have my 2060 bake up a finetune throughout the day for regression tests; my 2x6800XTs will be occupied for a long while.

> 512
> 256
> 128

Too large. Start small and increase upwards.

> Too large. Start small and increase upwards.

With a large dataset, small values OOM.

Use a small subset then.

The other main problem, I imagine, is using too large of a dataset for latents and expecting things to be peachy keen when you're just muddying shit up once it's all averaged out. This is where the original TorToiSe thrives, as it only uses the first 4 seconds of each sound file.

Although, I'm sure somewhere I've mentioned you should just use an audio clip that's as close as possible to what you're generating to best capture the latents, but at this point my documentation doesn't seem to matter.
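
If you want to mimic that "first few seconds of each clip" behaviour by hand, a rough sketch (the folder names and the 8-clip/4-second choices are assumptions for illustration, not something the web UI does for you):

```python
# Rough, hypothetical helper: build a small "subset" voice folder containing
# only the first 4 seconds of a handful of clips, mimicking how the original
# TorToiSe only uses the start of each sound file.
# Requires ffmpeg on PATH; the paths and clip count are illustrative.
import subprocess
from pathlib import Path

src = Path("voices/my_voice")
dst = Path("voices/my_voice_subset")
dst.mkdir(parents=True, exist_ok=True)

for wav in sorted(src.glob("*.wav"))[:8]:      # just a handful of clips
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav), "-t", "4", str(dst / wav.name)],
        check=True,
    )
```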

> Use a small subset then.

With a small subset (8 clips of ~4 seconds each):

1 chunk: https://vocaroo.com/15lY8pR1WRhb
2 chunks: https://vocaroo.com/19R30vtl8gjn
4 chunks: https://vocaroo.com/1g23prFUhQjG
8 chunks: https://vocaroo.com/17GWbY7IuIlL
16 chunks: https://vocaroo.com/1lBJiZQuDAh5
32 chunks: https://vocaroo.com/1akWsttveC6C
64 chunks: https://vocaroo.com/16YEcbVCm6EL

¯\_(ツ)_/¯

It's not like it sounds *bad*... Compared [to the original](https://www.youtube.com/watch?v=Aoq5wDLbIqE) it's fairly close (although the model could probably use a couple hundred more epochs to capture finer details of the accent), but I think the qualitative difference made by varying the chunk count is being oversold: just eyeballing the spectrograms and fpcalc (Chromaprint) signatures, it looks like changing the seed makes far more of a difference to the output than the chunk count does. To quantify exactly how much, I'll need to do some XOR'ing and establish a baseline, though.
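
A rough way to put a number on that (a sketch, with placeholder filenames): pull the raw Chromaprint fingerprints with `fpcalc -raw` and count how many bits differ between two renders.

```python
# Sketch: quantify how different two renders sound by XOR'ing their raw
# Chromaprint fingerprints and counting differing bits.
# Requires fpcalc (chromaprint) on PATH; the filenames are placeholders.
import subprocess

def raw_fingerprint(path):
    out = subprocess.run(["fpcalc", "-raw", path],
                         capture_output=True, text=True, check=True).stdout
    line = next(ln for ln in out.splitlines() if ln.startswith("FINGERPRINT="))
    return [int(x) for x in line.split("=", 1)[1].split(",")]

def bit_difference(a, b):
    # Each element is a 32-bit sub-fingerprint; compare the overlapping part.
    n = min(len(a), len(b))
    diff = sum(bin(x ^ y).count("1") for x, y in zip(a[:n], b[:n]))
    return diff / (32 * n)

print(bit_difference(raw_fingerprint("render_128_chunks.wav"),
                     raw_fingerprint("render_256_chunks.wav")))
```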

Edit: After wider testing I've found that chunk count might have a far larger impact if the dataset you're using is one big file versus lots of smaller files. I had 632 clips of under 12 seconds each because I preprocessed the dataset for that model (there was more than one speaker so I used ffmpeg to segment it following the timestamps in the transcript). Testing on another model with a monolithic dataset showed greater variability.
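
For reference, that kind of timestamp-based splitting looks roughly like the sketch below (the CSV layout, speaker label, and paths are assumptions, not the actual files used here):

```python
# Hypothetical example of cutting one long recording into per-line clips with
# ffmpeg, following start/end timestamps (in seconds) from a transcript CSV.
# Column names, speaker label, and paths are assumptions for illustration.
import csv
import subprocess
from pathlib import Path

out_dir = Path("voices/speaker_a")
out_dir.mkdir(parents=True, exist_ok=True)

with open("transcript_with_timestamps.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):   # start,end,speaker,text
        if row["speaker"] != "speaker_a":
            continue                               # keep only one speaker
        subprocess.run(
            ["ffmpeg", "-y", "-i", "full_recording.wav",
             "-ss", row["start"], "-to", row["end"],
             str(out_dir / f"clip_{i:05d}.wav")],
            check=True,
        )
```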

After you've trained a model am I correct in saying that the voice chunks should be set to 0 when you're using that model?

> After you've trained a model am I correct in saying that the voice chunks should be set to 0 when you're using that model?

AIUI, when set to 0 it'll automatically choose a chunk count based on the value set for `Auto-Calculate Voice Chunk Duration` on the Settings tab, unless there's already a matching cond_latents_<model_id>.pth in the folder for the voice you're using.
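
If that understanding is right, the behaviour amounts to something like the sketch below (made-up names; not the actual web UI code):

```python
# Sketch of the "voice chunks = 0" behaviour as described above.
# Names are made up; this is not the actual ai-voice-cloning logic.
import os

def resolve_chunk_count(requested_chunks, voice_dir, model_id,
                        total_voice_seconds, auto_chunk_duration):
    if requested_chunks > 0:
        return requested_chunks
    latents = os.path.join(voice_dir, f"cond_latents_{model_id}.pth")
    if os.path.exists(latents):
        return 0  # reuse the existing latents, nothing to recompute
    # Otherwise derive a chunk count from the configured duration.
    return max(1, round(total_voice_seconds / auto_chunk_duration))
```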

> With a small subset (8 clips of ~4 seconds each): [...]
>
> It's not like it sounds *bad*... Compared to the original it's fairly close (although the model could probably use a couple hundred more epochs to capture finer details of the accent) [...]

psammites,
This checks the boxes for what I'm also trying to do, just with a different English accent (Eastern European). You really generated that with 8 clips of 4 seconds each? I've got 25 clips, each a bit longer than that. My original attempt at training seemed to yield nothing; there wasn't even an accent.
