VALL-E Integration (and In Response To TorToiSe: a Quick Retrospective) #152

Open
opened 2023-03-18 04:21:02 +00:00 by mrq · 234 comments
Owner

As I may have hinted with my not-so-subtle commits, I'm working towards getting VALL-E integrated as an alternative TTS backend:

  • you can switch to it by passing --tts-backend="vall-e"
    • I might have to keep it this way, as not every option will also carry over for VALL-E.
  • currently, only training is integrated, as I need to get models cobbled together to inference against.
    • metric grabbing almost works, but I need to reparse the log files to fetch previous metrics like I did with DLAS.

I'm backing this implementation as my VALL-E implementation:

  • it's nice and clean, and requires little extra dependencies (phonemizer + a backend, like espeak-ng).
  • the other implementation:
    • isn't so nice and clean.
    • requires some nasty dependencies (lhotse + k2 + icefall just for dataset preparation).
    • the documentation is CBT.
    • doesn't even work on my machine desu.
  • integration is fairly straightforward.
  • dataset preparation is robust.
    • I can phonemize the input text myself as IPAs (see the sketch after this list).
      • the default phonemizes to ARPABET using g2p-en, which ties it to English only.
    • the audio quantizer is easy to use.
    • similar to DLAS, I can just spawn a process for training and parse its output.
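
For illustration, the two preparation steps above boil down to very little code. A rough sketch, assuming phonemizer (with espeak-ng installed) and the encodec package; the clip path is hypothetical and this isn't the fork's actual preparation script:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from phonemizer import phonemize

# 1) Phonemize the transcript to IPA via espeak-ng instead of ARPABET.
text = "The birch canoe slid on the smooth planks."
phones = phonemize(text, language="en-us", backend="espeak", strip=True)

# 2) Quantize the audio into EnCodec residual codebook tokens.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)               # 8 codebooks at 24 kHz

wav, sr = torchaudio.load("clip.wav")         # hypothetical input clip
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # (1, n_q, T)

print(phones)
print(codes.shape)
```

Roughly speaking, the phoneme string and the codes tensor are what end up stored per utterance.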

The training process is pretty intuitive too:

  • it leverages DeepSpeed, so any development on it will improve the training process too.
  • there's a """REPL""" to pass commands through stdin to do stuff like save, report metrics, and quit (it saves on quit too); see the sketch after this list.
  • training configuration is pretty much just setting the max iterations, batch size, and save/eval rates.
  • the defaults seem to be sane enough to rely on, so no need to squabble around with the LR and scheduler or gradient accumulation sizes.
  • validation/evaluation is actually meaningful, as it generates audio against the model you can actually listen to for gauging how well it's chugging along.
  • resuming is automagically handled.
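
For reference, the "spawn a process and parse output" plus stdin REPL integration amounts to something like the sketch below; the launch command mirrors the one further down in this thread, while the config path and stopping condition are made up for illustration:

```python
import subprocess

# Hypothetical sketch: spawn the trainer as a child process, parse its stdout for
# metrics, and push commands ("save", "quit") through its stdin.
proc = subprocess.Popen(
    ["deepspeed", "--module", "vall_e.train", "yaml=./training/example/config.yaml"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
    bufsize=1,
)

for line in proc.stdout:
    print(line, end="")           # metric parsing would happen here
    if "it=1000" in line:         # made-up stopping condition for the example
        proc.stdin.write("save\n")
        proc.stdin.write("quit\n")
        proc.stdin.flush()
        break

proc.wait()
```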

However, I have some qualms with it so far:

  • no BitsAndBytes to save my hide, so it's quite the VRAM hog.
  • leveraging ZeRO seems to require editing the code that handles passing some configurations to DeepSpeed (see the sketch after this list).
    • it might work, but it can't be done when resuming.
  • probably even more sensitive than training with DLAS, as it will OOM on me easily without conservative batch sizes.
    • there's an autotuner within DeepSpeed, but it doesn't seem to want to work.
  • Linux only, as training with DeepSpeed is Linux only.
    • there's apparently support for inferencing on Windows, but anything that requires MSVC is major CBT.
  • doesn't seem to want to train under a CUDA backend (some unhelpful error message).
    • this also appeared when I tried the other implementation, so it might just be a CUDA issue, so ROCm only for me.
  • right now, I don't see an easy way to finetune models in the future. I'm sure there's a config setting to do so..
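
For context on the ZeRO and batch-size points above, most of those knobs live in the DeepSpeed config that gets handed to deepspeed.initialize(); a rough sketch with illustrative values (not this integration's actual configuration):

```python
import torch.nn as nn
import deepspeed

model = nn.Linear(8, 8)  # toy stand-in for the AR/NAR model

# Illustrative values only; run under the `deepspeed` launcher so the
# distributed environment variables are set up.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 2,
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2.0e-4, "weight_decay": 0.01},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": 0.0, "warmup_max_lr": 2.0e-4, "warmup_num_steps": 1000},
    },
    # the part that reportedly has to be threaded through the code by hand:
    "zero_optimization": {"stage": 2},
}

engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```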

And other concerns with VALL-E:

  • no base model, so anyone wanting to try it out needs to invest in a dataset, the compute, and the time.
    • gauging how well VALL-E performs is pretty much up to chance, as the demos don't seem all that great desu.
    • I can spend some money on renting GPUs and trying to create a general model, but it's grim when I can't get training to work on a paperspace instance.
  • requires two models, the AR and the NAR (it generates audio fine during evaluation with just the AR, but I might be neglecting something and need to read the paper again).
  • even for a specific voice, it took quite a while to get something even sounding like English (this is like an artifact from Tau Ceti V: https://static.wikia.nocookie.net/shodan/images/1/19/LOG0212-Strange_AI.ogg/revision/latest?cb=20190415140313).
    • iteration 2700: https://vocaroo.com/13yvPFSDOEFa
    • iteration 3150: https://vocaroo.com/1lAxf7e7zkOm
    • iteration 6300: https://vocaroo.com/12Dr0BKyJJ72
    • iteration 10800: https://vocaroo.com/1f0vftxBYHUi

As for my thoughts on TorToiSe after everything from this being a silly rentry to using it, to now:

  • zero-shot is definitely tied to how good the base model is, and the base model isn't quite up to par; if the base model is going to be replaced anyway, why bother sticking with TorToiSe when you can just devote the time and effort to a different system? The base model is still impressive, for what it's worth.
  • finetuning is a bit of an art, as I seem to get decent results with my go-to defaults, while it seems for others it's not up to par.
    • I want to say this is because I've effectively spent most of my existence the past month or so around this, but even then I am not an expert.
  • TorToiSe definitely has too many parts (tokenizer, AR model, CLVP/CVVP, diffuser, vocoder), and has a lot of bandaids to get it up to quality (smarter calculations for latents, kv_caching, bitsandbytes, resampling, BigVGAN, voicefixer).
  • desu I don't have any ill will toward the original dev, and I'm not even sure if I'm on his radar, but I'd rather not get on his radar by subjugating TorToiSe.
  • I'm starting to hit the limitations of finetuning the base TorToiSe model.
    • for non-English, a replaced tokenizer vocab is practically required for accuracy, and I have had terrible luck with a new tokenizer vocab.
      • for Japanese, I suppose I can just leverage the base vocab, romanize my inputs, and remove merges (see the sketch after this list).
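
As a side note on that romanization idea: one way to do it, assuming the pykakasi package (purely illustrative, nothing in the repo uses this):

```python
import pykakasi

# Hypothetical helper: romanize Japanese text so it can be pushed through the
# existing (English) tokenizer vocab, as floated in the list above.
kks = pykakasi.kakasi()

def romanize(text: str) -> str:
    return " ".join(item["hepburn"] for item in kks.convert(text))

print(romanize("日本語のテキスト"))  # prints a Hepburn romanization of the input
```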

Above all, I just hope VALL-E proves to be my magic cure-all and I can just set up a machine to train LJSpeech or a reduced LibriTTS dataset, and come back to it after quite some time has passed to get a good model. I honestly don't know how much steam I have left in me.


tl;dr: VALL-E soon, stay tuned.

mrq added the
enhancement
news
labels 2023-03-18 04:21:02 +00:00

no BitsAndBytes to save my hide, so it's quite the VRAM hog.

How bad is it? Is it still something that could run on HEDT graphics cards or should I be pricing out refab P40's on eBay?

Edit: Should rerunning setup-cuda.sh be sufficient to pull in whatever's required for VALL-E?

Author
Owner

How bad is it? Is it still something that could run on HEDT graphics cards or should I be pricing out refab P40's on eBay?

My batch size is pretty much pinned to 16 for my 2x6800XTs (2x16GiB) if I want stability. Granted, distributed training is different from DLAS, where DLAS will take your batch size and divide by GPU count, but DeepSpeed will use the batch size per GPU. I'm not sure of the bare minimum requirement, though.


Also, you can train half/quarter sized models with reduced parameters (https://github.com/enhuiz/vall-e/blob/f6c6df00b5db3262e04e11f37d67b27bdbf1cecb/vall_e/vall_e/__init__.py#L16) by specifying -half and -quarter in the model name (so ar-half/nar-quarter) for reduced VRAM requirements.

  • I'm starting to hit the limitations of finetuning the base TorToiSe model.
  • for non-English, a replaced tokenizer vocab is practically required for accuracy, and I have had terrible luck with a new tokenizer vocab.

Small improvement, but if you've already committed to relying on phonemizer then using it to generate the IPA vocab list from the training dataset is near trivial:

```
sneed@FMRLYCHKS:~$ (IFS='|';for phoneme in  `echo "Own a musket for home defense, since that's what the founding fathers intended. Four ruffians break into my house. "What the devil?" As I grab my powdered wig and Kentucky rifle. Blow a golf ball sized hole through the first man, he's dead on the spot. Draw my pistol on the second man, miss him entirely because it's smoothbore and nails the neighbors dog. I have to resort to the cannon mounted at the top of the stairs loaded with grape shot, "Tally ho lads" the grape shot shreds two men in the blast, the sound and extra shrapnel set off car alarms. Fix bayonet and charge the last terrified rapscallion. He Bleeds out waiting on the police to arrive since triangular bayonet wounds are impossible to stitch up. Just as the founding fathers intended." | phonemize  -l en-us --words-mismatch ignore  -p'|'  2>/dev/null`; do echo $phoneme;done)  | sed -E 's/$/,/gm' | sed -E 's/\s//g' | sort | uniq | tr -d '\n'; echo
,aɪ,aɪɚ,aʊ,b,d,dʒ,eɪ,f,h,i,iə,iː,j,k,l,m,n,oʊ,oːɹ,p,s,t,tʃ,uː,v,w,z,æ,ð,ŋ,ɐ,ɑː,ɑːɹ,ɔ,ɔː,ɔːɹ,ə,əl,ɚ,ɛ,ɛɹ,ɜː,ɡ,ɪ,ɹ,ɾ,ʃ,ʊ,ʌ,θ,ᵻ,
```

Edit: In the majority of cases phonemizer is just acting as a wrapper for libespeak-ng, so you could just call espeak_TextToPhonemes() yourself if you wanted.
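
For completeness, the same vocab extraction can be done from Python instead of shell, using phonemizer the same way (the transcript lines here are hypothetical stand-ins for a real dataset):

```python
from phonemizer import phonemize
from phonemizer.separator import Separator

# Hypothetical transcript lines standing in for a real training dataset.
lines = [
    "Own a musket for home defense, since that's what the founding fathers intended.",
    "Four ruffians break into my house.",
]

phonemized = phonemize(
    lines,
    language="en-us",
    backend="espeak",
    separator=Separator(phone="|", word=" "),
    strip=True,
    words_mismatch="ignore",
)

vocab = sorted({p for line in phonemized for word in line.split() for p in word.split("|") if p})
print(",".join(vocab))
```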


You're doing an amazing job. This is mainly beyond my understanding, but it's impressive stuff. Do you have a donation page or something?


I've tried training the unofficial enhuiz VALL-E implementation, but with my resources I wasn't going anywhere unfortunately, so I gave up.
Have you had any success in training it?

I think it is a shame, though, to abandon Tortoise; I've been experimenting with lots of TTS systems these past months and the quality of Tortoise is the best to me.
It has its problems: it's really slow, and it is unreliable/unstable sometimes, with very strange noises, repetitions, and occasional zero-second outputs. But it is very remarkable when it works well, the best by a mile among what I've tried.

I think we should 'just' find a way to fine-tune in a stable way without losing the zero-shot multi-speaker capability.

I have an idea, but I'm not that good at programming.
When we fine-tune, we lose the multi-speaker zero-shot capability and degrade the reliability of the original model, at least it seems so to me. In image generation I have seen this model called ControlNet, which allows conditioning on additional input modes other than text.
For example, you guide the image generation not only with the text prompt but also with whatever representation you want: a heat map, edge contours, etc.
They don't want to train a new high-quality generative text-to-image model; they want to leverage the established, high-quality Stable Diffusion model. They also don't want to fine-tune and unfreeze the Stable Diffusion weights, as this might lower the output quality, overfit to the small dataset, or increase instability in the output.
So they use a clever strategy: a hypernetwork (which is like a mirrored version of part of Stable Diffusion) whose activations are added to the activations of the Stable Diffusion model.
The Stable Diffusion model is frozen, and only the hypernetwork is trained.

ControlNet is just for diffusion image generation models, but in reality it proposes a new way of fine-tuning which should ease the process and make it stable, while retaining what the original model learned during the original training.
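
To make the idea concrete, here is a minimal PyTorch sketch of that frozen-base-plus-trainable-mirror pattern (a generic illustration, not wired into Tortoise; the block and dimensions are toy placeholders):

```python
import copy
import torch
import torch.nn as nn

# Generic sketch of the ControlNet-style idea: keep the base block frozen, train a
# copied "control" branch, and add its output back through a zero-initialized
# projection so training starts as a no-op and can't immediately degrade the base.
class ControlledBlock(nn.Module):
    def __init__(self, base_block: nn.Module, hidden_dim: int):
        super().__init__()
        self.control = copy.deepcopy(base_block)   # trainable mirror of the base block
        self.base = base_block
        for p in self.base.parameters():
            p.requires_grad_(False)                # the base stays frozen
        self.zero_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.zero_proj.weight)      # analogue of ControlNet's zero convolutions
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.zero_proj(self.control(x))

# Toy usage: the output is identical to the frozen base at initialization.
block = ControlledBlock(nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)), hidden_dim=64)
out = block(torch.randn(2, 64))
```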

It would be nice to apply this idea to Tortoise fine tuning.
Here's some reference: https://www.youtube.com/watch?v=fhIGt7QGg4w (this video talks about the more general idea behind ControlNet).
I hope I can spark the creative idea of someone more skilled than me.

Having said that, I'm very curious about Vall-E as well.


I'd like to advise against using https://github.com/enhuiz/vall-e and would rather propose to take a second look at https://github.com/lifeiteng/vall-e

The enhuiz implementation seems dead, the author is unresponsive, and going by the open issues there seem to be problems with the training process, with multiple people reporting it producing garbage results.
The biggest gripe here is that the author has gone completely silent on any queries or other questions regarding the implementation and has been seemingly absent for over two months.
The perceived quality of the code is irrelevant if we can't guarantee its correctness in the first place.

In contrast to this, the lifeiteng implementation seems to be actively managed, has the author chiming in on issues and discussions, and, most important of all, has been able to present some promising results so far: https://github.com/lifeiteng/vall-e/issues/53
Considering the lhotse + k2 + icefall dependencies, I agree, they are certainly cancer, but they are only used for the dataset preparation. I am sure it should be possible to reverse engineer the process enough to just be able to prepare our own datasets for the training process, instead of relying on the supplied recipes.

That being said, I managed to get the LibriTTS training process running on my WSL2 Arch Linux on a 3060 12GB (though it was only out of curiosity, so I never let it train for any amount of time), and the author managed to get promising results on the smaller dataset with only 8 hours of training on similar hardware.

As for Tortoise, it was a mixed bag for me. Finetuning refused to deliver any results, and the base model was able to produce promising results for some voices, but overly British accents or voices completely different from the source for others.
Overall I'd consider it a dead end, so I am happy research is going into other backends.

Author
Owner

The enhuiz implementation seems dead, the author is unresponsive

I'm no stranger to that, given how I'm pretty much fostering TorToiSe, whether I like it or not.

and going by the open issues there seem to be problems with the training process, with multiple people reporting it producing garbage results.

I think it's just chalked up to terrible defaults.

  • the default phonemizing process uses g2p-en to produce ARPABETs, which I imagine isn't as elegant as just using IPAs like the newer implementation (and what I prematurely swapped to anyways).
    • when indexing tokens, it's more than happy to use every word as a unique token if you don't split accordingly by spaces. I don't remember whether the default phonemizing process runs into this issue, but I know I did when I was careless with overriding it.
  • the optimizer defaults to Adam and the LRs aren't really sane.
    • I suppose this is evident with my three separate datasets hitting walls where the loss refused to go down anymore (especially evident with my provided example earlier not getting any better).
    • I've modified the defaults to align with the training configuration in DLAS used for the tortoise model (https://github.com/152334H/DL-Art-School/blob/master/experiments/EXAMPLE_gpt.yml#L52). I imagine if it's good for GPT-2, it should be fine for my cursory tests.
  • the max dataset text length value will cull a good portion of any decent dataset (see the sketch after this list).
    • the LJSpeech dataset was decimated from 15k lines to 2k, and there's nothing really telling you this is an issue outside of one print line among a bunch of noise.
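
As a quick sanity check for that culling issue, something like this makes the damage visible up front (hypothetical file layout and threshold, not the implementation's own loader):

```python
from pathlib import Path

max_phones = 100  # hypothetical cap standing in for the implementation's default
paths = list(Path("./training/valle/data").glob("*.phn.txt"))  # hypothetical layout

kept = sum(1 for p in paths if len(p.read_text().split()) <= max_phones)
print(f"kept {kept} / {len(paths)} utterances "
      f"({len(paths) - kept} culled by the max text length)")
```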

I should be fine after correcting these things. I imagine anyone that tried to use it fell into the nodev trap and assumed the defaults were sane (and desu I fell for it too, only because I made bad assumptions).

I am sure it should be possible to reverse engineer the process enough to just be able to prepare our own datasets for the training process, instead of relying on the supplied recipes.

From my cursory test, I'd rather not try again:

  • input text needs to be prepared as two separate, convoluted JSONs (I imagine lhotse is to blame).
  • then the JSONs need to be binned to training, dev (whatever the fuck that entails), and validation.
  • then phonemized (I think, I don't remember where it gets phonemized).
  • then gzip your datasets.
  • audio magically runs through what I imagine is either Fbank or Encodec, quantized, and exported into HDF5 (which I recall being a dependency hell ages ago).

Compare that to the first implementation, which just dumps things similar to what DLAS does: a bunch of files, and you parse the directory. Simple.

I could gut the newer implementation to have a simpler data loader, but I can't be assed.

I managed to get the LibriTTS training process running on my WSL2 Arch Linux on a 3060 12GB

About that. I got a similar error to the one I got with the newer implementation when trying to train with the first one (some assert and a CUBLAS enum thrown), but I tried again yesterday after doing some other things (I think using the torch 2.1.0 nightly worked), and from there it's been smooth sailing on a paperspace instance.

Although, the newer implementation refused to work on my 2x6800XT system somewhere along the pipeline (I already forgot what segfaulted), while the first one did, so even if the newer implementation is favorable, if I can't train locally, I can't back it.

And desu, the first implementation using DeepSpeed feels like it'll mature over time by itself with any changes to DeepSpeed, while the newer one is up to the owner. Although, the newer implementation does leave room for me to work my magic and inject BitsAndBytes. DeepSpeed allegedly has int8 quantizing, but I can't seem to find how to use it, if it's even for training.


As for Tortoise, it was a mixed bag for me. Finetuning refused to deliver any results

Ironically, I've only been getting decent voice-finetune results on my 2x6800XTs. I'm not sure if it's some inherent nature about multi-GPUs, or something between ROCm and CUDA, but whatever placebo it is, any of my future finetunes will have to be done on those and not a paperspace instance.

and the base model was able to produce promising results for some voices, but overly British accents or voices completely different from the source for others.

Yeah, the base model is too inconsistent for zero-shot. A specific subset of male voices will work fine, but everything else won't.

Overall I'd consider it a dead end, so I am happy research is going into other backends.

I just hope I can get results with VALL-E. I can sort of understand the lack of a generalized model, but I feel I'm once again the only shot at getting something cobbled together.

Author
Owner

I crammed BitsAndBytes into the first implementation (https://git.ecker.tech/mrq/vall-e) using a similar "injection" to what I did with the DirectML jerryrigging for CPU-only functions. In hindsight, I could have also used this method with DLAS, but oh well.

desu the gains aren't as large as adding it to DLAS, as I'm only able to slightly bump up my batch size from 16 to 18 before it gets unstable and occasionally OOMs on an A6000 (48GiB VRAM). I'm not sure why it spikes 4GiB of VRAM occasionally, or when it tries to save.

I can make some guesses as to why it's not a huge improvement, but oh well. Bit of a bummer it didn't drastically let me cram a larger batch size in.
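
For the curious, the "injection" boils down to swapping classes at runtime before the trainer constructs its model and optimizer; a stripped-down sketch of the idea, assuming bitsandbytes is installed (not the actual patch in the fork):

```python
import torch
import bitsandbytes as bnb

# Monkey-patch style injection: route AdamW to the 8-bit implementation and swap
# stock embeddings for bitsandbytes' StableEmbedding before the model gets built.
torch.optim.AdamW = bnb.optim.AdamW8bit
torch.nn.Embedding = bnb.nn.StableEmbedding

# Anything constructed after this point picks up the patched classes, e.g.:
model = torch.nn.Sequential(torch.nn.Embedding(256, 64), torch.nn.Linear(64, 64))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

As noted further down, most of this implementation's embeddings are custom modules, so the embedding swap alone doesn't buy much here.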


Got training to work on my 2x6800XTs with the newer implementation, and I'm a bit skeptical.

  • I think the defaults also aren't so sane, as the default settings only parse about 1300 lines out of the 15k in the dataset.
  • it does its own batch sizing with no way to set it.
  • it definitely refuses to utilize more than 3/4ths of my VRAM.
  • it refuses to do distributed training when launching through torchrun. I'm sure I can slap on a similar thing to what DLAS uses to force it onto the other GPU, but I cannot be assed to right now.
  • training loss keeps bouncing around 4.2, so I think it fried too.
Author
Owner

I think I got my ducks in a row with the first implementation (these were at 10000 steps, dataset size 55, batch size 1, defaults for all the other knobs, trained on my 2x6800XTs for I think an hour? an epoch can be ripped through in about 10 seconds):

  • reference: https://vocaroo.com/1lQlkrmlMOJx
  • AR eval'd: https://vocaroo.com/1axJfD7A24Df
  • NAR eval'd: https://vocaroo.com/1bixIsupVEZO

I realized a few things:

  • it'd be better to use a known working(ish) dataset, as this one is what a paperspace article (https://blog.paperspace.com/training-vall-e-from-scratch-on-your-own-voice-samples/) got some results from.
    • this setup script is included in my VALL-E fork just to quickly get it up and running.
  • it'd be easier to just try the defaults instead of being impatient and wrangling with playing with the dials.
    • too many changes is bad since I won't be able to narrow things down.
    • especially with replacing the optimizer and scheduler with "borrowed" copies from the newer implementation.
  • trying to train a full model is foolish; start with a quarter first for testing against as those are much, much easier to train (smaller, steps are faster).
  • do not neglect the NAR, as that's equally as used as the AR.
    • I think a lot of people also would train the AR first and get turned off from it not being up to par before training the NAR.
    • this issue is alleviated anyways now, since I can now train both at the same time.
      • need to benchmark how much of a VRAM hit this is, but for the quarter model, I imagine it's not that big of a hit.

It just seems odd though, there's definitely something off, especially:

  • how much wrangling I did the past few days trying to get something (although, that could just be because I was looking at the AR only).
  • batch size 1 seems very backwards, as finetuning has taught me to use as big a batch size as I can, and gradient accumulation to make up for what I can't (see the sketch after this list).
    • which is another odd thing: you can set your gradient accumulation to whatever you want, as it's just a counter for when to flush, rather than the very sensitive list DLAS expects.
  • the default scheduler uses WarmupLR, which seems entirely backwards from what I'm used to with LRs: it'll increase the LR over time. Although, there seems to be loss scaling and what-not, so it might all actually be scaled and the de facto LR isn't being reported.
  • VRAM use at batch size one with the quarter models is very small: both GPUs are sitting below 2GiB each. I suppose this is fine, as it would let anyone train this even at batch size 1.
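
To illustrate the "gradient accumulation is just a counter" point, a toy loop (hypothetical model and data, not the fork's trainer):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                       # hypothetical toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(16)]

accum = 2                                     # flush gradients every `accum` micro-batches
optimizer.zero_grad()
for step, (x, y) in enumerate(data, start=1):
    loss = nn.functional.mse_loss(model(x), y) / accum   # scale so the effective step is unchanged
    loss.backward()
    if step % accum == 0:                     # the "counter" that decides when to flush
        optimizer.step()
        optimizer.zero_grad()
```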

I'll just have to wait and see how things shape up with baking the model. If it turns out decent, great. I'll be comfortable with renting out a GPU to do bigger training on (or cave and buy a 4090, as the prospect of renting for pennies sounds worse than just splurging $1500 on another GPU).


I'll be comfortable with renting out a GPU to do bigger training on (or cave and buy a 4090, as the prospect of renting for pennies sounds worse than just splurging $1500 on another GPU).

There are cheaper ways to get 24GB of VRAM: https://www.ebay.com/itm/353109332585

Author
Owner

There are cheaper ways to get 24GB of VRAM

VRAM isn't my concern. In fact, I found both VALL-E implementations to be poor when it comes to VRAM. The one I'm backing just scales horribly (between an A6000 and an A100-80G, I could barely bump up the batch size), and the newer one never wanted to use more than 12GiB as it decides what batch size it wants.

I already have a collective 32GiB with my 2x6800XTs, so VRAM is most definitely not an issue. In the context of VRAM, a 4090 is a downgrade in capacity, and most definitely a Pascal card is a downgrade across the board.

It's just an idea I was floating about getting an actual ML card for improved throughput if I'm going even more balls deep into this, rather than reusing cards I incidentally have that incidentally do okay. P*p*rsp*ce took a massive dump on me this morning, so I'm skeptical of using it (or any rentals) anymore after being burned again.


Anyways, at step 27000 (after switching to bs=16, ga=2), the NAR sounds nearly the same as the reference: https://vocaroo.com/1jaGF5sduPQH. There's a bit of warble still, but I'm impressed. The AR still sounds iffy.


Step 40000 and the AR finally sounds better: https://vocaroo.com/18egrMwF6W4w

Still terrible, but it at least has audible speech.


@mrq what dataset are you using currently? I can try it on my system to double-check too if it helps.

Author
Owner

what dataset are you using currently?

Some ten hours of some LibriTTS data labeled LibriSpeech-Finetuning that I nabbed off some P*p*rsp*c* article about VALL-E, except it includes everything in the archive and not just the 9h subset. The link to it is under my VALL-E fork repo in ./scripts/prepare_librispeech.sh.

I can try it on my system to double-check too if it helps.

If you got a few days to kill, go right ahead. I have a small repo on HF with the data already quantized and phonemized to avoid going through the hoops of my rather-fragmented preparation process.

With your current working directory set to your ai-voice-cloning folder:

  • source ./venv/bin/activate
  • git clone https://git.ecker.tech/mrq/vall-e ./modules/vall-e/
  • pip3 install -e ./modules/vall-e/
  • git clone https://huggingface.co/datasets/ecker/libritts-small ./training/libritts-small/
  • modify ./training/libritts-small/config.yaml to your liking
  • set env vars with:
    • CUDA: export CUDA_HOME=PATH_TO_YOUR_CUDA
      • you might not need to do this for CUDA if you only have one CUDA version installed (for example, only /usr/local/cuda-11.8/ or something). For Docker images with CUDA 11.6 where you install cuda-nvcc-12.0 or something, you'll need to point to the newer one.
    • ROCm: export ROC_HOME=PATH_TO_YOUR_ROCM
      • you WILL need to do it under ROCm, because it can't be assed to infer the location right and defaults to a nasty bin/hipcc instead of /opt/rocm/
  • start training with deepspeed --module vall_e.train yaml='./training/libritts-small/config.yaml'
  • and wait, if there's no errors

I restarted training two nights ago and fiddled with some more settings yesterday, so progress restarted, as I didn't trust the initial dataset to be "right" in the sense of using the entire dataset optimally.

I also manually validated whether BitsAndBytes was even working (it's not).

  • Most of the embeddings aren't actually using torch.nn.Embedding, but rather custom ones that inherit from torch.nn.Module.
  • Slotting out torch.nn.Linear for bnb.nn.Linear8bitLt causes errors (not surprising, since this isn't integrated with DLAS, naturally).
  • The 8-bit Adam "works", but offers little VRAM gain compared to DeepSpeed's fused AdamW; in fact, 8-bit Adam actually performs worse because of explosive gradient norms, despite clipping.
    • The models being a fraction of the size of TorToiSe's AR is probably why there's not much of a VRAM saving.

So I'm stumped with BitsAndBytes. I can take more cracks at it later today, but even DeepSpeed's weight quantization doesn't give consistent VRAM savings (sometimes my GPUs will sit at 12GiB, and other times at 15).

I will admit I did cave and get a 4070Ti, as:

  • some testing with one still had it offering a rather sexy throughput uplift over my 2x6800XTs (naturally).
  • I can't justify paying more than double for a 4090 that I still have to wait a week for it to arrive.
  • it's only a neg $800+tax+tip, not a big deal for cutting my training by 4X and my power draw in half.
  • I want to play around with bfloat16 and hardware int8.
  • renting is not good value at all, and it's a pain wrestling with Docker containers.

The only caveat is:

  • 12GiB is small. A little too tight.
    • properly implementing quantization should help, but I worry about it causing issues when scaling up to a full-sized model.
Author
Owner

I'm stupid. To spare the gory details:

  • BitsAndBytes obviously won't work on everything, as the inputs are already ints; I don't know what brain worms made me forget this
    • which begs the question of how it works fine with DLAS/TorToiSe
    • float32/float16->int8 quantization is only really nice for model weights (and the models are already somewhat small) and optimizer states (which depend on model size)
  • all input tensors are naively int64s
    • in theory, even just reducing to int32s will save 2x the VRAM consumed solely by parsing the datasets, and reducing to int16s will save 4x the VRAM consumed (see the sketch after this list)
      • some extra hoops are needed since they're treated as indices, but upcasting works ok for when they need to be index tensors
      • text tensors could very well be int8s, since I don't think I've had my token index cross index 105
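
A toy illustration of the dtype point above (illustrative sizes, not the fork's actual tensors): store the token sequences narrow and upcast only when they're used as indices.

```python
import torch

text_tokens = torch.randint(0, 105, (1, 256))             # int64 by default
stored = text_tokens.to(torch.int16)                       # 4x smaller than int64
print(text_tokens.element_size(), stored.element_size())   # 8 vs 2 bytes per element

emb = torch.nn.Embedding(256, 64)
out = emb(stored.long())                                   # upcast back when used as indices
```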

Although:

  • VRAM use keeps creeping up
  • multi-GPUs might actually not work, as it seems I have the same throughput with one card compared to two, unless I'm getting metrics reported wrong.
  • I'll probably have to scrap my training again.

Hello, I don't know if it's of any concern, but someone on the newer repository uploaded a trained model as detailed in this thread: https://github.com/lifeiteng/vall-e/issues/58

I managed to download it and ran some of my own tests, which I wanted to share in case it's of any interest.

  • Solid Snake: (Source) https://vocaroo.com/17lhindJicrD (Result) https://vocaroo.com/1b3Cqih5Lgtb
  • Exdeath (FF5): (Source) https://vocaroo.com/1fMlLIejplOt (Result) https://vocaroo.com/1oKsiTHfuTPH
  • Jecht (FF10): (Source) https://vocaroo.com/1jMNgguHCZE8 (Result) https://vocaroo.com/1zqjXWgg7Fzs
  • Vile (Megaman X): (Source) https://vocaroo.com/19KuGc5bMtd1 (Result) https://vocaroo.com/19J6GpGavMxy

From a first glance it seems to be running even slower than Tortoise.
Of my samples, only the Snake one seems to match the speaker's voice, and even then he seems a bit too... jolly?
The others don't fit the speaker's voice at all.

Sadly not the silver bullet I was hoping for, but I guess it all depends on what's in the model again.

Author
Owner

Hello, I don't know if it's of any concern, but someone on the newer repository uploaded a trained model as detailed in this thread: https://github.com/lifeiteng/vall-e/issues/58

Neato.

550 hours
100 epochs
8xA100 for 4 days

Yeesh. I'll be a pessimist and assume (cope) that a lot of that time was just bruteforcing through unfavorable conditions with (most likely) zero optimizations:

  • full sized model (1024 dim, 12 layers, 16 heads)
    • I suppose it's fine at the end, but it definitely will decimate throughput by like, 3x?
    • the paper calls for it anyways, so I don't blame them
  • input tensors as int64 for absolutely no reason
    • definitely will eat up throughput from just moving the training data around
  • maybe naively prepared training audio?
    • I say maybe, since I don't know the full breadth of lhotse/k2/icefall, but I can't imagine it being anywhere near parity to how I'm preparing them (which is at near parity to painstakingly manually preparing it)
    • I don't think I have any good guesstimates on throughput from a naive dataset vs a properly prepared one, but I imagine it adds up in the long run
  • the optimizer/scheduler that implementation uses is a little sussy, but I don't have any concrete metrics comparing between them
    • slapping ScaledAdam and Eden into my fork didn't seem to have it perform any better in terms of the long haul
    • desu the default optimizer is needed for DeepSpeed's ZeRO, and its scheduler is fine it seems.

I feel like it has a similar problem to the first implementation: they're made by grad students with lab rigs who only know ML and nothing else. Don't get me wrong, they know it better than me, but they're not pragmatic (for lack of a better term) about how they go about it. I just can't really place my trust in either implementation after seeing the warts.

  • my excuse for being stupid is that I made blind assumptions, to be desu.
  • it also makes me wonder what other warts are in DLAS.

I managed to download it and ran some of my own tests, which I wanted to share in case it's of any interest.

Thanks, I can't be assed to try and pick apart how to use the newer implementation for a third time for cursory tests.

I'm a little impressed by its results, a very small little. The model itself definitely isn't a TorToiSe replacement, but it at least shows it can provide something. My only concern is that, with how few actual moving parts are in it, there wouldn't really be any room for bandaids like there is with TorToiSe.

There's something off about it outside of the audio quality, wrong pitches, and I suppose the general tone. I can't quite put my finger on it. I wonder if it's an issue with how the phonemes are processed, as I think it's only using the base settings for phonemizer (no stress symbols, no spaces). It sort of sounds like what https://ipa-reader.xyz/ spits out.

but I guess it all depends on what's in the model again.

Most definitely.

For zero-shot inferencing applications, diversity (ick) is a HUGE factor in having a good model. There's only so much data to sample from when trying to mimic voices. I worry that when I finally nail training a small model, I'm going to be in a world of hurt trying to pick out every piece of clean audio I can get and adequately process it (although the web UI is rather decent at that now). The magic of TorToiSe is in the dataset it was trained against, as its author mentioned not using just audiobooks (a bit ironic, since it still has its biases in how well its zero-shot inferencing performs).

I think the other issue is that, depending on how conforming that implementation is, the paper calls for using only three seconds of audio for zero-shot inferencing. I think I saw some commits about different "prefix" modes (I think an analog to how I was changing how I'm computing latents, and the 152334H fork having its different settings for it too), so it might do better with more than three seconds to play with.

However. TorToiSe has definitely shown that finetuning a model that's semi-semi-competent is viable. I couldn't care less about a "10/10 amazeballs" model at zero-shot when you only really need a semi-semi-decent model to finetune from. That's more of what my goal is/should be: just get a good starting point so people can finetune off of it.

Author
Owner

I suppose one last thing before I go back into my hole for another few days: training it is shaping up to be a real bitch and a half. I suppose it's only natural, given it's a language model. I'm doing some sussy "warmup then decay with restarts" schedule to quickly train it, rather than painfully bruteforcing it with the LR decay that both the newer implementation and DLAS/TorToiSe use (for finetuning, at least).


Rereading the paper while trying to procrastinate sleeping, and there are some things that would have been nice to have disambiguated, rather than inferred from both implementations. The original VALL-E:

  • dataset is 60k unlabeled hours of LibriLight
    • the portion about 960 hours was just for training the ASR used to transcribe the 60k hours
    • average segment length was one minute
    • 7000 unique speakers
  • original VALL-E model is 12 layers, 16 attention heads, embedding dim of 1024, feed-forward layer dim of 4096, dropout of 0.1
  • during training (input wise):
    • each line for the AR was randomly cropped between 10 and 20 seconds (phoneme aligned)
    • each line for the NAR was the above, but 3 seconds
  • for training (hyperparameter wise):
    • 16xV100s
    • batch size of 6000 tokens, per GPU
      • in reality I should probably use a token-total approach for batching to make things consistent, this is probably what the newer implementation does
    • 800,000 total iterations
      • can't really give an epoch equivalence, the batch is on a token-count-basis
    • AdamW
    • LR warmup to 5.0e-4 for 32,000 steps, then linear decay

The training process seems fairly simple; it's just that the first implementation does it the rather classical way of training by batches, but I'm not sure how much that'll be a problem. I worry I need to revamp both the transcription process and the dataloader to replicate the original paper better.
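
For reference, a rough sketch of the schedule described above (AdamW, warmup to a peak LR of 5.0e-4 over 32k steps, then linear decay); decaying to zero at the 800k-step mark is my assumption, since the paper only says the LR is linearly decayed.

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the actual AR/NAR model
optimizer = torch.optim.AdamW(model.parameters(), lr=5.0e-4)

warmup_steps, total_steps = 32_000, 800_000

def lr_lambda(step: int) -> float:
    # linear warmup to the peak LR, then linear decay for the rest of training
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# per iteration: loss.backward(); optimizer.step(); scheduler.step()
```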

Author
Owner

Progress report for whoever cares about my descent:

  • scrapped my -quarter tests; it's not worth the time testing a gimped model if I want to validate that I can even get decent results with this implementation.
    • a full sized model definitely shows that DeepSpeed's weight/activation quantization actually has an effect: less VRAM used for no perceptible performance hit.
    • I can confirm I can actually get decent results, at least given the evaluation output from a much narrower dataset.
    • throughput is reduced by 4x, but I feel it definitely helps with training, especially something astronomical like language.
    • I might even not conform to the implementation and try and use the specs for TorToiSe's AR (which actually just seems to be more layers, 12 => 30).
      • I think TorToiSe does a lot more to condition the inputs anyways, so it'd be silly to just add moar layers. I do have other ideas...
  • my "goals" are all over the place, so I'm focusing strictly on getting something that works, so I'm revisiting the initial test of just using one voice on a small dataset to get something working (especially after my modifications to the implementation):
    • somehow, I ended up using Joe Rogan, because somehow I have samples of him, and he seems to be the best fit for a test.
      • I think I had samples from him for testing TorToiSe's zero-shot, since I at least remember having impressive results from no finetuning.
      • my other datasets are Japanese, a pain to transcribe and slice from one whole file, or too big.
    • 4500 iterations (batch size 4) yielded relatively decent results:
      • AR eval: https://vocaroo.com/168qZvS0dkcb
      • NAR eval: https://vocaroo.com/1m4sKAxhmD1U
    • 5500 iterations:
      • AR eval: https://vocaroo.com/1bwwyG9rRZVG
      • NAR eval: https://vocaroo.com/12zC57mOUZSi
    • 8000 iterations:
      • AR eval: https://vocaroo.com/1983GZIY3ZIr
      • NAR eval: https://vocaroo.com/1cz5yKQZRj85
    • I'd provide the graph for posterity, but the plots are too noisy.

My thoughts and concerns:

  • after fixing things and getting a much better grasp (one that doesn't seem to waver every day), I think it really just amounts to some patience, as in my first tests I lacked patience (and a 4070Ti definitely helps with my lack of patience).
  • you can definitely hear the issues that stem from it using Encodec:
    • to my understanding, TorToiSe at least has the luxury of errors getting "smoothed out" from the mel token => mel spectrogram => vocoder passes, while VALL-E will need to be very precise.
    • it captured the chess piece sound, not surprising. I guess it just shows that the model is getting very overtrained, as the validation output sounds terrible.
  • It's still pretty hard to trust the reported losses during training, and the patchworked losses returned during evaluation/validation (calculating a loss by comparing the actual audio sounds better on paper, but it requires trimming to match, and that alignment can very well be off).
  • The accuracy metric I really can't use as a metric; it reports a higher accuracy on the AR when it sounds completely unusable, but reports a lower accuracy for the NAR when it sounds """better""".
  • I do not think that training models from scratch on a narrow dataset is a good idea for actual use, as there's a very, very fine line between it being okay and it being decent but overfitted.

Some whacky ideas I'll spitball that I probably won't do:

  • going back to what I think the newer implementation actually does: modifying the dataloader to batch by token count rather than a naive batch size, in the hopes of making VRAM use consistent and not leaving some batches under-utilized (see the sketch after this list).
  • transplanting shit from that trained model from the newer implementation into a model with the first implementation. It should be possible, but I don't think it'd be all that helpful.
  • patchworking DLAS into being able to train a VALL-E model
    • shouldn't really need to entertain this idea desu, as the first implementation is fine using DeepSpeed.
  • patchworking DLAS to replace the wav => mel spectrogram => VQVAE => mel token pass with a wav => encoded Encodec quantization blah blah, effectively being able to supplant DLAS too
    • again, shouldn't really entertain this bad idea, as it'd just be a huge nightmare to get working for no real gain.
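
A rough sketch of that token-count batching idea (the `lengths` list here is hypothetical per-utterance token counts, and the 6000-token budget is borrowed from the paper): fill a batch until the token budget is hit instead of using a fixed entry count.

```python
import random
from typing import Iterator

def batch_by_tokens(lengths: list[int], max_tokens: int = 6000) -> Iterator[list[int]]:
    """Yield lists of sample indices whose summed token counts stay under a budget.

    Sorting by length first keeps each batch roughly homogeneous, so padding
    (and therefore VRAM use) stays predictable.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batch, budget = [], 0
    for i in order:
        if batch and budget + lengths[i] > max_tokens:
            yield batch
            batch, budget = [], 0
        batch.append(i)
        budget += lengths[i]
    if batch:
        yield batch

# usage: pass the yielded index lists to a DataLoader via its batch_sampler argument
batches = list(batch_by_tokens([random.randint(100, 2000) for _ in range(64)]))
```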

Roadmap / Goals:

  • use this test model to do inference tests, and slap inferencing support for VALL-E into the web UI.
  • fiddle with seeing how well "finetuning" from one voice to another fares.
  • prepare and train on LJSpeech, again (24 hours and one speaker should be better than 10 hours and I think 7 speakers strictly from a learning standpoint).
  • finalize a guide and routine into training with VALL-E.
  • depending on things:
    • if "finetuning" proves to be adequate, finetune the LJSpeech model with a much larger/diverse dataset (LibriTTS/LibriSpeech/LibriWhatever, I think the 60 hour one with a lot of different speakers) for a zero-shot oriented model.
    • if "finetuning" doesn't prove to be adequate, experiment with implementing VALL-E X with an English + Japanese dataset (I have some ideas on how to implement it).

I just hope that things are smooth sailing from here on out now and I can use the time waiting to train to finally relax.


The Japanese TorToiSe model is really cool. Would VALL-E X provide better results?

Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts.

Reading through the VALL-E X examples, it seems to be able to seamlessly switch between English and Mandarin, while preserving accents.

Does this mean that we could do something like train against Joe Rogan in only English, then have him speak in fluent Japanese?

Is the VALL-E implementation you are working on capable of Japanese?


For zero-shot inferencing applications, diversity (ick) is a HUGE factor in having a good model. There's only so much data to sample from when trying to mimic voices. I worry that when I finally nail training a small model, I'm going to be in a world of hurt trying to pick out every piece of clean audio I can get and adequately process it (although the web UI is rather decent at that now). The magic of TorToiSe is in the dataset it was trained against, as its author mentioned not using just audiobooks (a bit ironic, since it still has its biases in how well its zero-shot inferencing performs).

Roadmap / Goals: prepare and train on LJSpeech, again

Forgive me for butting in, but how come you haven't worked on building a more varied dataset then? There are hundreds of hours of video game dialogue & podcasts available for you to build a more diverse dataset from, not to mention other varied audio datasets that could be included.

This issue reminds me of an LLM paper I had seen here https://arxiv.org/abs/2203.15556, which seems to coincide with the dataset claims TorToiSe makes, and your woes. I think it would be worthwhile to try scaling your dataset size instead of trying to scale your model size in its place? I would test the theory myself, but I lack the hardware that would actually be suitable for training, so feel free to call me out on it.


Forgive me for butting in, but how come you haven't worked on building a more varied dataset then? There are hundreds of hours of video game dialogue & podcasts available for you to build a more diverse dataset from, not to mention other varied audio datasets that could be included.

The Mozilla Common Voice Dataset (https://commonvoice.mozilla.org/en/datasets/) is over 3000 hours and CC licensed. Podcasts, which one might have to transcribe (or at least proofread) manually, aren't a wise use of limited developer time by comparison.

Author
Owner

The Japanese TorToiSe model is really cool. Would VALL-E X provide better results?

Hard to say.

I feel whatever base VALL-E puts out for Japanese is an indicator of how well VALL-E X will perform, as the only difference in implementation between the two would be annotating language during training, be it with additional tokens or another input. I'm not too sure how I would go about it, as there's no existing implementation for me to leech draw inspiration from.

Reading through the VALL-E X examples, it seems to be able to seamlessly switch between English and Mandarin, while preserving accents.

Very. I'm more impressed with the VALL-E X demos than with base VALL-E's demos.

Does this mean that we could do something like train against Joe Rogan in only English, then have him speak in fluent Japanese?

mmm. Should be. I imagine for testing a VALL-E X implementation, I would source a Japanese speaker that sounds like him and train against both of them. The limitation is getting a soundalike.

The magic of VALL-E X is being able to sample speech against an acoustic prompt (voice latents, in TorToiSe terms) similar to your subject. That's sort of what LM-based voice cloning does. I imagine the secret fixin is just providing tokens for language (like a start/stop token, but start-English/stop-English, start-Japanese/stop-Japanese), and the ability to have multi-lingual speech synthesis is just an emergent property of basing it on an LM, or some gobbledygook.
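
Purely as a guess at what that would look like (none of this exists in either implementation; the vocabulary size and token IDs below are made up): reserve a few token IDs past the phoneme vocabulary for language tags and prepend them during dataset preparation.

```python
# Hypothetical token layout: language tags live just past the phoneme vocabulary.
PHONEME_VOCAB = 256  # assumed phoneme token space
LANG_TOKENS = {"en": PHONEME_VOCAB + 0, "ja": PHONEME_VOCAB + 1}

def tag_language(phoneme_ids: list[int], lang: str) -> list[int]:
    """Prepend a language token so the model can condition on the target language."""
    return [LANG_TOKENS[lang]] + phoneme_ids

print(tag_language([12, 44, 7], "ja"))  # [257, 12, 44, 7]
```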

Is the VALL-E implementation you are working on capable of Japanese?

Mhm, should be. The only hurdle is trying to mend the phonemizer to work on Japanese again, as I remember it breaking at 2AM and I couldn't be assed to bandaid the phonemizer.


Forgive me for butting in, but how come you haven't worked on building a more varied dataset then?

I will, don't worry. I just need to babystep through this and not throw in too many variables. My crux with my earlier tests was not having a clear focus on how I should go about testing and experimenting.

I'll probably get to sourcing a master dataset while I'm training models for the final step. For now though, I need narrower datasets for tests to ensure things can in fact scale up with the implementation before sinking in so much time for a bunk model.

I think it would be worthwhile to try scaling your dataset size instead of trying to scale your model size in its place?

For zero-shot inferencing, of course a large/diverse dataset is necessary, but that won't do any good if you don't have a model big enough to learn all of it. I found the quarter sized one to cap out and lack the capacity to really learn anymore without a painfully astronomical training time to bruteforce it, if time would solve it.

I would test the theory myself, but I lack the hardware that would actually be suitable for training, so feel free to call me out on it.

That's definitely a worry I have that would "filter" a lot of people trying to roll out their own models. VRAM is no issue, as with enough optimizations, I can have a full sized AR and NAR and wiggle room on 12GiB, but the crux is a tradeoff between compute time and compute throughput; I can have all the speediest VRAM in the world, but it's no good if I can't use it fast enough. I have relatively the same throughput at bs=4 as I do bs=8 anyways.

And you can only really get good compute with Ada cards (4070Ti and up) or multiple Ampere cards. There's always """r*nting""", but it's not a good value proposition, at all, especially for testing.


The Mozilla Common Voice Dataset is over 3000 hours and CC licensed

86,942 voices

How convenient. I want to believe, for zero-shot inferencing, more speakers is better than more hours, so this is probably a great way to get varied voices.

Podcasts, which one might have to transcribe (or at least proofread) manually, aren't a wise use of limited developer time by comparison.

I feel any dataset is going to take the same amount of time to transcribe and validate desu. I can't really re-use existing transcriptions, as:

  • I would need to do wav2vec2 alignment on the transcription text, and do my own segmenting to pare down the audio. I suppose it's simple, as I can maybe subjugate whisperX to do just that, but I think rolling my own implementation would have an accuracy hit.
  • using the input sound as-is will pretty much force my hand to implement token-count-based batching rather than by-entry batching, to get consistent VRAM use. I found not segmenting my dataset makes it vary even harder and makes it very prone to OOMing (yes, I know moar VRAM would solve this, but the issue remains that VRAM use is inconsistent).

It pretty much took 5 hours last night to re-transcribe LJSpeech in WhisperX large-v2, and probably an extra two this morning to quantize and babysit the phonemizing process (for a valid but still God forsaken reason, phonemizer will make a temp copy of the espeak lib on every call and only cleans it up when the process closes, so it'll crash after X amount of phonemizings). I suppose I could get a better transcription process, but WhisperX is probably the best I'll get.
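
The bandaid I'd reach for (a sketch only, not what the repo currently does): phonemize in chunks inside short-lived worker processes, so espeak's temp copies get released every time a worker exits.

```python
from multiprocessing import get_context
from phonemizer import phonemize

def phonemize_chunk(lines: list[str]) -> list[str]:
    return phonemize(lines, language="en-us", backend="espeak", strip=True)

def phonemize_all(lines: list[str], chunk_size: int = 256) -> list[str]:
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    results: list[str] = []
    # maxtasksperchild=1 forces a fresh worker per chunk, so the temporary espeak
    # library copies get cleaned up as each worker process exits
    with get_context("spawn").Pool(processes=1, maxtasksperchild=1) as pool:
        for out in pool.imap(phonemize_chunk, chunks):
            results.extend(out)
    return results
```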


I would need to do wav2vec2 alignment on the transcription text, and do my own segmenting to pare down the audio. I suppose it's simple, as I can maybe subjugate whisperX to do just that, but I think rolling my own implementation would have an accuracy hit.

I've had outstanding results with WhisperX once I started running it with --align_model WAV2VEC2_ASR_LARGE_LV60K_960H. The downside is that it doesn't support many languages out of the box (but Japanese is one of them, IIRC).

However, I don't think you need to bother with that, because in the sample I downloaded it's already segmented. I grabbed the latest delta of the Indonesian corpus (only 110 MB) and the longest clip is only 10 seconds.


And you can only really get good compute with Ada cards (4070Ti and up) or multiple Ampere cards

I mean if that's the case I'll have a second 3090 with NVLINK sometime next month, so maybe that'll make the difference.

painfully astronomical training time to bruteforce it, if time would solve it.

In regards to that paper, it basically showed that most LLMs were underfitted; the cure was more data and training at the same model size. It's probably going to be a necessity given the results, so maybe it's going to be more beneficial to optimise the training itself before committing to the training.

Author
Owner

I've had outstanding results with WhisperX once I started running it with --align_model WAV2VEC2_ASR_LARGE_LV60K_960H. The downside is that it doesn't support many languages out of the box (but Japanese is one of them, IIRC).

Yeah, I have that model load for English only just so I can try and get something together when reintegrating back to the web UI. I should make it another option, but I can't be assed to at the moment.

It's enough of an improvement getting everything together + VAD filtering that I can finally rely on it for transcription after being scarred.


I'll have a second 3090 with NVLINK sometime next month, so maybe that'll make the difference.

A single 3090 allegedly has similar throughput to a 4070Ti, but I haven't validated it myself after all the optimizations I cobbled together. It just feels like I'm having nicer throughput with my 4070Ti over using a P*p*rsp*c*e A100-80G before they fucked me in the ass again.

the cure was more data and training at the same model size. It's probably going to be a necessity given the results

Ah, I guess that would explain some things. Pretty much most of my tests seemed to land in the worst spot: too big to overtrain and get results like the Joe Rogan tests, but nowhere near large enough to keep it from grossly underfitting. The LJSpeech one seems to be going a little better than the LibriWhatever subset I had over the weekend, but doesn't seem to be improving any more.

If the valley between the "too little and you overtrain" and the "not enough and you underfit" is just that large, I suppose I'll start working towards getting a ginormous dataset together.

so maybe it's going to be more beneficial to optimise the training itself before committing to the training.

I'm pretty much all out of my magic tricks to make training better.

Well, there's reusing/sharing the text / acoustic prompts embeddings between the AR and NAR, but that doesn't seem like it's much of an issue for training.


I suppose I'll start working towards getting a ginormous dataset together.

What would such a dataset entail?

If you don't shy away from using materials with legally uncertain licensing, then I'd happily contribute my growing collection of video game character samples (extracted straight from game files) + transcriptions.

Author
Owner

What would such a dataset entail?

Not too sure. It'd probably be a mix between:

  • some already open speech collections.
  • samples of decent quality from the list of sample collections I have on the wiki (https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Collecting-Samples#sourcing).

My only qualm with using sources from the collections on the wiki is that almost all the voices there are one single file, so transcription/segmenting will be a bit of a pain.

If you don't shy away from using materials with legally uncertain licensing

Of course not, I'm fine using whatever. It's just slightly more convenient to use "open" speech corpora as they're usually cleaned up well enough.

then I'd happily contribute my growing collection of video game character samples (extracted straight from game files) + transcriptions.

Sure.

Author
Owner

I'm so mad. I had a decently lengthy followup, but because I used a stupid emoji that the text entry field suggested, it ate it all up.

Pah, oh well. It was mostly outlining a path I should take by using actual audio from vidya rather than audiobooks, as audiobooks have the inherent problem of not being varied enough.


I should take by using actual audio from vidya rather than audiobooks

Vidya also has the added benefit of coming already segmented in small 2-5 second chunks, as well as being clean studio recordings.

Plus, a lot of them can be easily extracted from game files as well.

Author
Owner

Progress report:

Inferencing with VALL-E is integrated into the web UI, with some warts.

  • not very many knobs to play with, only a maximum step count and sampling temperatures. I'm sure I have to implement more tunables myself.
  • the included inferencing code seems sussy, as it does an einops rearrange that isn't present for the evaluation/validation step.
  • I suppose naturally as well, the inferencing output is nowhere near as nice as the evaluation output and even the validation output. I say naturally, because inferencing will pass the output from the AR into the NAR, while evaluation/validation will just pass the prompts into each model separately.
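
To illustrate the difference (with hypothetical `ar`/`nar` callables rather than the fork's actual API): inference has to chain the AR into the NAR, while the eval/validation path scores each model against the ground truth on its own, so AR mistakes never reach the NAR there.

```python
# Hypothetical sketch; `ar` and `nar` stand in for the actual models.

def inference(text, acoustic_prompt, ar, nar):
    # the first codebook level comes from the AR; the remaining levels come from
    # the NAR, conditioned on whatever the AR produced (errors and all)
    first_level = ar(text, acoustic_prompt)
    remaining_levels = nar(text, acoustic_prompt, first_level)
    return [first_level] + remaining_levels

def evaluation(text, acoustic_prompt, ground_truth_codes, ar, nar):
    # each model gets clean inputs independently, which is why eval output can
    # sound better than actual inference output
    ar_out = ar(text, acoustic_prompt)
    nar_out = nar(text, acoustic_prompt, ground_truth_codes[0])
    return ar_out, nar_out
```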

Training-wise, I found it's "relatively" """easy""" to """""add""""" additional voices with an existing dataset on an existing model, and I imagine finetuning a single voice might work nicer. I'm doing some funny shenanigans by quickly warming up the LR to 2.0e-4 for an epoch's worth, then decaying down to 1.0e-6 over 1000 iterations (which, with my last dataset, worked out to about 9 epochs).

  • I add the emphasis quotes, because the model is still overfitting. It definitely is showing it's replicating the new voices from the eval/val output, but still not a quality improvement.

I might need to continue the above procedure by adding in more and more voices to the dataset to find a point where it'll stop overfitting. I'm just worried how many voices I'll need, and I worry about risking frying everything by doing these pseudo-restarts too much.

Aside from that, there's not really much else for me to do besides re-read the VALL-E paper / both implementations while I keep baking models and shoving in more voices, since there's some things I just don't quite "get".

  • I'm still not too sure how the acoustic prompts are provided. I see in both implementations they're derived from the quantized audio itself (the Encodec tokens), but I'm not sure how/why that gets results. I suppose it's feasible to do so that way, as you're effectively quantizing a voice anyways with Encodec, and the range in values for the token only goes up to 1024, so something like representing traits of a voice can emerge with such a low resolution. But I am not an expert.
  • the paper reiterates the "three-second enrolled recording", but I'm pretty sure the implementation I forked takes no care of that, and I don't believe the newer implementation does so either. I suppose the paper calls for it to boast how you can clone a voice with only three seconds, but I imagine longer inputs would be better, which is why neither VALL-E implementations added it in.
  • I'm still iffy on the manner in which the losses are computed, as the implementation I forked will report a loss labeled as an NLL loss (https://pytorch.org/docs/stable/generated/torch.nn.functional.nll_loss.html), but it was actually using cross entropy (https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html). This loss value tends to be lower than the reported loss value from DeepSpeed itself, so I guess there's no harm if it's overcompensating rather than undercompensating with the lower loss value.
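
For what it's worth, the mislabeling is mostly cosmetic: cross entropy is just NLL applied to log-softmaxed logits, so the two agree when computed on the same logits.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 1024)           # [batch, num_classes]
target = torch.randint(0, 1024, (8,))   # class indices

ce  = F.cross_entropy(logits, target)
nll = F.nll_loss(F.log_softmax(logits, dim=-1), target)

assert torch.allclose(ce, nll)  # identical up to floating-point noise
```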

Oh well. I shouldn't sweat over the issues so much; as long as the evaluation/validation output sounds fine during training, then I can just keep training a model and eventually resolve any issues with inferencing. It'd be nice if I had someone to come swoop in and call me a dumbass and other quasi-derogatory-slurs for neglecting details, but I imagine there's no one else that really has an idea about VALL-E outside of the people behind the M$-backed paper, and the two implementation writers, all of whom seem to be behind a prohibitive language barrier (or, in the case of the one I forked, unresponsive). I'm in no rush to nail out a working model, after all.


I imagine my issues, once again, stem from an extremely small dataset size. As mentioned:

In regards to that paper, it basically showed that most LLMs were underfitted; the cure was more data and training at the same model size. It's probably going to be a necessity given the results

And the paper mentions it several times:

Compared to previous TTS training datasets, such as LibriTTS, our data contain more noisy speech and inaccurate transcriptions but provide diverse speakers and prosodies. We believe the proposed approach is robust to the noise and generalize well by leveraging large data. It is worth noting that existing TTS systems are always trained with dozens of hours of single-speaker data or hundreds of hours of multi-speaker data, which is over hundreds of times smaller than VALL-E
We build a generalized TTS system in the speaker dimension by leveraging a huge amount of semi-supervised data, suggesting that simple scaling up semi-supervised data has been underestimated for TTS.

I suppose the strength of VALL-E is that given an extremely large dataset (60k hours, 7000 speakers), the LM properties it boasts over other TTS systems emerge. So I guess I'll just keep shoving in more and more and more and more and more data.

I'll need to come up with a way to try and crowdsource a bunch of data then. Even trying to come up with what vidya to source my voices from is leaving me feeling overwhelmed already (both from even trying to think of where to pool from, and storing it all).


Bear in mind, I am merely a weeb with no knowledge whatsoever...

With that being said...

tldr You should enforce high example standards, and pool the efforts of the community/other people, and use anime transcriptions instead of/in addition to vidya

> I suppose the strength of VALL-E is that given an extremely large dataset (60k hours, 7000 speakers), the LM properties it boasts over other TTS systems emerge. So I guess I'll just keep shoving in more and more and more and more and more data.

> I'll need to come up with a way to try and crowdsource a bunch of data then. Even trying to come up with what vidya to source my voices from is leaving me feeling overwhelmed already (both from even trying to think of where to pool from, and storing it all).

To preface, based on reading your commit descriptions, and paying attention to your mannerisms, I assume you don't give a fuck about morality. So I have taken that into consideration when writing...

> Even trying to come up with what vidya to source my voices from is leaving me feeling overwhelmed

One of the biggest things is that you have to leverage yourself correctly

People like Roukanken can immensely help you, if you let them

There are many people who would love to contribute to this project, but who are unable to from the coding side of things...

And let's be honest, it's definitely not worth your time to collect/process data, as opposed to developing the code...

But there is a way to utilize this manpower effectively...


You need a way to publicly delegate tasks, and collect the data in an efficient manner...

We could create a google form for something like this...

| Anime | User Assigned | Status | Link to prepared dataset |
| -------- | -------- | -------- | -------- |
| Boku No Hero Academia S02 (Japanese) | Unassigned | 12/25 episodes | (Google Drive, Mega, MediaFire, Dropbox) |

| Game | User Assigned | Status | Link |
| -------- | -------- | -------- | -------- |
| Metal Gear Rising (English) | Text | Text | Text |

It would be a public form that anyone can sign up for, and you can delegate certain materials to different people, with varying scopes of work (i.e. I need someone to get these files, I need someone to transcribe these files, I need someone to split these files, I need someone to clean these files; or you can simply assign the whole request to one person)

The biggest things that would be needed for this to work effectively are...

  1. A good management system
  2. A good tutorial (more below)...

There are more complexities as well, such as managing voices per character and background music/noise, but as far as scale goes, anime may have you covered.

I am unsure of the best way to store the files, but once something is completed, it is probably best for you to simply download the files, then be done with it (after all, there really is no need for anyone to have the final, perfect cut files aside from you, especially if they meet quality standards)...

Advantages to using this method

  • You can use your time far more effectively
  • You can get data from anything/anywhere
  • You can request data that you'd like someone to try and get
  • You can get consistent, high quality Japanese using this method... (more on this below)
  • Anime will generally maintain the same VA for many episodes, not to mention OVAs, movies, etc., meaning that for all the audio that is "bad/unusable", there should be enough "good" audio to create a general set for a speaker. I'll admit: vidya voice acting is far cleaner, due to the files being "raw", but in terms of general data, anime audio is far more accessible, and in far more abundance. It simply requires far more post-processing work.

Disadvantages...

  • You will have to do some "light" inspections. This is why enforcing/being explicit about your standards will be important, because it will cut down on the amount of garbage you get vs. shit that is actually usable.
  • If someone messes something up, then you will have incorrect data, i.e. they assigned the wrong transcription to a line (I don't know how bad / how many lines you can mess up before there are major issues?)
  • Parts of the process will/may need automation
  • People need to explicitly understand what types of data should not be used ex. audio with heavy background music
  • To use anime, people will have to match the transcriptions, then split their data according to each character, which may be a pain in the ass

To mitigate some of the disadvantages, you could "assign" someone to help with this, i.e. it would be easier to train one to three people to understand what you need in an audio dataset, have them actually verify the incoming data/audio for you, then simply trust their judgement, as opposed to you manually verifying each dataset. That can significantly free up your time investment...

I am unsure of your current process of getting data (I know that you use certain game rips and libraries), but for fueling a massive, unethical dataset, desu I think this is the way...
The only thing is that there is no pre-existing "anime library", so in a lot of ways, this would be the first of its kind...

If there is an existing library, i.e. where people have already ripped and transcribed character VA audio, it would be far easier, but to my knowledge this does not exist.


Where do we get the material?

Fortunately, anime is very easy to download out in the wild

Various websites offer download utilities, and there are some that allow for downloading anime in bulk...

However, there is still a problem... how to prep the data?


Transcriptions and community service

There are various "transcriptions"/anime fan transcriptions, as well as subtitles for various animes...

These files provide the ENTIRE dialogue for a given episode of an anime, with both English and Japanese transcriptions. This means the accuracy is pretty good (provided the authors were accurate initially). For Japanese, I believe your software uses hiragana/katakana? That would be one issue, but I am assuming we could just throw the transcription into one of those online tools that simplify the kanji into hiragana/katakana.
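
For the kanji-to-kana step, an online site isn't strictly needed; a library like pykakasi can do it offline. A rough sketch (pykakasi here is just one option, not something the web UI actually uses):

```python
# pip install pykakasi
import pykakasi

kks = pykakasi.kakasi()

def to_hiragana(text: str) -> str:
    """Collapse a mixed kanji/kana line down to hiragana."""
    return "".join(item["hira"] for item in kks.convert(text))

print(to_hiragana("お疲れ様でした"))  # -> おつかれさまでした
```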

But...


How to split?

This process would be a lot of work for one person. Unless there is a superior method.

This is where a little investment would be needed to create a proper tutorial...

Essentially, we could teach current and incoming community members how to contribute to this project by...

  1. Following the datasheet
  2. Splitting and labeling files correctly
  3. Properly vetting and cleaning files....

Generally, most of this can be done through Audacity.
(If I am being honest, there is probably a way to automate the "cleaning" aspect of Audacity, e.g. autorun a script that will take a batch of files and apply a chain of NoiseRemoval > Compressor > Equalizer.)
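
A rough sketch of that kind of batch cleaning, using ffmpeg's audio filters in place of the Audacity chain (the filter settings, sample rate, and folder names below are placeholders, not tuned values):

```python
# Assumes ffmpeg is on PATH; afftdn/acompressor/equalizer stand in for
# Audacity's NoiseRemoval > Compressor > Equalizer chain.
import subprocess
from pathlib import Path

FILTERS = "afftdn,acompressor,equalizer=f=3000:width_type=q:width=1:g=2"

def clean(src: Path, dst_dir: Path) -> Path:
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / (src.stem + ".wav")
    subprocess.run([
        "ffmpeg", "-y", "-i", str(src),
        "-af", FILTERS,               # denoise -> compress -> mild presence boost
        "-ar", "22050", "-ac", "1",   # mono, resampled; adjust to whatever the dataset wants
        str(dst),
    ], check=True)
    return dst

for f in sorted(Path("raw").glob("*.wav")):
    clean(f, Path("cleaned"))
```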

You will need to invest a little bit of time into making the process as straightforward as possible, and being CLEAR as to what you need and do not need, but it would get you brand new, high quality audio files

You will probably have to "inspect" the audio somewhat, but that's what you can train someone for....

Part of me honestly wonders if there would be some way to match the transcriptions to the audio, then split it automatically. In theory, if you could do this, it would actually reduce the need for community involvement, because it would be as simple as getting the transcript and audio, then splitting, cleaning, and using. You could basically create a one-click system that gives you the data you need (albeit with some light inspection)

However, my concern is that some transcripts do not include timestamps, making this somewhat difficult... maybe someone has a creative solution?


Where to store?

Hard drives? Cloud services? Exactly how much space do you need? I'm sure people would be willing to pool some shit together for you...


So to recap

  • We can leverage manpower/your time by outsourcing your needs to the community...
  • We can fulfill your data needs by using anime, and transcribed anime...there are pitfalls to this method, but it can help
  • Additionally, we can also get a fair amount of Japanese AND English off of these resources.
  • Cleaning/splitting/inspecting will be a challenge in keeping up the data quality you need, but there are various ways we can keep this simple, both on the user side of things and on your side
  • Additionally, there may be ways to find users who HAVE ALREADY DONE THIS FOR US (i.e. Goku may already have all his voicelines available for download with transcriptions)
  • Also, if you are feeling too lazy, you can simply train 1-3 people to manage this for you, so that all you have to do is simply "use" the datasets...
  • Storage - I don't know exactly what you need, but whatever it is, fuck it, lets get it done...

I would say that collecting data in this fashion means you would have to shift to being a little more of a manager on top of a coder, as opposed to straight-up doing everything yourself, but fuck it...

There are also other ways to get audio that is HIGH QUALITY and PROPERLY TRANSCRIBED but less "morally acceptable"; if you would like to talk about these, you can hit me back...

Lmk what you think...

I'm willing to help out a bit more if you are interested...


> This process would be a lot of work for one person. Unless there is a superior method.

Almost all of what you've proposed above can be done automatically. `whisperx` produces millisecond-granular timestamps per word (in ASS format, like anime subbers use); those timestamps can then be fed into `ffmpeg` to produce segments of the appropriate length. Cross-check against fansubs (or the official ones, if available) and throw out any that don't match.
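
A sketch of that pipeline, assuming whisperx has already written an `.srt` next to the audio (the file names, the 1–12 second bounds, and the output layout are placeholders):

```python
# Cut an episode into per-line clips from an .srt; assumes ffmpeg is on PATH.
import re
import subprocess
from pathlib import Path

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

def parse_srt(path: Path):
    """Yield (start_seconds, end_seconds, text) for each subtitle cue."""
    for block in path.read_text(encoding="utf-8").strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        m = SRT_TIME.search(lines[1])
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        yield start, end, " ".join(lines[2:])

audio = Path("episode01.mkv")
out = Path("segments")
out.mkdir(exist_ok=True)

for i, (start, end, text) in enumerate(parse_srt(Path("episode01.srt"))):
    if not 1.0 <= end - start <= 12.0:  # skip clips outside a usable length
        continue
    subprocess.run([
        "ffmpeg", "-y", "-i", str(audio),
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
        "-vn", "-ar", "22050", "-ac", "1",
        str(out / f"{i:05d}.wav"),
    ], check=True)
    (out / f"{i:05d}.txt").write_text(text, encoding="utf-8")
```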


> Almost all of what you've proposed above can be done automatically.

I wasn't all that familiar with Whisper, but it does seem quite awesome.

I guess at that point, if what you say can produce the segments, then all we would need to do is feed him the data/anime?

Can WhisperX detect differences in speakers/be able to "sort" multiple speakers? i.e. for a full anime episode, multiple characters.

There are probably more efficient ways to clean the data as well, I presume.

Author
Owner

> tldr You should enforce high example standards

I'm not even sure if I need high standards. WhisperX does a damn decent job now at transcription and timestamping, and the VALL-E paper says just having a ginormous dataset is robust enough to noisy audio and some inaccuracies. Sure, it'd be preferable to have the data as accurate as possible, but it doesn't need to be 99.99999% accurate.

> and pool the efforts of the community/other people

desu I just need ideas on what to source from (and a bit of where, as sounds-resource.com seems to be missing a decent amount of stuff that crossed my mind). Sure, it'd be nice if it was all handed to me transcribed and neatly prepared, but asking for it to be transcribed is a bit of a big ask, as I'd pretty much require only transcriptions from WhisperX + large-v2 + VAD filtering enabled, which requires a HF token. It's not a huge deal for me to do the transcription process itself, as a few hundred lines can be crunched through relatively fast on my 4070Ti.

> and use anime transcriptions instead of/in addition to vidya

My qualm with anime (dubs) is that there's a considerable amount of extra effort needed to get decent audio. I imagine the best-case scenario is BD releases with the vocals on a separate audio track, where you can just segment the audio by subtitles and it's all good, but the worst case is aired anime that won't have that. I also don't think any of the few anime I have watched were dubbed anyways, so I won't have much of anything to source from myself.

> I assume you don't give a fuck about morality

In terms of """ethically sourcing""" a dataset, I don't really have any qualms about that.

> And let's be honest, it's definitely not worth your time to collect/process data, as opposed to developing the code...

The only thing really left code-wise is just to make my own VALL-E implementation rather than rely on an existing one and continue working around its design decisions, but even then that's pretty low priority.

> We could create a google form for something like this

Pretty much what I had in mind. I'd settle just with something to submit character name + source and a link to the audio (or at the very least, where to get it).

> I am unsure of the best way to store the files but once something is completed, it is probably best for you to simply download the files, then be done with it (after all, there really is no need for anyone to have the final,

Actually, the final, quantized audio that gets trained against doesn't take all that much space, so something ginormous won't actually be all that much of a detriment. It's just the source audio that becomes a bit of a pickle if I keep it on disk. Worst case, I do have several, several drives (and could always buy another 10TiB one), but I'd just have to do bookkeeping as I'm quite a datawhore.

> You can get consistent, high quality Japanese

desu my concern over VALL-E X is quite a ways off (or at the least, even having a Japanese model). Incorporating Japanese would have to wait until I do get something cobbled together, as I'm really not sure how much of a problem a multi-lingual dataset would pose for training, even though it would definitely help increase my voice variety by including the sixty-or-so voices I already have transcribed.


From here I'll just generally address the rest of it.

I appreciate the thought-out planning on it all, but at the end of the day, as long as the samples are somewhat-put-together, I'll accept them: anime, TV shows, movies, what-have-you. Just anything that isn't an audiobook reading, as that's what I feel is the least likely to provide much of any variety. I'm not strictly restricting the dataset to just muh vidya; it's just both the best and easiest to source from, and what I'm 99% likely to source from myself.

On the flipside though:

  • I'm still not sure how much more "effort" is needed, as
    • I'm still trying to really min/max training to see what's the bare minimum I can get by to have it stop overfitting. It could be a few more voices, it could be a shit ton more. I most definitely can always add in more data in-post without needing to start from scratch, so that itself isn't a concern.
    • the code itself is pretty much as complete as I can get it now. I just need to get a model to verify everything is in order and there's no inherent issues with the code, and from what I can get so far, it seems solid. The only other thing I'd really care to do is write my own implementation to truly own it, rather than subjugating an existing codebase and deal with working around the warts.
  • I'm not really sure how much of a reach having a crowdsourcing form will have. I suppose it's better than not having it, but I do expect it to not really garner much traction (and I fear it getting too much traction).

For now though, anyone's free to drop a link to what they would like for me to train the model against. Between my usual weekend rituals and RE4make, I'll probably try and cobble together more voices to feed the training model, as I fed it the rest of what I had on my training server and it seems to already have peaked, given the little improvement from doing another LR restart.

Author
Owner

> I guess at that point, if what you say can produce the segments, then all we would need to do is feed him the data/anime?

Yeah. I pretty much just need the audio, and WhisperX / the transcription part of the web UI will handle the rest.

> Can WhisperX detect differences in speakers/be able to "sort" multiple speakers? i.e. for a full anime episode, multiple characters.

With diarization, yeah. It's not something I've tested, but the web UI's integration with WhisperX can use it, although I'll need to uncomment one line.
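
For reference, whisperx's diarization is backed by pyannote.audio; standalone, sorting a multi-speaker rip into per-speaker folders looks roughly like the sketch below (the model name, token handling, and file paths are assumptions, and this is not the web UI's actual code path):

```python
# Rough sketch: bucket an episode into per-speaker clips with pyannote.audio
# (the library whisperx wraps for diarization). Needs a HF token for the model.
import subprocess
from pathlib import Path
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="hf_...")  # placeholder token

diarization = pipeline("episode01.wav")
for i, (turn, _, speaker) in enumerate(diarization.itertracks(yield_label=True)):
    out_dir = Path(speaker)             # e.g. SPEAKER_00, SPEAKER_01, ...
    out_dir.mkdir(exist_ok=True)
    subprocess.run([                    # one clip per speaker turn
        "ffmpeg", "-y", "-i", "episode01.wav",
        "-ss", f"{turn.start:.3f}", "-to", f"{turn.end:.3f}",
        str(out_dir / f"{i:05d}.wav"),
    ], check=True)
```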

Hey @mrq have you seen [this one](https://www.youtube.com/watch?v=-9Ado8D3A-w)? Thoughts? https://github.com/svc-develop-team/so-vits-svc
Author
Owner

> Thoughts?

Unless I'm misunderstanding it:

  • isn't it primarily for speech-to-speech?
  • isn't it also only really best for singing?

I suppose it can fill a specific niche, but those two things for VITS (whatever they're classified as) are kinda not in my scope. Even if not for those things, I wouldn't really dip my toes into it, since it seems to have its own hands working on it. I only really dipped my toes into TorToiSe (and by extension this VALL-E implementation) because it was pretty much abandoned (at least at the time; I don't remember the timeframe between me slapping a web UI on it and finding out about the 152334H fork) but with lots of room for improvement.

Also I don't think I can have another ecosystem under my belt.


On the other side of things, the gravity of how much data I'm going to need to feed this beast is starting to weigh down on me more. I greatly increased both the speaker count and the lines being fed (I think I'm up to 51 speakers, 18484 lines, not sure about total duration) and, while I'm not so concerned about the throughput rate in terms of the entire epoch, it only seems to amount to a minor increase in output quality.

The AR's evaluation output is sounding better, but the validation output is only really sounding somewhat passable for English with the old-ish (as of a couple of days ago) voices, from before I threw the rest of Persona 3's voice lines into it. The NAR sounds better as usual, but there's no point in the NAR being good if the AR it gets fed isn't up to par.

I guess I'll just keep feeding the beast more data to train against. I'll let the Persona 4 lines (non Golden, unfortunately) bake for a day or two before throwing in more voices.

I could give up and feed it a LibreWhatever dataset, but I really don't want to feed it audiobook readings; it feels like I'm already getting better results by feeding it muh vidya audio.

If you don't mind sharing your collection @Roukanken (or anyone I suppose), I'll be happy to nab it and dump it into the hungering beast. Just the audio itself is fine, as I'll only really be comfortable with the transcription if it was run through WhisperX large-v2 + the VAD filter (as I haven't tested the segmentation quality on anything else).

Author
Owner

So, I was letting the FUD get to me about whether or not I should have backed the newer implementation instead. I was getting the model code transplanted into my fork, and as I was stitching up the dataloader to the forward pass, I realized something.

The first implementation (enhuiz) will:

  • given a speaker, sample a random line, getting the associated phonemes and quantized audio
  • within that same speaker, randomly grab a different utterance (quantized audio), and use that as the input prompt.
  • the associated audio with the text is used as the "target" to compute loss against the logits.

In hindsight, it makes sense, as it'll train in a way that reflects how it inferences. This should give great zero-shot capabilities, but at the "cost" of it being stubborn to train, and terrible to try and use as a non-zero-shot TTS system (like traditional TTS).

The newer implementation (lifeiteng) doesn't seem to do that. It'll pull "pre-computed features" (I imagine it's just the quantized audio, abstracted through Lhotse's API) and will use the "input prompt" as the target. Contrary to the above, it's quicker to train, but it harms zero-shot-ability, as it's not reflective of how it's inferenced against. It's not truly leveraging the capabilities of an LM.

However, I can't really knock it for that, as it at least has a "working" (albeit, not up to par) model, while the first implementation doesn't seem to have one yet.


However, one flaw is that I'm required to keep similar voices together, and not mix speakers within a folder. It's being a pain since a lot of the voices I'm sourcing are all one incestuous mess of uncategorized filenames (pretty much everything I either have to rip myself or have found already ripped; I only got lucky with being able to categorize the Persona 3 and 4 voice files).


"For now though, anyone's free to drop a link to what they would like for me to train the model against."

So what are the guidelines?

How many segments minimum?

Ideal clip length?

Just English?

Any particular way you want it labeled?

> However, I can't really knock it for that, as it at least has a "working" (albeit, not up to par) model, while the first implementation doesn't seem to have one yet.

Also, does this mean you are including another implementation, or are you sticking with the one you are currently using?

@mrq

> "For now though, anyone's free to drop a link to what they would like for me to train the model against." So what are the guidelines? How many segments minimum? Ideal clip length? Just English? Any particular way you want it labeled? > However, I can't really knock it for that, as it at least has a "working" (albeit, not up to par) model, while the first implementation doesn't seem to have one yet. Also, does this mean you are including another implementation, or are you sticking with the one you are currently using? @mrq
Author
Owner

> How many segments minimum?

Not sure; I still have about 20 voices with sub-50 lines, and I'm not too sure how much they'd help shape things, but I imagine at least 10 lines would be fine.

> Ideal clip length?

Ideal would be between 3 and 12 seconds, but whatever the transcription tab in the web UI spits out seems decent enough. The paper mentions 10 to 20 seconds, but it's better to not have audio lengths that are too long.

> Just English?

Mhm. I'm worried adding Japanese voices might be too much right now. Phonetically it should be fine, but I need to wait for this next batch of voices to cook before trying it.

> Any particular way you want it labeled?

Nothing in particular. One big folder of all of a character's dialogue is good enough, and I'll just feed it through WhisperX to transcribe and timestamp it adequately.


It would be a big help when I further scale up the dataset again. As of now I've fed it:

  • SA2 Shadow, Rouge, and Knuckles
  • Persona 3 voiced story lines
  • Persona 4 voiced story lines
  • FFXII lines
  • Kingdom Hearts 1 lines
  • Half-Life 1 lines
  • lines from that Westwood Studios Blade Runner game
  • Stanley Parable narrator
  • other minor datasets I've had left over that are still in from my initial go-around

I also have Demon's Souls, Elden Ring, Tales of Symphonia (and I need to extract Vesperia skits), and FFX, but they're all uncategorized so I can't really do anything with them outside of some clever tricks with modifying the dataloader process.


Here are three links that I think would be a good fit.

https://www.youtube.com/watch?v=XkMoingR1p0 - JFK

https://www.youtube.com/watch?v=hzW-h_Rm8Jo - Dempsey, really good emotion

https://www.youtube.com/watch?v=1S48jHXh44U - Misty, female variant

In general, Call of Duty has really good voice acting for the zombies portion, and all their characters have 10-20 minutes blocks of audio, all of which is clean.

Between all the games, there is probably a large number of characters we could use.

Would you like me to download and link these for you? Or can you do it on your end? (These are larger chunks of just one character each, so I figure it'd be easy enough to just run them through Whisper?)


Also...

  1. Female voices? Male voices? Any and all?
  2. Accents? Do you want flat american, or would more variation be better?
  3. Reading the above, you seem to want more "emotion" from voices?
Author
Owner

> Would you like me to download and link these for you? Or can you do it on your end?

I can rip it with yt-dlp and transcribe from there. I'll add them into a next batch after this one gets chewed through (at this rate, I think another few days).

> Female voices? Male voices? Any and all?
> Accents? Do you want flat american, or would more variation be better?
> Reading the above, you seem to want more "emotion" from voices?

No preferences.

If it were on rather monotonous audiobooks, I would probably try and have "uniformity", but because it's on real-er data, I don't think I should have uniformity like specific accents.


Have you checked out the Pony Preservation Project datasets? You can find them here:

https://mega.nz/folder/jkwimSTa#_xk0VnR30C8Ljsy4RCGSig/folder/OloAmDqZ

and here (These are non-MLP datasets):
https://docs.google.com/document/d/1y1pfS0LCrwbbvxdn3ZksH25BKaf0LaO13uYppxIQnac/edit#heading=h.6jgcpmrwa3fq

all of them are already filtered, organized, cut, and transcribed for you, so that could hopefully make it easier for you.

Author
Owner

> Have you checked out the Pony Preservation Project datasets? You can find them here:

Oh right, I forgot I can leverage /mlp/'s autism. I'll nab them too for the next feeding time as well. I'm sure they'd also appreciate if I did train it on their technicolor horses.

> and here (These are non-MLP datasets):

I was going to say it's a bit of a shame that most of it is already Persona 4, but if it's Golden, then that's golden. It does have S. Links too, which my rips from sounds-resource doesn't have, so I might as well grab those regardless and replace what I have.

> all of them are already filtered, organized, cut, and transcribed for you, so that could hopefully make it easier for you.

Sugoi. My only worry is that a cut might be too long and either get culled anyways or cause OOMs when the stars align. However, I might be able to remedy this by having my fork conform to the paper better (have the input audio prompt trimmed to 3 seconds; it might also improve training throughput, at the cost of not having as-strong variable-length input prompts).

Author
Owner

I swear every time I grow a wild hair and dive back into the implementation I forked, there's another twist with how it behaves.

To be brief, I was trying to both have a way to calculate duration from the encoded/quantized audio and then see about trimming the quantized audio down to 3 seconds to feed for training (to see how much used VRAM I can reduce and how much of a throughput increase I can get).

Turns out that not only does the implementation randomly select a different utterance to use as the input prompt; by default, it can use up to three random utterances and combine them. I say up to, because it does a probability check to see if it should continue. This most definitely explains the wild variation in VRAM use between steps, so I should be able to make this a more sensible amount.

I'm pretty sure this is overkill, but in theory it should help dissociate input length from inference quality; at the same time, I think it'd be much, much better to just have it poll one utterance.
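
In other words, the prompt construction behaves roughly like the sketch below. This is a paraphrase of the behavior described above, not the fork's actual code; the 0.5 continue-probability is an assumption, while the 75 frames per second matches Encodec at 24 kHz.

```python
import random
import torch

FRAMES_PER_SECOND = 75  # Encodec at 24 kHz

def sample_prompt(utterances, target_idx, max_utts=3, cont_p=0.5, max_seconds=3.0):
    """Build an acoustic prompt from other utterances of the same speaker.

    `utterances` holds quantized Encodec codes shaped (codebooks, frames); the
    utterance at `target_idx` is what loss is computed against, so it's excluded.
    `cont_p` stands in for the fork's keep-going probability check.
    """
    pool = [u for i, u in enumerate(utterances) if i != target_idx]
    prompt = [random.choice(pool)]
    while len(prompt) < max_utts and random.random() < cont_p:
        prompt.append(random.choice(pool))
    prompt = torch.cat(prompt, dim=-1)  # concatenate along the frame axis
    if max_seconds is not None:         # the 3-second cap discussed here
        prompt = prompt[..., : int(max_seconds * FRAMES_PER_SECOND)]
    return prompt
```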


Enforcing a maximum of 3 seconds for training has let me go from a batch size of 4 to a batch size of 16 for the same overall iteration rate (so I've effectively 4x'd my training throughput; I guess I've been bottlenecking my 4070Ti). I think I'll just keep it that way.


> I was going to say it's a bit of a shame that most of it is already Persona 4, but if it's Golden, then that's golden.

Yeah, the ones on the doc are the Golden version! At least according to the anon who ripped them (I didn't check it myself)...


Can whisper run through a batch of single files with the same level of convenience? There are some clips I have where it is like 40 .mp3 files, all unlabeled, but for the same character. I figure I would just stitch them together into 1 file anyways, but I am just curious.


> Can whisper run through a batch of single files with the same level of convenience?

Kind of. You can specify multiple files when you run it, e.g.: `whisperx --model large --task transcribe --language en file1.wav file2.wav file3.wav ...`

Author
Owner

> Yeah, the ones on the doc are the Golden version! At least according to the anon who ripped them (I didn't check it myself)...

Yeah, they're from Golden. I tried adding in the Chie lines the other day (since I will admit the Golden VA grew on me, despite initially preferring the original Chie for some time) and I couldn't for the life of me get them processed through WhisperX; it would kill the entire process when it got past the first three or so lines. I tried remuxing it in ffmpeg but no luck. Oh well. I was only doing that since I had to revert from a checkpoint the other day, as I completely botched something when I was moving my data around (more on that specifically later).


> Can whisper run through a batch of single files with the same level of convenience? There are some clips I have where it is like 40 .mp3 files, all unlabeled, but for the same character. I figure I would just stitch them together into 1 file anyways, but I am just curious.

Yeah. My naive way about it is to just throw it all into one audio file (I can't recall if I mentioned the steps to do it with ~~Audacity~~ Tenacity on the wiki somewhere, but that's what I would do to stitch them into one audio file), as I have some voices that are unfortunately one audio file. That approach seems to work almost just as well as having the audio separated for a single voice.
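
If Audacity/Tenacity feels like overkill for that, the stitching itself can also be scripted with ffmpeg's concat demuxer; a sketch (the folder and file names are placeholders):

```python
# Stitch a folder of per-line .mp3 rips into one file before transcription.
# Assumes ffmpeg is on PATH; re-encoding to wav sidesteps mismatched mp3 headers.
import subprocess
from pathlib import Path

clips = sorted(Path("character_rips").glob("*.mp3"))
list_file = Path("concat.txt")
list_file.write_text("".join(f"file '{c.resolve()}'\n" for c in clips), encoding="utf-8")

subprocess.run([
    "ffmpeg", "-y", "-f", "concat", "-safe", "0",
    "-i", str(list_file),
    "-ar", "22050", "-ac", "1",
    "character.wav",
], check=True)
```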

I imagine it might get me out of a pickle with completely unlabeled multi-speaker rips with diarization, but I haven't bothered trying it yet.

> Kind of. You can specify multiple files when you run it, e.g.: `whisperx --model large --task transcribe --language en file1.wav file2.wav file3.wav ...`

[It seems](https://github.com/m-bain/whisperX/blob/main/whisperx/transcribe.py#L156) to be effectively the same as programmatically doing it through the web UI (in the sense that the models are still loaded during each iteration).

I think technically you might be able to get better throughput by processing one mono-file instead of separate files when VAD filtering is enabled, as the VAD filter "pipeline" allows for batching to increase throughput, and larger audios can be batched "better" (I don't think there's much of a throughput uplift from larger batch sizes on sub-30-second segments).

Author
Owner

Anywho, I'm blasting ropes from how training is shaping up now. It was a ***really*** rocky start, but it seems to be smooth sailing now, as I'm getting actual clean output from utilizing both the AR and NAR together to produce output, rather than playing it by ear with each model's output separately.

After my serendipitous sniffing the other day through the implementation I forked, I:

  • added configuration options to limit the input acoustic prompt to a specific duration (I'm training on three seconds, per the VALL-E paper; a rough sketch of the idea is below, after this list)
    • this not only reduces the amount of VRAM consumed per batch, but also reduces the amount of input data to work on, so training isn't so "strained". I'm not sure of any penalty in accuracy, but the paper really boasts being able to use three seconds for inferencing, so it should be fine.
  • added NOT invoking GC every iteration (and instead at non-critical points like checkpointing and evaluation), as it incurs about a 0.2s/it penalty on my 4070Ti
  • unfortunately had to revert to a past checkpoint after a day or two of tainted training. I was reorganizing where I had my data (`./training/{voice}/valle/` => `./training/valle/data/{voice}/`), and because the speaker name getter lambda was fetching the 2nd-to-last folder name instead of the last folder name, *all* lines were treated as the same speaker (`data`), effectively making the input prompt random data.

After fixing my issue, reverting, and applying the above throughput increases, I was able to squeeze out some more "optimizations" to increase my batch size from 4 to 16 while having an even faster iteration rate (bs=4 yielded an average of 1.4s/it, while bs=16 with GC disabled per iteration yields an average of 1.04s/it). I was wrong to assume my 4070Ti was not bottlenecked and that batch size wouldn't starve it of throughput. Unfortunately, I should have gotten a 4080 instead for more VRAM, despite it being the worst Ada card at the time (at the time, because everything 4070 and below is just as bad).
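
(The prompt-capping mentioned in the list above boils down to something like this; a sketch only, not the fork's actual code. The 75 frames/second figure is EnCodec's frame rate at 24 kHz.)

```python
# Sketch of capping the acoustic prompt at ~3 seconds of EnCodec codes.
import random
import torch

ENCODEC_FRAMES_PER_SECOND = 75   # EnCodec @ 24 kHz emits ~75 code frames per second
MAX_PROMPT_SECONDS = 3.0         # the 3-second acoustic prompt from the VALL-E paper

def trim_acoustic_prompt(codes: torch.Tensor) -> torch.Tensor:
    """Randomly crop a (frames, n_quantizers) code tensor down to <= 3 seconds."""
    max_frames = int(MAX_PROMPT_SECONDS * ENCODEC_FRAMES_PER_SECOND)
    if codes.shape[0] <= max_frames:
        return codes
    start = random.randint(0, codes.shape[0] - max_frames)
    return codes[start:start + max_frames]
```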

Additionally, I realized I can also test actual inferencing during evaluation (RVQ layer 1 from the AR, RVQ layers 2 through 8 from the NAR), and hoo boy, it's actually decent output, unlike the monstrosity of my initial inference test (for some reason my evaluation/validation datasets are either Westwood Blade Runner or Kingdom Hearts):

  • https://vocaroo.com/1lrKKbecR6FA ([reference](https://vocaroo.com/1cGNMj79pVWk))
  • https://vocaroo.com/14Mmp5mAWAFP ([reference](https://vocaroo.com/1dZNMKsP4DMi))
  • https://vocaroo.com/1nYf2vCdcAKX ([reference](https://vocaroo.com/1kQ0MH1VGxlN))

I picked ones with noticeable flaws in them so it's more apparent they're not just the reference clips. There's still a sizeable amount of the evaluation output that doesn't sound quite right, and the AR+NAR validation output is pretty rough.

It's extremely relieving to hear that it actually can work, and it's probably just the provided inference method being a bit sussy. It's also relieving that I don't need to keep shoveling more, and more, and more data, but I might as well keep doing it, as it still has issues fitting just right for outside data, at least, given the validation output.

And my current graph (epoch count is related to the current dataset, I usually will do a soft-reset by loading the weights and not the optimizer state when I change the dataset): ![image](/attachments/2428dca1-379a-4524-9944-ceee7f333486)

I haven't added reporting the loss for the AR+NAR yet to the graph (it should be simple), as it's a recent addition so it wouldn't thoroughly be reflected in the graph yet.

I still have a lot more baking to do for it to be "just right", but for it to give quasi-decent output now gives me hope instead of FUD about it being a fool's errand.


That's awesome to hear.

> It's also relieving that I don't need to keep shoveling more, and more, and more data, but I might as well keep doing it, as it still has issues fitting just right for outside data, at least, given the validation output.

So do you still want data? I've got some awesome clips lined up. Both Japanese and English.

> Mhm. I'm worried adding Japanese voices might be too much right now. Phonetically it should be fine, but I need to wait for this next batch of voices to cook before trying it.

Are we still at this stage?

> I still have a lot more baking to do for it to be "just right"

How much more improvement do you think you can get out of the VALL-E implementation? Do you think it surpasses/will surpass your tortoise model? Also, VALL-Ex?


So what's the process looking like now? Is it just to keep training it and adding more voices until it's perfect?

Author
Owner

> Are we still at this stage?

mmm

It's hard to say still. I think my newer understanding of how the implementation works, and how VALL-E itself works, suggests it should be pretty resistant to any cross-lingual tainting, but I'm having a really hard time trying to express *how* that might be. I guess it's just better to try it and see how it behaves.

I'm not sure if I should aim for it now, since I know the implementation is perfectly fine and works (albeit the dedicated inference routine seemed a bit flawed, but that can be easily remedied). Outside of cross-lingual/VALL-E X, the last thing on the metaphysical list is getting a decent pre-trained model together.

But, if I'm doing that, I might as well get it baking on Japanese too. If I'm lucky, there could be some crossover, with either language bolstering the other up in training.

> How much more improvement do you think you can get out of the VALL-E implementation?

Performance wise, I'm very sure this time I'm all tapped out on how much I can squeeze, outside of playing with fire and trying 4-bit training (I worry about accuracy issues at that much quantizing). I genuinely can't think of any other avenues for improvement.

Output quality wise, definitely can get more improvement, but it's a matter of how long it will take for it to get there. My training method is still pretty naive and flawed, so I can always refine that aspect.

Quality of life wise, definitely more room for improvement. I'd like to slot in explicitly providing a validation dataset (easy, but it's low priority), and there's probably some other things to muck around with but I can't quite recall.

> Do you think it surpasses/will surpass your tortoise model?

I think in terms of:

  • raw audio quality
  • inference speed
  • the entire stack being pretty clean and not (as) wart-y
  • it being on actual data rather than audiobooks (and whatever else TorToiSe was trained on)
  • the text tokens being IPA phonemes makes things a lot easier

it definitely *can* outperform TorToiSe. It's now pretty much just up to how the model itself is trained.

> Also, VALL-Ex?

In terms of subjugating TorToiSe to try and have a cross-lingual model, definitely. Non-English TorToiSe will always be restricted by both its tokenizer and the CLVP/CVVP.


> So do you still want data? I've got some awesome clips lined up. Both Japanese and English.
> So what's the process looking like now? Is it just to keep training it and adding more voices until it's perfect?

Mhm.

The beast must be fed until it starts exhibiting decent zero-shot capabilities (through decent validation output).


Are there still any advantages to tortoise after playing around with VALL-E?


@mrq Feeding the beast. Here is batch 1 of some stuff I've been trying to collect. This batch has 10-20 hrs.

The only problems...

  1. In certain clips the audio might be "tight", i.e. there might be a really small delay between clips
  2. Some have occasional sound effects.
  3. Some have grunts

But overall, these seem pretty clean. Let me know if they would work for you, and if there is anything I can do to prepare future lists better.

Author
Owner

I'll probe through them whenever I get the chance next. I incidentally worked on the dataset preparation process to be cleaner and less of a pain (stuff like fixing the phonemizer memory leak, batch processing all voices, allowing subdir voices, etc.).

Sadly, I may have had some flawed reference audio, as it seems I've trimmed them a little too tight at the end. I've been noticing whatever evaluation output makes it out that it ends a bit too abrupt at the end, so I had to reslice and reencode all my audio again with a trim end offset of 0.2s instead of 0.05, for safety.
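
(The offset tweak amounts to something like this when slicing segments out of the Whisper timestamps; a hypothetical helper, not the actual slicing code.)

```python
TRIM_END_OFFSET = 0.20  # seconds of extra tail kept after each segment (was 0.05)

def padded_segment_bounds(start_s: float, end_s: float, file_duration_s: float):
    """Hypothetical helper: pad the end of a transcribed segment so the sliced
    clip doesn't cut off abruptly mid-word."""
    return start_s, min(end_s + TRIM_END_OFFSET, file_duration_s)
```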

I'm doing another "restart" with resetting the LR and iteration count so it goes through another LR cycle just to jostle up the weights again. I noticed a lot of the older data doesn't sound too great (P3 mostly is what I've been catching), while the newer audio (some Westwood Blade Runner, Kingdom Hearts, a few P4) will sound pretty decent. I'm not too sure why there's the disparity.

Not much of a progress report, since it still boils down to how much time I'm putting into baking the model. I've been wanting to at least release the model currently, but there's no point when it's doodoo asscheeks for zero-shot AND still for a good majority of the voices it's training on; the validation output is still penis, and I'm very, very sure whatever validation output that does sound great was secretly trained against previously, as the dataset sets aside 5% of each speaker for validation (which depends on shuffling the list with a 0-seed for each voice, so it could very well change every time I'm recreating datasets).


I lied, I added more data, namely the CoD lines and the English lines from those BNHA/MHOJ2 YouTube voice clip compilations linked earlier, and a few other personal add-ins from YouTube voice line compilations, to further put my trust in how it works. It seems decent, so I suppose if you can't be assed to source the raw audio files but they exist as one conglomerate, feel free to share that.


> the English lines from those BNHA/MHOJ2 YouTube voice clip compilations linked earlier

I'm glad those worked. There are a bunch of similar games that would have really good sources as well...

Fighter Z
dragon ball tenkaichi
Attack on Titan (1 and 2)
Naruto
DB kakarot
Scarlet Nexus
One piece
Demon Slayer

Just to name a few. (I picked these, because they have a japanese component, for if and when you start adding those.)

> I'm doing another "restart"
> I lied,

? Are you doing a reset?

> I've been wanting to at least release the model currently

How big is the model?

Also, I noticed that GIT was down for a moment this morning. It made me realize there's no apparent way to contact you if shit went south. Do you happen to have some type of link, fan email, or alternative contact, just in case?


@mrq

> 60k hours, 7000 speakers

Most importantly, where are you at with data? I realized that the links I provided you only amounted to about 5 hrs in total, which, if we need massive amounts, is practically nothing. Do you have any goals for how much data you want, i.e. a tracker of sorts? Maybe make an issue?

Also, in relation to this problem, have you heard of Spleeter, or relevant software? Basically, it separates vocals from an audio track. Based on Hgt1778's suggestion, I was wondering if we could take anime and run it through something like Spleeter to then have clean audio. I figure that this might help fill the data void? I am playing around with the software at the moment, and will let you know how well it works. The only flaw I see is that the voices for a particular anime would have to be "sorted" out, but I believe Whisper can do that?

Anyways, good shit. I've got more voicelines on the way.


https://vocaroo.com/19wD5o3Lvsz4 - Original

https://vocaroo.com/14xGpSo2ivbr - Vocals separated

For very little finetuning, it is actually VERY impressive. Do you think this would work? (It would allow you to utilize not only any anime, but even beyond that, any show...

This was run using Spleeter
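
(For anyone wanting to reproduce the separation above, it boils down to Spleeter's 2-stem model; paths here are placeholders.)

```python
# Vocal isolation with Spleeter's 2-stem model (vocals + accompaniment).
# Paths are placeholders; this mirrors what the `spleeter separate` CLI does.
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")
separator.separate_to_file("episode_audio.mp3", "separated/")
# -> separated/episode_audio/vocals.wav and separated/episode_audio/accompaniment.wav
```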

Author
Owner

> Are you doing a reset?

An LR/optimizer reset just discards the metadata used for training while retaining the actual model. This way, the iteration count is reset to zero, and the LR schedule restarts from the beginning, hence the LR restart.

LR restarts help jostle any stuck weights that may not get resolved from just bruteforce training with really low LR rates, especially when adding in more data to a model that cannot yet generalize.
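
(In PyTorch terms, the "soft reset" is roughly the following; a sketch with made-up checkpoint keys and hyperparameters, not the trainer's actual code.)

```python
# Sketch of an LR/optimizer "soft reset": keep the weights, rebuild everything else.
import torch
from torch import nn

model = nn.Linear(8, 8)  # stand-in for the AR/NAR model

ckpt = torch.load("checkpoint.pt", map_location="cpu")  # made-up checkpoint layout
model.load_state_dict(ckpt["model"])                    # reuse the learned weights

# Fresh optimizer/scheduler and a zeroed step counter, so the LR schedule
# starts over from the top (the "LR restart").
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10_000)
step = 0
```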

> How big is the model?

Can't really check, since the model + optimizer states from DeepSpeed are bfloat16 as well, and not the default fp32 weights typical with a torch model. They're 2.2GiB each for the NAR and the AR, but I think exporting them will get it down to 500MiB each? I can't remember.
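
(Back-of-the-envelope only, under my own assumptions about what's in the checkpoint: ~500 MiB of bf16 weights is roughly a 260M-parameter model, and AdamW's two moment tensors would account for most of the remaining checkpoint size.)

```python
# Rough arithmetic; assumes bf16 weights plus AdamW's two bf16 moment tensors.
params = 500 * 2**20 / 2                 # ~500 MiB of bf16 weights -> ~262M parameters
weights_gib = params * 2 / 2**30         # the exported weights alone
optimizer_gib = params * 2 * 2 / 2**30   # exp_avg + exp_avg_sq
print(f"~{params / 1e6:.0f}M params, ~{weights_gib + optimizer_gib:.1f} GiB with optimizer state")
# -> ~262M params, ~1.5 GiB (plus misc state), in the ballpark of the observed 2.2 GiB
```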

> Also, I noticed that GIT was down for a moment this morning.

There's a weird issue that seems to come here and there where either Gitea itself, or the VM that Gitea is running in, will shit the bed. It's such a niche thing that I can't really diagnose, but restarting the VM fixes it.

> Most importantly, where are you at with data?

64138 samples, 250 speakers, 139865 seconds (≈38.9 hours) of audio.

> have you heard of Spleeter, or relevant software? Basically, it separates vocals from an audio track.
> For very little finetuning, it is actually VERY impressive. Do you think this would work?

I think in a very, very, very narrow pinch it's fine, but training is very sensitive to any and all audio quirks in a speaker's voice. I imagine if the quirks themselves are isolated to a specific voice it wouldn't be too big of a deal, but if those quirks are present in a significant portion of the dataset, then it more than likely will taint the model in some shape or form.

Maybe when the dataset itself is larger I can consider it, as there'd be less fear of it muddying things up, but it should be used sparingly for now.

Right now though, I'm just letting the model bake again with the new data, and seeing if the older portions of the dataset catches up finally.

Author
Owner

I'm biting the bullet and dumping in LibriTTS `clean-100` (247 speakers, 30k unsegmented lines, don't have an idea about duration yet or final line count).

I'm getting really worried that I'm going to have to dump the weights and start from scratch due to it overtraining solely from the text phonemes itself. From evaluation output, I had something sourced from SA2 Knuckles outputted with SA2 Rouge's voice, and the only explanation I have for it is that it's overtraining on the text phonemes itself.

iunno, I'm probably just overreacting over a flaw, but I either need to take the loss and dump three weeks of training or risk it getting worse and having to dump it later in the line. I honestly don't remember how long it did take for it to even get to something with a semblance of speech with a tiny dataset, so that's probably the only reason I'm against dumping it, since even bad weights are better than no weights.


@mrq

> I honestly don't remember how long it did take for it to even get to something with a semblance of speech with a tiny dataset

Is the size the general concern atm?

Would you rather have more clean game data?

If so, try https://www.sounds-resource.com/ . It has extracted audio assets from probably 70% of games you could think of, which means it's all clean. And it is sorted by language and character. If it meets your standards, it probably has more than you could use.

For a more specific example, look at one like https://www.sounds-resource.com/xbox/cars/ and check out Lightning McQueen's voice pack. Most of the packs will be organized as such (and as you can see, each game should have a decent abundance).


> I'm biting the bullet and dumping in LibriTTS `clean-100` (247 speakers, 30k unsegmented lines, don't have an idea about duration yet or final line count).

If you are going to use LibriTTS, then you should also check out [HiFi-TTS](https://www.openslr.org/109/), which is a smaller but higher quality dataset (sampled at 44.1 kHz); it's what tortoise also uses in addition to LibriTTS, so it might be better for higher quality output.

Also, if you were going to train a multilingual model like VALL-E X, then [this](https://www.openslr.org/resources.php) has a lot of datasets for various different languages.

Author
Owner

Twice I had what I was going to say eaten away. I already spent maybe 45 minutes to an hour, so I'm keeping it as brief as I can. Apologies if it comes off rather curt.


> Is the size the general concern atm?

Yes, in an odd way.

My concern about the "old" data not being up to par with the new data seems to stem from those speakers' line counts being much bigger than the line counts of the "new" data. I'm not sure where the fix for it lies (one possible remedy is sketched below), so I'm just going to ignore it for now.
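
(One possible fix, purely as a sketch and not something the trainer currently does: weight sampling inversely to each speaker's line count so the big speakers stop dominating every epoch.)

```python
# Sketch of speaker-balanced sampling with PyTorch's WeightedRandomSampler.
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# `speaker_ids` is a hypothetical list mapping sample index -> speaker name.
speaker_ids = ["old_big_speaker"] * 9000 + ["new_small_speaker"] * 300

counts = Counter(speaker_ids)
weights = [1.0 / counts[s] for s in speaker_ids]  # big speakers get proportionally smaller weight
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(dataset, batch_size=16, sampler=sampler)
```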

> try https://www.sounds-resource.com/

Already have.


> check out HiFi-TTS
> 10 speakers, 291.6 hours

Yeesh.

I'll keep it in mind. At that point though, I might as well dump in more LibriTTS.


Training with the `clean-100` portion seems to be doing fine; it didn't outright decimate my loss/accuracy metrics, so I guess the model itself is in good standing with respect to not overfitting. The evaluation output even at one epoch doesn't seem completely terrible; there's definite room for improvement, but at least it's trying.


Sort of off-topic, but Microsoft just published NaturalSpeech 2, which seems to be a significant improvement over the VALL-E architecture. From a short skim of the paper, it seems to be a latent diffusion model, which might make it slower than VALL-E(?). It also seems that zero-shot prompting would be much easier and better, since it only requires audio, like TorToiSe and 11Labs.

The biggest innovation in this paper is that they use a continuous vector audio codec instead of discrete tokens.

It seems to be simpler since the diffusion model replaces the two stage approach of VALLE. It also can do singing (though not as natural as regular speech) which is pretty neat (though it needs to be trained with singing in its dataset obviously).

It will probably be a while before any good open-source reproduction comes out, like where VALL-E is right now, but it seems useful to keep an eye on it for now :)

https://speechresearch.github.io/naturalspeech2/

though lucidrains already started with his [pytorch implementation](https://github.com/lucidrains/naturalspeech2-pytorch) because he's insane lol

Author
Owner

mmm, yeah I definitely won't try and tackle that. I'll let the real experts deal with maturing it, and hopefully someone with the actual compute will play around with it and homebrew a model.

From a cursory glance at the paper, it does seem to address the "concerns" I had with VALL-E and its "haha let's throw some quantized waveforms at it and see how it learns from it" approach, which makes VALL-E, in a hyper-reductionist manner, a sophisticated MIDI synthesizer.

However, the more I look at the paper, the more turned off I feel about it.

It's reintroducing the problems I did have with TorToiSe with more moving parts, still relying on conditioning latents (or at least, an analog to it). Now there has to be a model for encoding phonemes AND the pitch/duration predictor AND the speech prompt encoder. Yeesh. Not to mention the paper says the sourced audio is sampled at 16KHz. I understand the intent, as it effectively serves as an inherent way to squash out any unwanted sound from the waveform by narrowing the bandwidth, but it's still a quality drop somewhere, which I feel is a bit of what TorToiSe suffers from too. Relying on latent vectors instead of the input waveform also pretty much erases any hope for voices with intentional quirks like SHODAN or GLaDOS from being reproduced with it. VALL-E at least has that saving grace from working on the actual waveform itself, and can reproduce all acoustic conditions.

The training dataset seems to leave a lot to be desired too. The paper mentions the dataset is 44K hours, which at first seemed to mean the new method is just that much more efficient, but later the paper mentions "our model is still underfitting and longer training will result in better performance". Like, they mention that a large, large dataset is practically necessary for good TTS, but they just don't *quite* do that.

The demo also leaves a lot to be desired. At first, it sounds better than VALL-E, as VALL-E has that slight quantize crust that I'm all too familiar with. But, I checked back with the original demo page, and that crust is missing. It's funny, since the paper mentions they "directly collect some audio samples from its demo page for comparison". Ignoring that, the speech seems rather stilted for it being "natural".

I'll give it to the singing, though. I'm sure VALL-E *could* reproduce singing of some kind (with what I imagine is better annotating of the input text), but currently it doesn't, and for all I know it might very well not be able to. But, I think if anyone wants something that sings, they'd just use a VITS solution, at least from all the prattling I've heard about it in passing.


iunno, I'm trying not to be a hater, and it definitely is neat seeing that in the span of what, a few months from VALL-E, and much shorter from VALL-E X, there's already a contender to replace it. I'm sure the lucidrains implementation will accomplish something, especially seeing it's sponsored, and I'll definitely play around with it if something materializes from it.

But, my impressions of it so far are just... flaccid, and at that point I'd just use TorToiSe over it.


In other news, I don't have much of a progress update. Training seems to need at least another week at this rate. It's dawning more and more on me that it might take a really long time to train the model until it gets something adequate, and the temptation to just rent something like an 8x4090 machine is creeping up on me, I think for like $6/hr. I think my only setback (besides the obvious inevitable money pit) is that I already kind of forgot the exact procedure to get training under a docker container working, and I can't be assed to play around with docker files first.

Author
Owner

Since I don't really have anywhere else to mention it, I think I squashed the error 500 bugs. I'm not sure why it happened recently, but fuck SystemD. I had to use `coreadm` in my global zone to disable core dumping, since coredumps never ever matter for me anyways.


In quasi-related news, I'm leveraging LibriTTS's `test-clean` dataset to serve as an actual validation dataset, to gauge how well the model is at generalizing speech (the crux of zero-shot). I should have done it much, much earlier, to better gauge how things are going over time, but oh well. Training this monster of a batch is currently at iteration 19915, epoch 53ish, so I got probably a half-week left before deciding when to add more data in. I might just cave and dump the `clean-360` dataset into it then, iunno.

Just moreso wanted to mention the error 500 issue being resolved, hopefully.


I do have this, I suppose as a *very rough* zero-shot test: [output](https://vocaroo.com/14XRI8sQD5Bg) / [reference](https://vocaroo.com/1cdRNHtQQjWS)

It's kind of cute in a weird way seeing it try and speak. It's definitely getting there, but a lot of the other validation output leaves a lot to be desired.

Author
Owner

Bit the bullet yesterday; transcribed the `train-360` LibriTTS (sub)dataset, putting me at a total of 116 hours, 167520 lines (total, actual dataset could be more, but I dropped Westwood's Blade Runner and FFXII lines since I felt they weren't really worth training against for the quality they were at).

I'm starting to be at wit's end, though. The metrics towards the end of the last batch stagnated, and the current batch seems pretty stagnated too, even with a new LR restart, so I don't know. I'll have to keep an eye on it for a few days, but I'm worried that no amount of more training and data will help.

End of the last batch:
![image](/attachments/cef4e810-2ce9-497e-9987-573bad568f35)

Current progress with the new batch:
![image](/attachments/ea1fdd6c-32b5-4791-b0f6-4a8bb14a65c7)


Have you seen https://github.com/Fictiverse/bark? The singing is pretty neat, and the inference time is quite fast.

Author
Owner

> Have you seen https://github.com/Fictiverse/bark?

Seen it. I mentioned some thoughts [here](https://git.ecker.tech/mrq/ai-voice-cloning/issues/213), but I'll mention my current thoughts:

  • it takes some of VALL-E's principles in using EnCodec as an intermediate representation of the sound, which is neat and should reduce a lot of issues I had with TorToiSe's pipelines
  • although it takes from some other TTS LMs by conditioning the inputs to be arbitrary, going back to a bit of an issue I have with TorToiSe's pipelines
  • "muh ethics", so cloning is restricted to the models provided
    • there's a fork that seems to have some homebrewed voices provided, but still no way to brew them yourself

The last one is my biggest problem. desu I shouldn't really bother with it right now if it can't do unrestricted voice cloning (or at least, without bothering to cobble together a way to provide your own voice `.npz`s).


As a slight progress update, I might have fucked up by setting my de-facto batch size (bs=16, ga=16; with gradient accumulation that's an effective batch of 16×16 = 256 samples per optimizer step, versus 4×4 = 16 before). I have a hunch that I started getting worse results from training after I optimized the VRAM usage and increased my settings from bs=4, ga=4. I can't really draw many conclusions right now, as I just need to wait and see before making more judgments on whether it works or not.

Although, I'm tempted to try the quarter sized models again. Technically I think they can be fine, since I think I fixed it outputting the combined AR+NAR audio after I gave up on it, and it'd be much, much faster to train.

In case I keel over and die or go AWOL, I'm putting my current progress and dataset onto a HuggingFace dataset repo. It'll also help me whenever I finally cave and rent out an actual machine to train this bitch on.


Also, the site *should* be actually fixed now. I migrated the gitea setup from an LX-brand Ubuntu zone into a normal SmartOS VM (because SystemD fucking sucks and won't tell me what's wrong), and I was able to narrow it down to a `1040: Too Many Connections` issue from using a neglected SQL VM.

Apologies for it being down however many times, I guess the increase in traffic caused issues. I'm not sure why though, as I have a MediaWiki on the same machine but in a different VM that gets 5x the traffic here and it hasn't given me these issues.

Author
Owner

Oh, actually, there is a repo for it: https://github.com/serp-ai/bark-with-voice-clone

I'll play around with it. If it proves favorable then I guess I won't need VALL-E.


I tried that fork, and the voice replication is comparable to using the non-finetuned custom voice of Tortoise, in that it kind of replicates the voice of characters, but it doesn't do well with anything outside of audiobook-type voices... still pretty neat at least.

Author
Owner

I whipped up a small script to play around with it, and I had zero hitches actually getting it to run (which in retrospect I guess was lucky, as apparently people have had it not work, given the issues).

Terrible results. I tried a segment of SA2 Knuckles I already cracked out and the result is unusable. I also used a default provided speaker voice and it's mostly unusable. I'm not sure if it's related to using the small models (as this was running on my 2060, the 4070TI is still training) or not, but I might take a crack at it later with the non-small model.

If it's something inherently wrong with the repo, then at least I got an idea on generating the .npz speaker files, and the code for that can live in this repo.

*I suppose* I'll still add it in. I have an idea on how I would extend backends, so if anything it'll be for that.


mrq it's really sad that you are the entire hope for the open source TTS community right now and you are using a 4070. If you open a patreon, I'll donate $50 towards your compute costs and I think some others would too.

Author
Owner

AIVC has Bark integration. I don't really need to use any of the forks, as:

  • Fictiverse seems to mostly focus on a web UI
  • JonathanFly seems to mostly focus on saving randomly generated voices and workarounds that technically AIVC already accounts for with the generation code
  • serp-ai also seems to have a wrapper function to work around things

Relying on the main repo just seems better, as I don't have to wait for a fork maintainer to merge upstream commits.

It's extremely kludgy, as it requires your voices to already be transcribed with Whisper in order to use them (because generating speaker files requires a text transcription anyways). Output sounds like puke at best and dogshit at worst, so I don't actually think it should be used.

But if you do want to toy with it:

  • `git clone https://github.com/suno-ai/bark ./modules/bark`
  • `pip3 install -e ./modules/bark`
  • `start.sh --tts-backend='bark'`

This way is required because I don't have a way to inject speaker prompt paths anywhere outside of the default one, and this way will keep some uniformity between OS's (not that I have tested this on Windows, much less expect it to work under Windows, much less care if it does at the moment). This also implies DirectML won't get any love; it seems bark loves to use flags like `use_gpu=True` rather than `device='cuda'`.

A ton of settings aren't used; the temperature slider works for both `text_temp` and `waveform_temp`, because I can't be assed to modify the existing `generation_proxy` function on the web UI side. You are required to have already transcribed your target voice in the web UI. When generating, it'll pick a random transcription to use as the source. I do not have a convenient way to "select" a prompt.
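For illustration, a minimal sketch of that kind of proxy, assuming bark's `generate_audio(text, history_prompt=..., text_temp=..., waveform_temp=...)` signature (the two temperature names are the ones mentioned above; the wrapper itself is hypothetical and not the web UI's actual `generation_proxy`):

```python
from bark import generate_audio  # suno-ai/bark

def generate_with_temperature(text, speaker_prompt=None, temperature=0.7):
    """Hypothetical proxy: one temperature slider drives both bark temperatures."""
    return generate_audio(
        text,
        history_prompt=speaker_prompt,  # speaker prompt (preset name or .npz, depending on bark version)
        text_temp=temperature,          # semantic/text sampling temperature
        waveform_temp=temperature,      # coarse/fine audio sampling temperature
    )
```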

I figured I might as well get it added in and wait for things to mature. This will not replace TorToiSe, like, at all.


> it's really sad that you are the entire hope for the open source TTS community right now and you are using a 4070[ti]

Nah, I'm just being both stingy and stubborn. Money isn't an issue, as it hasn't been for several of my endeavors. I also refuse to spend any more money on Ngreedia cards, much less, another GPU (I'm still stuck with a 2060, two 6800XTs, and now this 4070Ti I'm feeling some remorse for).

I'll be fine.

Author
Owner

I had a pretty lengthy progress report (despite framing it as brief), but I felt it was much ado about nothing, and might have painted too high an expectation (the kind I keep making and then breaking without realizing it). Anyways:

  • I was wrong about the gradient accumulation being "too big": reducing it harmed my throughput, my metrics were noisy for that entire period, my losses actually went up, blah blah. A de facto batch of 256 seems fine.
  • my blind faith in my LR scheduling has bitten me in the ass. It might be fine for models that are overfitting or for datasets that are too small, but it seems now it's not helpful in the slightest. Future LR/optimizer state restarts will follow the tried and true method used for TorToiSe finetuning and not do any warming up, and instead just decay similarly (a rough sketch of what I mean is below this list). After reverting my gradient accumulation factor back to 16, I went ahead and skipped straight to the decaying, and my losses are going down nicely again, even the NAR, which was very stubborn and at best stagnated and at worst kept spiking.
  • I reduced the max phoneme count for data from 80 to 64. While it reduces the total dataset size to 140656, it cuts the average iteration time from ~1.6s/it to ~1.0s/it, and reduces VRAM use a smidge, as I would semi-rarely OOM during training.
    • in the future, I might train on the culled-for-being-too-long dataset with a smaller batch size.
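As a rough illustration of the decay-only restart mentioned above, a sketch against a plain PyTorch optimizer (the real run drives this through the trainer's DeepSpeed config, and the LR, milestones, and gamma here are made-up values):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the AR/NAR model
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4)  # assumed restart LR

# no warmup: start at the restart LR and only step it down from there
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10_000, 20_000, 40_000], gamma=0.5
)

for step in range(50_000):
    # ... forward / backward would go here ...
    optimizer.step()
    scheduler.step()
```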

The above changes have me not stressing so much about training now. I just need to remember to stop making the same mistakes again and again.

And when this run is over (I am not making any promises):

  • the aforementioned training on the longer pieces of data, while also incorporating a better LR scheduling strategy (without long warmups).
  • revisiting quarter sized models, and seeing how they fare again. It's a bit of a crime to give up on it before fixing the issues I had.
  • revisiting my 2x6800XTs, although I imagine they're still slow because ROCm (although I wonder if there was a hidden penalty to int64).
  • probably renting an 8x4090 machine I was thinking about to train on. Although, the cost to rent is going to explode a bit too fast, I imagine. There's Azure, but trying to figure out how to even get a machine spun up is way over my head.
  • [something involving this paper](https://arxiv.org/pdf/2304.03442.pdf), I guess, while I wait for more training to be done.
  • rewriting AIVC, or at least cleaning it up. It's such a mess of hairy spaghetti, it's starting to worry me.
  • actually documenting my VALL-E fork and training it.

iunno, Bark sounding so terrible seems to put more pressure on getting this model trained. I don't know how it can sound so bad. Even the output for the past month of training at least maintained some semblance of the source. But Bark didn't sound anything like the demos.


Again, I'd like to consider contributing some money towards the cloud compute costs if possible. Opening a patreon would be good.

Author
Owner

Slight progress report again: things are going swimmingly again. Losses are going down (not as fast or as apparent as I wish, but still going down), and accuracies are going up (again, not as fast as I wish).

I suppose, given how the evaluation / validation output consistently sounds, it's doing a great job at replicating the acoustic "environment" of a speaker (good), but still has a lot more to go in order to consistently synthesize speech.

  • I imagine this emerged from the training approach the implementation takes, where it's fed random utterances from the same speaker to better learn the acoustics of a speaker, rather than just being fed the actual source.
    • I feel like I can remedy this by having each piece of data of a batch randomly select between shuffling utterances for the input prompt (default behavior), or ignore the input prompt and just use the source audio as the input prompt (which should also cut down VRAM usage).
      • The issue is that this is just conjecture, given how the model is currently behaving. I think training for TorToiSe doesn't shuffle for random utterances, so it might really be the difference maker.
  • Speakers seem to fall into two camps:
    • speakers with small datasets are "learned" quickly and sound better, due to less variance in the input prompts.
      • I wonder if this applies to voices with similar acoustics, which might explain how some LibriTTS voices are sounding solved.
    • speakers with large datasets take quite a while, as the model hasn't quite figured out how to synthesize speech, just mimicking acoustics for a given prompt
      • I wonder what neat shit could be done with it trying to solve for acoustics more
      • this doesn't quite explain how voices with similar acoustics (example, SA2 voices, Half-Life voices) tend to "cross-talk" / contaminate (for example, I've had the HEV suit come out through the train ride voice, I've had Barney and Gman swap, and I've had Rouge and Knuckles swap).

I suppose that's my #1 worry now: trying to nudge it in the right direction to start prioritizing speech synthesis itself rather than just deriving acoustics. Sure, it's far, far, far, far, far better that it's this way around (solving acoustics, then trying to solve speech, rather than solving speech, then trying to have it clone). But figuring out how to better goad it into solving speech synthesis, rather than just replicating acoustics, is what I should focus on more.

I guess when this batch is done (probably next week, at this rate), I'll:

  • set the minimum phoneme count for the dataset to 64 (rather than it being the maximum) to target the data I've had culled for who knows how long due to being too long.
  • increase the input prompt size (from three seconds to I don't know, maybe five or six).
  • reduce my batch size (because I know I'll OOM).
  • implement randomly switching between the random utterances for input prompts and just using the target / reference to goad it better into synthesizing speech (and set the odds behind some config value for tuning; a rough sketch of what I mean is below).
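Roughly along these lines, as a sketch only (the function and parameter names are made up, not the fork's actual API):

```python
import random

def pick_input_prompt(sample, speaker_utterances, use_reference_prob=0.5, prompt_count=3):
    """Hypothetical prompt selection for one training sample.

    With probability `use_reference_prob`, reuse the target utterance itself as the
    acoustic prompt (nudging the model towards speech content); otherwise keep the
    default behavior of shuffling other utterances from the same speaker
    (nudging it towards the speaker's acoustics).
    """
    if random.random() < use_reference_prob:
        return [sample]  # the reference clip doubles as the input prompt
    others = [u for u in speaker_utterances if u is not sample] or [sample]
    random.shuffle(others)
    return others[:prompt_count]
```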

Outputs (evaluation output):

  • https://vocaroo.com/1cJyF5mxF7xd ([reference](https://vocaroo.com/1h1i9cBpaZCX))
  • https://vocaroo.com/1l9qU2plUh22 ([reference](https://vocaroo.com/1gK4rUsRe5ZT))
  • https://vocaroo.com/19c5ve3LhROv ([reference](https://vocaroo.com/1iiJX9zabVvv))

Author
Owner

Progress report: since my metrics seemed to have flatlined after running through the LR schedule, I went ahead and:

  • swapped to the bigger portion of the dataset that got culled (phoneme lengths between 64 and 128, before it culled anything bigger than 64 phonemes).
  • increased the input prompt size from three seconds to six seconds (although I think I should actually keep this at three, it might be better to just hard enforce a three second input prompt).
  • set the probability of using the reference clip, instead of random utterances from a speaker, to happen 35% of the time now.
  • reduced my batch size from 16 to 8 to make up for the increased VRAM usage.

To spare the boring details, my losses jumped back up a bit, but not as bad as every other step-restart. I'm not sure what to make of it.

  • the evaluation / validation audio isn't that bad, but it still needs to get back up to par to sound good.
    • I thought my validation dataset would remain the same, but it seems it also changed from shifting my restrictions, so I don't have a like-for-like baseline.
  • I guess it's reasonable to assume that the model has been getting better and better at fitment, as the speakers are the same, it's just which utterances are used (and the lengths).
  • I suppose it's also reasonable to assume that the model moreso got used to shorter utterances, so training on longer ones should help? My main issue is it falling apart when it would generate longer sentences, rather than small pieces, so this should help resolve that.
  • I'm unsure if I should have done this earlier, as it effectively would have undone a good portion of my training work at the end, but I suppose the only way to verify is to revert to the old restrictions and see how the losses are.

It'd be nice if I had a way to dynamically swap between the two datasets (larger batch size but smaller data, and smaller batch size but bigger data) to try and keep the model from fixating on lengths, but I need a bigger brain (or the attention span) to do that. A sketch of one crude way to do it is below.
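One crude approach would be alternating two loaders on a fixed cadence; a sketch only, with made-up names and no relation to the fork's actual data loading code:

```python
from itertools import cycle

def alternating_batches(short_loader, long_loader, swap_every=100):
    """Yield training batches, swapping datasets every `swap_every` steps.

    short_loader: larger batch size over the short-utterance set.
    long_loader:  smaller batch size over the long-utterance set.
    Iterators persist across swaps, so each dataset still gets walked in full.
    """
    iterators = {"short": iter(short_loader), "long": iter(long_loader)}
    loaders = {"short": short_loader, "long": long_loader}
    for name in cycle(["short", "long"]):
        for _ in range(swap_every):
            try:
                batch = next(iterators[name])
            except StopIteration:
                iterators[name] = iter(loaders[name])  # restart this dataset's epoch
                batch = next(iterators[name])
            yield batch
```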

It just kinda blows (not in a good way) that I still haven't got something decent to show for it, outside of some validation clips I manage to pick out amongst the clips that sound okay but not that great. I suppose as long as TorToiSe is still serviceable, then it's not that big of a deal to stress over.

iunno, I feel like I'm going back to the FUD phase of the training cycle, where I'm fretting over probably everything that could be wrong. It's not as bad as the other dozens of times, at least.


Also, I failed to realize I still have the `train-other-500` dataset for LibriTTS, so I'll let my 2060 crunch at it over the next three days (since I think `train-clean-360` took a day and a half). By that time:

  • I should have a good idea on whether or not to continue baking the model with the larger portion of the dataset (>=64 phonemes).
  • should have a way to alternate between a "small data, large batch size" dataset and a "large data, small batch size" dataset to compromise between the two.
  • have a Dockerfile ready to slap this onto some 4090s and train and see how far I can get with it.

I think I've fudged up with underestimating how crucial it is to just have a large dataset, rather than just a narrower but more real-world one.


I made the mistake of slapping my 2x6800XTs back into my training system to see how it would fare: it did not. For some reason training was completely impossible under ROCm; it kept throwing errors during the forward pass about tensors being wrong or whatever, so I guess I can't really check without devoting a day to it. Oh well.


If you still need more data, I'd recommend checking out the VoxCeleb dataset.
It advertises over 7000 celebrity voices and over 2000 hours of audio, so it's a fairly large one. The dataset references YouTube URLs, and provides frame ranges for the relevant utterances (it also has face tracking data, but you can ignore that). The main inconveniences are that there aren't any .wav files to download, so you need to download the relevant audio and then extract the utterances based on the frame numbers; that some links may be dead; and that utterance transcriptions aren't distributed in the public dataset.

There are two versions of the dataset, VoxCeleb1, which has 150,000+ utterance references from 1251 celebrities, and VoxCeleb2, which has 1,000,000+ utterance references from 6112 celebrities.

Here's where you can get the dataset:
http://mm.kaist.ac.kr/datasets/voxceleb/index.html

Here's some random example videos from the dataset:
https://www.youtube.com/watch?v=0rpfN7wThsg
https://www.youtube.com/watch?v=jUSC4i_eGHs
https://www.youtube.com/watch?v=Tzs_CTbHT9Y
https://www.youtube.com/watch?v=PfcJLmkhGbk


Here's a download script for the dataset if you end up using it:

```
#!/bin/bash

frame_rate=25

# Loop through all speaker/video subdirectories
for speaker in *; do
  for video in "$speaker"/*; do
    video_ref=$(basename "$video")
    # grab the full audio for this video reference
    yt-dlp -x --audio-format mp3 --output "$video/$video_ref.%(ext)s" "https://www.youtube.com/watch?v=$video_ref"

    for file in "$video"/*.txt; do
      # get start and end frames and remove leading zeros
      start=$(grep -A1 "FRAME" "$file" | tail -n1 | awk '{print $1}' | sed 's/^0*//')
      end=$(tail -n1 "$file" | awk '{print $1}' | sed 's/^0*//')
      # convert frames to seconds and cut the utterance out
      ffmpeg -i "$video/$video_ref.mp3" -ss "$(echo "scale=2; $start / $frame_rate" | bc)" -to "$(echo "scale=2; $end / $frame_rate" | bc)" -c copy "${file%.*}.mp3"
    done
    rm "$video/$video_ref.mp3"
  done
done
```

I was just playing around with vast.ai, a GPU peer sharing service and my first impression is that it works really well. Used it with the paperspace notebook and it seems pretty robust.

You can get a 4090 for 43 cents per hour when I checked, although it varies. Each user has a limit on how many days you can use it consecutively, so in that regard it seems a lot more dependable than paperspace.

This could be a way to really get a nice model going. Fuck I'd even chip in a couple of bucks.

Also, are you currently using audiobooks for training? I composed a 900 hour Belgian Dutch dataset just from ripping audiobooks from Storytel using [this](https://github.com/jo1gi/audiobook-dl), using a free trial as well, so it didn't even cost me anything. This seems like a no-brainer, seeing as the original creator of tortoise also used a lot of audiobooks, and this way we can get a chonky dataset in no time. Just have to download a variety of speakers, but that should be much easier in English.

If you want I could make a balanced dataset of male female speakers and send it to you for transcribing. Or run it on my 3060 TI which can run the large-v2 model using whisperx's v3 branch.

Finally, to transcribe my dataset I wrote a script which takes the word-level timestamps that whisperx spits out and merges these together to form natural sentences between a given minimum and maximum length. All you have to do is then slice your dataset using ffmpeg. If it's any help to you, I could clean it up (because it was written at 3 AM) and post it here.
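For what it's worth, a sketch of that kind of merge step, assuming whisperX-style word entries with `word`, `start`, and `end` fields (the schema and the length thresholds here are assumptions, not the actual script):

```python
def merge_words_into_sentences(words, min_len=1.0, max_len=11.0):
    """Group word-level timestamps into slices between min_len and max_len seconds.

    `words` is a list of dicts like {"word": "hello", "start": 0.42, "end": 0.61}.
    A slice is closed on sentence-ending punctuation once it's past min_len,
    or as soon as it reaches max_len.
    """
    def flush(chunk):
        return {
            "text": " ".join(w["word"].strip() for w in chunk),
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
        }

    sentences, current = [], []
    for w in words:
        current.append(w)
        duration = current[-1]["end"] - current[0]["start"]
        ends_sentence = w["word"].strip().endswith((".", "?", "!"))
        if (ends_sentence and duration >= min_len) or duration >= max_len:
            sentences.append(flush(current))
            current = []
    if current:  # keep the trailing partial slice too
        sentences.append(flush(current))
    return sentences
```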

Author
Owner

I meant to send this earlier, but I kept getting sidetracked. Oops.


> If you still need more data, I'd recommend checking out the VoxCeleb dataset.

Mmm, my only qualm with that is:

> overlapping speech

I could be smart and diarize during transcription, and discard any output that reports multiple speakers, but I honestly don't know how much I can trust it, as I tried using it for something completely unrelated to the project, and it failed me.
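The filter itself would be simple enough; a sketch assuming pyannote.audio's diarization pipeline (the model name and auth handling are assumptions, and the trust issue above still applies):

```python
from pyannote.audio import Pipeline

# assumed pretrained pipeline; in practice this needs a HuggingFace auth token
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

def is_single_speaker(wav_path):
    """Return True if diarization reports exactly one speaker for the clip."""
    diarization = pipeline(wav_path)
    speakers = {label for _, _, label in diarization.itertracks(yield_label=True)}
    return len(speakers) <= 1

# keep only the clips that diarize cleanly
# clips = [path for path in clips if is_single_speaker(path)]
```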

I'll keep it in mind when I need to feed more data, but I think I hit the point of diminishing returns for adding more data:


LibriTTS's `train-other-500` has been transcribed, quantized, and phonemized, and with some changes to the data loading procedure, I am now at:

  • 532 trimmed hours
  • ~~543478 samples~~ 536512 samples
  • 2238 speakers
  • minimum phoneme length of 4, maximum phoneme length of ~~192~~ 100 (there's only like, 7k samples above this mark, I can't be assed to test it with it higher right now as I worry it'll break the balance.)
  • ~~batch size 8 (YUCK. I can't get anything stable at even bs=10 without it OOMing during the backwards pass, which sucks because my card can definitely have a bigger batch size without harming the throughput all that much.)~~ batch size 16 (I pulled every remaining optimization out of my ass to get it stable)

Despite the dataset size 3xing (some of that does have to do with increasing the maximum phoneme length), my existing losses and accuracies haven't taken that much of a hit. I suppose this is a good sign, as the model hasn't been overfitting for the existing dataset, and can perform fine against new data (although that was evident when the validation output is at-parity with the training dataset).

So I'm a bit at a loss:

  • technically the training dataset should be at parity with the unofficial-homebrewed model trained with the newer implementation. I say technically, since while I do have the entirety of the training set for LibriTTS implemented, I did make slightly tighter cuts for them, and I probably have additional tokens for phonemes, and I have all the other extra non-LibriTTS stuff.
  • I think any noticeable jumps in the training metrics when I feed the beast will require an astronomical amount of new data, as I'm only at ~532 hours compared to the original paper saying it was trained on LibriLight's 60K hours. >100x'ing the dataset is a bit of a boon to tackle, but seeing as the other implementation has had somewhat decent output from the homebrewed model, I suppose all the extra "in-context learning" and other "emergent properties" of an LLM aren't that necessary. If anything, my original goal was to get a model anyways to finetune from there, so.

And coincidentally, my next set of notes:


> was just playing around with vast.ai

I've used runpod before to get a rough idea on whether to go balls deep into a 4090 or if a 4070TI is "good enough", before:

  • adding my optimizations for training
  • realizing that the extra VRAM would really be nice,,,,,,,,,, (since I'm pretty much bottlenecked by small batch sizes, I can increase it more and have a marginal throughput hit... sort of)

My only issues with it are:

  • trying to game the system with reducing how much I'll pay for storage
  • having to go through a rigmarole of installing additional dependencies (namely `nvcc-cuda-12-1` or something). This used to be a bit of a bigger pain with more outdated CUDA libraries and Python libs, but the supplied Docker image has been updated to be less of a pain.
  • the urgency of trying to get everything set up as fast as I can since it charges hourly.

and most importantly:

**Training doesn't seem to actually use multiple GPUs**

I noticed this with my 2x6800XTs, but didn't think much of it, but I tried this morning with 2x4090s and:

  • while the other GPU will have a load on it, it doesn't seem to actually be useful for anything at all.
  • the `total dataset size / (batch size * GPUs)` doesn't change between single and distributed.
  • the iteration rate actually gets worse.

So I'm at a loss for that too. I don't really have an idea how to diagnose it, and my only good lead is to dig through DL Art School to see how it does it, as I don't think DeepSpeed has any actual examples of it being used like the first implementation does it.
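One low-effort sanity check, as a sketch: log what the launcher actually hands each process. This only relies on the standard distributed environment variables and `torch.distributed`, nothing fork-specific:

```python
import os
import torch
import torch.distributed as dist

def report_distributed_state():
    """Print what this process believes about the distributed setup."""
    print("LOCAL_RANK:", os.environ.get("LOCAL_RANK"),
          "RANK:", os.environ.get("RANK"),
          "WORLD_SIZE:", os.environ.get("WORLD_SIZE"))
    if dist.is_available() and dist.is_initialized():
        print("world size:", dist.get_world_size(), "rank:", dist.get_rank())
    else:
        print("torch.distributed is NOT initialized; the second GPU is dead weight")
    print("visible CUDA devices:", torch.cuda.device_count())

# if this only ever prints once, or WORLD_SIZE never goes above 1,
# the launcher isn't actually spawning one process per GPU
```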

Even then, just a single 4090 didn't seem to offer that much of an uplift to warrant just throwing everything onto a rented server and stressing even more about getting the most out of it. Sure, I can semi-comfortably set the batch size to 24 over 8 with the current settings, but each iteration takes about 2x as long as what I'm getting now, so it's more like 1.3x the throughput (I don't know if there's also some penalty from it being in Docker vs bare metal). So I suppose I did make the sensible move of not paying twice as much for a 4090 over my 4070Ti (but I'm still aching from being VRAM starved).


In short, I'm a bit at a loss.

  • I'm hitting diminishing returns with more data. I think any more additions to the dataset has to at least double it in size for even a chance of it making more of an impact.
    • and where should I balance it around, as a bigger dataset means all the smaller speakers will be visited less for training against.
  • I actually can't use multiple GPUs to train without having to CBT myself in finding the problem.
    • assuming that it actually isn't working. For all I can tell, it's not actually working.
  • I'm simultaneously being VRAM starved and not.
    • I really need to find a better strategy for sampling the dataset for batching.
  • Do I even need to do anything, or should I just let it sit and train for however long now.
> I think any noticeable jumps in the training metrics when I feed the beast will require an astronomical amount of new data, as I'm only at ~532 hours compared to the original paper saying it was trained on LibriLight's 60K hours. >100x'ing the dataset is a bit of a boon to tackle.

So what's the limiting factor in just using that 60k hour dataset (I'm guessing compute)? As for the balancing problem: can we not just restart the dataset from scratch and just alternate between male and female spoken audiobooks (like I said in my previous post)? Maybe trim each one so it has a max length of 5 hours? That would balance all the speakers. I believe tortoise's creator used 2 hour long audiobooks to create most of his 50k hour dataset. You could start with 5k hours, that's still a 10X increase. Also just to check, you're using the V3 branch of whisperX, right? That thing is a lot faster than the main branch and lowers VRAM usage.

As for the batching problem, what exactly are you trying to solve? Is it just slicing segments to different lengths that you're after? My bad if these are braindead questions.

Author
Owner

> So what's the limiting factor in just using that 60k hour dataset

In order:

  • disk space
    • not a big deal, I can move things around on my bunch of spinning rust, or transcribe in pieces, since the final slices for training are much much smaller.
  • compute time for transcription
    • not so much of a deal, just a chore to wait and supervise. If I use faster-whisper backed whisperX, then it should be faster, but only if the audio wasn't already pre-sliced (which I think it isn't from what I recall from seeing the small version)
  • training
    • right now, after all of my further savings to eke out a larger batch size, the estimated time to go through an epoch is about 14 hours (rough arithmetic below this list).
    • and this concern is a bit hard to articulate, but I'm not sure where the balance is for "large dataset to avoid overfitting" versus "being able to revisit everything in the dataset more often". I know the older portion of the dataset wasn't sounding all that well before, but I haven't got a chance to check the output all that well the past two days.
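Rough arithmetic behind that ~14 hour figure (the sample count and batch size are from the stats above; the per-iteration time is an assumption):

```python
samples, batch_size = 536_512, 16
sec_per_iteration = 1.5                        # assumed; the real rate varies with utterance length
iterations_per_epoch = samples / batch_size    # ~33,532 iterations
hours_per_epoch = iterations_per_epoch * sec_per_iteration / 3600
print(round(hours_per_epoch, 1))               # ~14.0 hours
```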

> Can we not just restart the dataset from scratch

No real point in it, as even not-so-good weights are better than a clean slate. I've already done LR/optimizer restarts a lot anyways as I kept adding more and more (except the last one), so it's sort of already been "restarted", save for the new weights.

> balancing

Isn't so much of a concern. The speaker balancing from the original implementation should be good enough for "balancing".

> As for the batching problem, what exactly are you trying to solve? Is it just slicing segments to different lengths that you're after?

The data loader that assembles the batches to train against is fairly naïve in terms of trying to balance the actual size in memory it takes.

My initial solution was to have a better data loader that could aim for a target "token" length, which I believe is what the newer implementation does (and the VALL-E paper might do, as it says its batch size is by acoustic token length).
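Roughly what that would look like, as a sketch (a greedy bucketing pass with a made-up token budget, not the newer implementation's actual sampler):

```python
def batch_by_token_budget(samples, max_tokens=4096):
    """Greedily pack samples into batches that stay under a total token budget.

    `samples` is a list of (sample_id, token_length) pairs; sorting by length first
    keeps the padding waste within each batch small.
    """
    batches, current, current_tokens = [], [], 0
    for sample_id, length in sorted(samples, key=lambda s: s[1]):
        if current and current_tokens + length > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sample_id)
        current_tokens += length
    if current:
        batches.append(current)
    return batches
```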

Now, I'm pretty sure this is actually a bit of a red herring, as I've sussed out some causes:

  • some DeepSpeed optimizations had some funny values that aren't explained well, which led to increased VRAM usage for the backwards pass / gradients, since that was the common place I'd OOM.
  • explicitly deleting scratch tensors (sketched below), as the forward pass will do a few tensor merges and concats, which seemed to help not OOM during the backwards pass.

So in reality I don't think I need to touch the data loader.
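For reference, the kind of explicit cleanup meant above, sketched with made-up tensor names (the actual forward pass obviously differs):

```python
import torch

def forward_sketch(transformer, text_emb, prompt_emb, resp_emb):
    # the forward pass concatenates a few embeddings into one scratch tensor
    merged = torch.cat([text_emb, prompt_emb, resp_emb], dim=1)
    out = transformer(merged)
    # drop the local reference so the only thing keeping the concatenated tensor
    # alive through the backwards pass is whatever autograd actually saved
    del merged
    return out
```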

> Also just to check, you're using the V3 branch of whisperX right?

I tried it before, but it didn't give much of an uplift on smaller segments, which was the case for using whisperX's batching (it didn't get any faster on small segments). Also, I would need to fiddle around with re-parsing the output, as v3 breaks compat.


I think training should be fine if I just let it bake at a low, low LR now and let it do its thing for... however much longer.


However, I'm having doubts again. I forgot TorToiSe was trained on 50k hours and, for what it's worth, is still a damn decent model, while the newer implementation's homebrewed model was on ~550h or so. My concern of "visiting things more often" is probably just an issue I should really only solve with longer training times. Mmm...


I tried training the other fancy VALL-E implementation with a large dataset I crawled (~10k). After around 5m steps, the audio quality is nowhere near Tortoise. It is also pretty unstable. It might be just me being stupid, but I have my qualms that this model will never be like good old


Also, I say, sitting on your ass and waiting for a couple of weeks is the best engineering effort you'd do to shine in better results.

Author
Owner

> the audio quality is nowhere near Tortoise. It is also pretty unstable. It might be just me being stupid but I have my qualm that this model will never be like good old

That's pretty much how I felt when I messed with it after the weights got posted for the homebrewed model; it left a lot to be desired and it was just better to use TorToiSe + finetunes than wrestle with it.

My main cope is that it's just flawed from oversights with the implementation that aren't issues with the one I forked (namely shuffling for random utterances as the input prompt rather than use the target audio as the input prompt).

I'm quite happy with what evaluation / validation output it does produce, at least, before the accuracy dropped from increasing the dataset.

> Also, I say, sitting on your ass and waiting for a couple of weeks is the best engineering effort you'd do to shine in better results.

That's what I keep telling myself, and keep having a hard time actually doing it. I'm just dreadfully impatient.

I keep looking at the metrics stagnating and worry that I need to change something: ![image](/attachments/b9e65693-498b-46d0-8991-483fa594d901)

When in reality it's probably just from my LR being too high, and I just need to wait for the LR to decay low enough and have it sit there to see any movement.

Author
Owner

Mmm... I think I fucked up the training script.

There's been no movement for the past few days, and I removed the `train-other-500` dataset and, while the loss / accuracies moved, they still aren't changing over time. I even tested with quarter sized models and there's no movement either, so I definitely botched something.
I'm so mad since I effectively wasted another 3-4 days.


Seemed to have been an odd mix between the DeepSpeed version I had, and moving the engine/models between CPU and GPU, which I guess actually fucks shit up. Ugh.


Well that's good news I guess, those metrics did look pretty bad


Thank you for your work. I've been on the lifeiteng version, and also been failing to get any good results. I was hoping to try your version next, but I'm unable to find a script that you used to preprocess the libritts dataset. Like I see the scripts to download and quantize librilight-tts but not LibriTTS.

Author
Owner

> those metrics did look pretty bad

They still look pretty bad 1.5 epochs in, but it at least seems to be showing it's "learning" from the gradient norms getting smaller, and a random spike in the losses.


I was hoping to try your version next

Things should be stable enough to train with my fork. I just haven't been actively advising anyone to train with it given it's quite the pill to homebrew (and I still actually need to figure out why multi-GPU training doesn't seem to be working).

Like I see the scripts to download and quantize librilight-tts but not LibriTTS.

Right. I forgot to re-adapt the script I used for re-labelling LibriTTS with the one I cobbled to test on runpod instances.

I can't quite remember how much it really does help to properly transcribe / slice using AIVC's web UI over just shortcutting it with the already-provided transcriptions and without slicing the utterances down. I think at the end of the day it shouldn't matter all that much now with all the VRAM-savings I pulled out of my ass, but if you're after the entire training+validation LibriTTS, I can just provide the dataset itself when I get a chance.

Author
Owner

Dreaded progress report: ![image](/attachments/e51cfe92-7bc9-4843-8800-02ba0ed422aa)

  • as noted with the dip at the end, that's from me removing LibriTTS's train-other-500 subset, and the average loss per iteration going down from seeing the old data more
    • there's been very little progress in terms of the average loss going down / the average accuracy going up over just 6 epochs.
    • I hope that, by reverting to the older dataset, nothing can go wrong, and things will get better.
      • Crossing my fingers, as it did show progress on both the <=64 phoneme-length and >=64 phoneme-length data.
    • the loss not quite going back to where it was before I tossed train-other-500 in maybe shows some good signs that it did at least do something to the model? (COPIUM).
  • there was something odd that happened last night, where, when I woke up to check the SSH session, the average iteration rate was reporting as ~32s/it, and it returned fine after restarting training. I'm not sure why, but I imagine it's probably because, ironically, it's the longest the trainer has run without a restart (a good testament to all the VRAM stabilization optimizations, I suppose).
  • I haven't even bothered actually checking the output. Partly because I worry that restarting SDDM and running an X session will be just enough VRAM to have it trip up the training script, since running the web UI off my 6800XT sometimes will cause training off the 4070Ti to restart.
  • my rough mental math was charting a decent training time, before this change, at like 40 days. About a week of barely perceptible progress is a no go. I might return to this if I grow a wild hair and train a quarter-sized model, if (when) this is done.
  • the usual problem I noticed when I did check the evaluation / validation output was that the NAR was fine (and still reporting to be "worse" than the AR), but the AR being not-quite-there means the combined output wasn't quite decent. Non-LibriTTS voices also kept struggling to sound right, so I imagine I might have actually done a better job with clean weights after all, but that would still mean scrapping the model and starting again.

It also doesn't help I'm split between a few other things I want to work on, dividing my attention even further.

Oh well. I'll judge things again in another week.

Author
Owner

I'll just do my weekly evaluation a little bit ahead of time.

I think the AR fried.

Despite the loss slowly going down, the range between the metrics is even more chaotic, and the evaluation / validation output sounds awful; it managed to be worse than before I meddled and added in train-other-500.

I do have backups from every time before I modified the dataset, but I don't know if I should bother taking the risk of it frying again if it could also be from all the other previous re-use of the weights over and over and over again.

But I think at this point, after constantly reusing weights over and over again every time the dataset grew, I should just take a page from LLaMa-variant trainings and start from scratch with a small model, then after getting something actually usable, do the big model. I had one of my typical lists for cope points on why, but they just boil down to it being much, much, much faster to train it (eyeballing it, it's like 6x throughput).

Sucks it's about two (three? I can't remember desu) months just to realize the weights are doomed, but you got to break some eggs. A lot of trial and errors and errors and errors are needed to nail out all the issues to get it to be easy for off-the-shelf cards to train off of.

I just worry that this is going to be another timesink where quarter-sized models just aren't viable at all. However, the metrics are looking pretty good for being at about epoch 3. ![image](/attachments/73824ac9-24c9-490d-946d-e93c0155aeca)

Author
Owner

Actual weekly progress report:

I feel very, very stupid for burning so much time being stubborn. Restarting the weights was actually the correct call, as the results are looking pretty good. This is with training a quarter sized model over three days, a little over 40 epochs and 40000 iterations with the dataset before adding in the train-other-500 portion:

![image](/attachments/ba310dce-1049-412a-829a-e45f635337d4)

For a quarter sized model and a few days, it's pretty good. However, I'm not sure if it's because of the model size, but I cannot go any lower than AR loss=~3.6, no matter what LR I leave it running at (I tried high, I tried tiny, I tried average, and I left it decaying between the two in hopes it'd find a sweet spot; no go).

So, I think it was last night, I grew a wild hair and restarted the training, but with the train-other-500 dataset included too, feeding the beast my most complete dataset, and:

![image](/attachments/2df699c3-daa0-4082-93e5-432cefd098d1)

In just 7000 iterations and a little under three epochs, it's already at the same progress as the previous test run, and it seems it can breach the AR loss=~3.6 floor. My only worry is that my LR is still too high, as I started from a much, much, much higher peak of 1.5e-3.


I haven't gotten a chance to start an Xorg session again and check the evaluation / validation output of either models, but given the metrics, I can assume they're decent, but not quite there yet, as the accuracies are still not where I know they shine at.

Also, I guess this settles my doubts in favor of a large dataset over "muh epochs", as the importance of epochs wanes the larger the dataset itself is. Which sucks, because now I'm going to have to go back and find more and more and more and more data to feed the beast with, since just adding back in train-other-500 really boosted things given the small model size.

I think right now my only fear is that there is a floor of how low the loss can go for a given model size, since it's already looking like it hit that floor again, as the curves are approaching that asymptote.


That actually looks encouraging. I'd give it some more time. Do you have a loss target in mind?

I do however wonder how it would fare if you gave it like 2000 hours worth of speech to train on though. Want me to rip you some copyrighted audiobooks just in case? The alternative would be that librivox dataset. Seems easier than just picking up small bits of audio here and there.

On a sidenote, I really want to know how 11labs does their TTS. Theirs still sound a little better than tortoise's finetuned models. Did they just use tortoise and throw computing power at it you think?

Author
Owner

Do you have a loss target in mind?

Not necessarily a target loss, but moreso a mix of playing it by ear from the output, and the reported AR accuracy being >90%. I can't remember what loss correlated to it when I was doing longer runs on the smaller datasets, though.

It's a bit of a pickle too, since good output quality is mostly predicated on the AR being up to par; no point in a good NAR if the output from the AR fed into it isn't good.

I do however wonder how it would fare if you gave it like 2000 hours worth of speech to train on though

Seeing how well it performed relative to the epoch count unironically whitepilled me on the whole ordeal of a big dataset. I think it could overcome the lower model parameter count, but right now the ~550+ hours is only slowly improving.

Want me to rip you some copyrighted audiobooks just in case

If it's not going to be too much of a hassle for you. Huge batches through AIVC (`./start.sh --tts-backend="vall-e"`) tend to have hiccups while transcribing/timestamping under CUDA, I found; ROCm somehow had more stability. Then there's the phonemizing/quantizing step that will hang after a bit of time regardless.

The alternative would be that librivox dataset. Seems easier than just picking up small bits of audio here and there.

There's the full 60K hours LibriLight dataset, which VALL-E originally was trained on. My only concern is if there's any overlap between it and LibriTTS; I wouldn't want to fuck up my dataset with a sizeable amount of duplicates. I could prune transcriptions with similar strings, but the issue is that LibriTTS is trimmed down while LibriLight, I believe, is one whole piece, so even just relying on the full transcription of a sound file won't do any good. I suppose I could just check for similarities and prune manually, but even then I imagine it would be an astronomical task (unless I do embedding vector similarities shit).

Theirs still sound a little better than tortoise's finetuned models. Did they just use tortoise and throw computing power at it you think?

Some idle thoughts I have had over the months are that it's definitely its own thing. From what I remember:

  • the annoying-as-hell quirk it has where the delivery is too fast isn't around in any other LM-based TTS solution that I'm aware of.
  • the other features it boasts like the Voice Lab means that the manner in which it conditions reference clips allows for actual control over the "latents" (or whatever intermediary for the input voice prompts), and I imagine it has its own model to do it, like Bark.
  • there was that mentioning of there being a limit of 30 seconds used for the reference clip. I imagine at reference-clip-processing-time, the audio is also transcribed (Bark also seems to require a transcription of the reference prompt to generate the "latents") and an embedding of the transcription is stored. At inference time, it'll use the embedding of the input text and find enough similar reference clips to fill up to the 30 seconds.
    • I was actually thinking of doing this myself for VALL-E (or TorToiSe), after dabbling in the land of text LLMs, since a similar approach is done to work around narrow context windows (a rough sketch of the idea follows below).

Although again, it's speculation.
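
To make the speculation a bit more concrete, here's a minimal sketch of that retrieval idea, assuming each stored clip's transcription gets embedded ahead of time: rank clips by similarity to the input text's embedding and greedily fill a ~30 second budget. `embed()`, the clip metadata layout, and treating 30 seconds as a hard budget are all my assumptions, not anything confirmed about 11labs.

```
# Speculative sketch only: pick reference clips whose transcription embeddings
# are closest to the input text's embedding, until ~30 seconds are gathered.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_references(text_emb: np.ndarray, clips: list[dict], budget_s: float = 30.0) -> list[dict]:
    # clips: [{"emb": np.ndarray, "duration": float, "path": str}, ...] (hypothetical layout)
    ranked = sorted(clips, key=lambda c: cosine(text_emb, c["emb"]), reverse=True)
    chosen, total = [], 0.0
    for clip in ranked:
        if total + clip["duration"] > budget_s:
            continue  # skip clips that would blow the 30 s budget
        chosen.append(clip)
        total += clip["duration"]
    return chosen

# usage: references = pick_references(embed(input_text), clip_db)  # embed() is a placeholder
```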


https://github.com/facebookresearch/fairseq/blob/main/examples/mms

facebook released some models, not sure how to use it tho


https://github.com/facebookresearch/fairseq/blob/main/examples/mms

facebook released some models, not sure how to use it tho

It's right there in the [TTS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#tts-1) and [ASR](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr-1) sections and the finetuning instructions are [here](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec#fine-tune-a-pre-trained-model-with-ctc).

Author
Owner

https://github.com/facebookresearch/fairseq/blob/main/examples/mms

The ASR might be promising at the very least, but I'm not too sure if it'd be comparable to Whisper in terms of timestamped transcriptions.

The TTS being plain jane TTS and not cloning is expected, and it being VITS is a little bit of a letdown.

and the finetuning instructions are here.

Isn't that just for finetuning a wav2vec2 model?


Quick progress update: ![image](/attachments/c9c8621e-ad49-4e9e-9d10-2ce5d50373f6)

On the full-size model, it's already at the same spot the quarter sized model was at for the same dataset at 7k iterations (of the same effective batch size). I am pleased, but still upset at my stubbornness to do a clean train before.

Thoughts:

  • not going to bother and listen to the evaluation / validation output as I'm both scared to start an Xorg session while it's training, and I don't think it's necessary at the moment, since it's not at the AR accuracy where I want it.
  • this is both technically not one unique full epoch, as it OOM'd once about 30% through and I had to restart training around noon, as the iteration rate dropped to 50s/it again. It's also technically a unique epoch, as the nature of training makes the input prompts random, since they're picked at random for a given speaker.
  • I'm a bit skeptical if I should actually bother with more data, now that I look at how the full size model behaves with my "most-complete" dataset; it already was starting to slow down well before epoch 1, so it might only serve to prolong when it hits the loss floor.
    • I might also be able to work around this with going beyond the spec and increasing the model size (it crossed my mind to try 30 layers like TorToiSe's AR is, instead of 12, but that would REALLY ruin the iteration rate).

Oh well. I'll let it run until Friday before doing more evaluations with it.


I also got around to actually zipping the [dataset](https://huggingface.co/datasets/ecker/valle-aivc) for anyone interested in training it but without a dataset of their own. You just need to extract it into `./training/valle/` (or edit the `data_dir` paths) and run:

```
CUDA_HOME=/opt/cuda/ ./modules/vall-e/scripts/run.sh deepspeed --module vall_e.train yaml="./training/valle/config.yaml"
```

Isn't that just for finetuning a wav2vec2 model?

That's what they are, as far as I can tell (the ASR models, I mean).


Hey man, not sure where to throw this in or whether this is viable. One thing I recently "discovered" is that if you produce a TTS clip, let's say on Tortoise, and then feed that into RVC (https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI), the quality is drastically improved and, dare I say, 11labs-esque.

See these samples :

Tortoise : https://vocaroo.com/1lQcWB4R0Vaw

RVC : https://vocaroo.com/1eHQe0LruRbz

Tortoise : https://vocaroo.com/1kO4tXnTmqYE

RVC : https://vocaroo.com/1b0FmJI905bw

Obvious caveats : You have to train two different models and the average gpu enjoyer will not like that. But can this be done on the "fly"?

Author
Owner

if you produce a TTS clip let say on Tortoise and then feed that into RVC the quality is drastically improved

It has crossed my mind to just have TorToiSe (or the other VALL-E homebrewed model) to generate a base clip and throw it into a VITS (or any speech-to-speech) and see how it fares, but it'd require juggling around more models. I'll keep it in mind.


So I found another training issue that I feel stupid for not really catching: I have let my LRs decay too low, or at least decay too fast. The full-sized model was stagnating at the same AR loss=~3.6 wall, so I played around with the LR and kept it as high as it could go without frying, and that seemed to help get past the barrier. I resumed training the quarter sized model with a similar tactic (with a slightly higher LR, as the quarter sized model can take it for some reason), plus increasing the input prompt duration from 3 seconds to 6 (but having to drop my batch size down), and it's already averaging at an AR loss=~3.45. I'm not sure why I kept forgetting about my decaying LR. 1.0e-4 seems to be the sweet spot for it, which kind of irritates me since it goes against my usual go-to LRs.
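
For what it's worth, "keep the LR pinned" maps onto a plain DeepSpeed config pretty directly. The snippet below is a generic sketch of upstream DeepSpeed's config schema with placeholder batch sizes and optimizer hyperparameters, not the fork's actual yaml; the point is just that omitting the scheduler block leaves the optimizer LR alone.

```
# Generic DeepSpeed config dict (would normally live in ds_config.json);
# sizes and optimizer hyperparameters here are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 16,
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1.0e-4, "betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 0.01},
    },
    # no "scheduler" block: DeepSpeed won't touch the LR, so it stays at 1.0e-4
    # instead of decaying out from under the run
    "fp16": {"enabled": True},  # dynamic loss scaling is configured under this key
}
```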


OK so I have a 2000 hour audiobook dataset compiled. Didn't take that long to gather but uploading it took forever. It's untranscribed still as well.

Use it if you feel like you're not making progress anymore I guess.

Can I DM you the link in some way? Had to use onedrive since it's 120GB, and onedrive puts personal information in your link apparently.

Author
Owner

OK so I have a 2000 hour audiobook dataset compiled. Didn't take that long to gather but uploading it took forever. It's untranscribed still as well.

Sweet. Training seemed to have slowed down quite a bit even on a quarter sized model at a pretty high LR of 1.25e-4, enough that I think it isn't all that viable to keep trying.

Can I DM you the link in some way?

Shoot it over email to mrq@ecker.tech.


Alright, sent you the link.


Greetings! Thank you for your great work; all these comments resonate with my work on Mockingbird (https://github.com/babysor/MockingBird), an open-source Chinese voice cloning project modified from RTVC. I appreciate you taking the time to write down all your progress and meaningful thoughts.

Although I haven't been involved in TTS for over a year, your work has reignited my interest in the field. It's amazing how open-source projects can foster continuous progress by bringing together passionate individuals like us. Thank you again and I look forward to potential collaborations in the future!

BTW, I have a large collection of Chinese voice data on my computer, and I also have over 1000 followers who can contribute more datasets. I would love to collaborate or share resources, whatever can help on this.

Author
Owner

Could have sworn I sent a post here, but I suppose I didn't.

Training is slowly improving over the weekend with a maintained LR of 1.0e-4 on the full-size model; but I don't know if I should keep bothering with it until I get the new additions to the dataset added in.

I did finally get around to listening to the evaluation / validation output yesterday, and it's somewhat solid even at a "low" accuracy of ~70%. Ironically, the quarter-sized model actually sounds better than the full-size model did at the time, but the quarter-sized model has had a ton more iterations into it (and I imagine the larger batch size, and it being able to use a slightly higher LR, are favoring it moreso).

I'll get around to cherry picking the examples, since some were decent and some weren't as decent, between the two models, but it seemed consistently "decent" for the given progress. This week has me quite busy with a bunch of other endeavors overlapping at once.


Alright, sent you the link.

Received. I have it all on my training system, just need to spend some time to get the transcription going.


Mockingbird(https://github.com/babysor/MockingBird), an open-source Chinese cloning project, which was modified from RTVC

Looks pretty nice and robust; unless I'm mistaken, it's just an encoder + mel synthesizer + vocoder? The example output seems pretty decent for what it's worth.

I appreciate you taking the time to write down all your progress and meaningful thoughts.
Although I haven't been involved in TTS for over a year, your work has reignited my interest in the field. It's amazing how open-source projects can foster continuous progress by bringing together passionate individuals like us. Thank you again and I look forward to potential collaborations in the future!

Glad my ramblings managed to make their way out of my own sphere here. What started as a simple batch of QoL improvements for TorToiSe turned into quite the endeavor. I felt a lot of the layman knowledge of it all is either outright nonexistent or woven into papers or implementations. I still don't feel that qualified, but I suppose my understanding is better than nothing.

BTW, I have a large collection of Chinese voice data on my computer, and I also have over 1000 followers who can contribute more datasets. I would love to collaborate or share resources whatever can help on this.

Right. I keep forgetting to toy around with my passing thoughts for getting a "VALL-E X" implementation (which is just annotating with a language token to better hint at which language is inferenced).

I'll keep it in mind whenever I do get around to needing to source more voice data, although who knows when that'll be; I don't expect the experiments in a stapled-on "VALL-E X" implementation are going to be all that fruitful for a while when I get around to it.
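
For the record, the "just annotate with a language token" idea above is about as simple as it sounds; a toy sketch, with made-up token names and no claim that this is how the paper wires it in:

```
# Toy illustration: prepend a per-language symbol to the phoneme sequence so
# the model can condition on which language it's synthesizing.
LANG_TOKENS = {"en": "<en>", "zh": "<zh>", "ja": "<ja>"}

def annotate(phonemes: list[str], lang: str) -> list[str]:
    return [LANG_TOKENS[lang]] + phonemes

print(annotate(["h", "ə", "l", "oʊ"], "en"))  # ['<en>', 'h', 'ə', 'l', 'oʊ']
```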


https://google.github.io/df-conformer/librittsr/

Consider replacing LibriTTS with this if you haven't

Author
Owner

Consider replacing LibriTTS with this if you haven't

Will do.

I'm not sure if I should bother trying to use the faster-whisper-backed WhisperX, since shorter clips don't really benefit from faster transcription times.


There's a relatively new TTS called Balacoon, aimed at low end devices. I tried it out on my desktop and it was faster than RT. I'm not sure to what degree everything is open source, but the dev is claiming 50x inference speed improvement on CPUs and talks about some of the optimizations they made to do so. Maybe there's some insights to glean.

https://balacoon.com/blog/on-device/
https://github.com/balacoon/balacoon.github.io


There's a relatively new TTS called Balacoon, aimed at low end devices. I tried it out on my desktop and it was faster than RT.

How was the quality?

Author
Owner

There's a relatively new TTS called Balacoon, aimed at low end devices
claiming 50x inference speed improvement on CPUs

Neat.

talks about some of the optimizations they made to do so

Seems predicated on very, very small parameter counts, which, given my test the other day (week?) where the quarter-sized model seemingly outperformed the full-size one, I guess is believable.

precompiled wheel and binaries

Ugh.

supported_speakers = tts.get_speakers()

Ugh. Ruined.

There seem to be no samples, but the HuggingFace space (https://huggingface.co/spaces/balacoon/tts) works to generate some of my own (catbox because I can't be assed to upload them to vocaroo, I am very uncreative with test prompts, and this is the one that I tested against ad nauseum for TorToiSe finetunes):

  • https://files.catbox.moe/y4gfu8.wav
  • https://files.catbox.moe/klmidz.wav

For what it's worth, it's decent I suppose. The documentation (https://balacoon.com/use/frontend) seems to suggest it uses an IPA-based approach (good) and you can pretty much coerce the input text to do what you want.
As a plain Jane TTS system, it works. It's not a voice cloner (on the surface, unless it's like Bark), so I don't have any interest in it. I suppose it has its own niche as a lightweight-yet-competent TTS system.


I also forgot to do my weekly assessment. Uhhh...

The loss on the fullsized model is down to AR loss=~3.1 (I saw it dip to 2.9 earlier) and an accuracy wavering between 76% and 80% on average. I suppose my issue before was having too low of an LR, and I suppose DeepSpeed will do loss scaling according to the gradient norm, so it compounded to effectively useless training over time.

I forgot to comb through my samples, last week kept me preoccupied in other endeavors. I would have grabbed more samples the other day (week?), but Xorg seems to not want to work again, as SDDM/Plasma will return a 640x480 screen and no GUI elements, and I keep forgetting to yank out my 6800XT since I remember it would give me grief for not booting it up in a specific configuration, so I just have the normal model being baked until I remember to devote some time to get Xorg cooperating. (I need Xorg because I can't be assed to mount my SMBv1 share and Dolphin has it Just Working when I open the shares).


I was catching up on the thread and wondering if there was a reason for not using the LibriLight dataset until I saw you mention

My only concern is if there's any overlap between it and LibriTTS. I wouldn't want to fuck up my dataset with a sizeable amount of duplicates. I could prune transcriptions with similar strings, but the issue is that LibriTTS is trimmed down, and LibriLight I-believe is one whole piece, so even just relying on the full transcription of a sound file won't do any good. I suppose I could just check for similarities and prune manually, but even then I imagine would be an astronomic task (unless I do embedding vector similarities shit).

This shouldn't be an issue, since both datasets provide speakers' Librivox unique speaker ID. LibriTTS_R has a comprehensive list of the speaker IDs in a file called SPEAKERS.txt, and LibriLight is structured like LibriSpeech, so uses the speaker ID as a directory name:

dataset_name/speakerID/book_name/

It should be a simple matter to prune any duplicate speaker IDs in the LibriLight dataset, and that would at worst add 5500 additional speakers and tens of thousands of hours of audio.
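
A minimal sketch of that pruning, assuming the pipe-separated SPEAKERS.txt layout and the LibriSpeech-style directory structure above (paths and column positions are assumptions, so double-check against the actual files):

```
# Drop any LibriLight speaker directory whose LibriVox speaker ID already
# appears in LibriTTS_R's SPEAKERS.txt.
from pathlib import Path

def libritts_speaker_ids(speakers_txt: Path) -> set[str]:
    ids = set()
    for line in speakers_txt.read_text(encoding="utf-8").splitlines():
        if not line or line.startswith(";"):  # skip header / comment lines
            continue
        ids.add(line.split("|")[0].strip())   # first column is the speaker ID
    return ids

def prune_librilight(librilight_root: Path, known_ids: set[str]) -> list[Path]:
    # layout: dataset_name/speakerID/book_name/ -> keep only unseen speaker IDs
    return [d for d in librilight_root.iterdir()
            if d.is_dir() and d.name not in known_ids]

known = libritts_speaker_ids(Path("LibriTTS_R/SPEAKERS.txt"))
kept = prune_librilight(Path("librilight"), known)
print(f"{len(kept)} LibriLight speakers remain after pruning duplicates")
```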

Author
Owner

This shouldn't be an issue, since both datasets provide speakers' Librivox unique speaker ID.

Oh duh, why didn't I think of that. I can probably make do with merging speakers but not book IDs then.


In other news, I finally got off my ass to unplug my 6800XT to get Xorg working again, so now I can:

  • yank the evaluation / validation output from both the quarter and the full sized model
  • actually extract the 2000 hours of audiobooks for transcribing / slicing / quantizing
  • download and reslice / quantize LibriTTS_R

Hearing the outputs again, it's a bit tough to put my finger on where the issues lie. Yes, the AR itself is definitely flawed, but that's a given since the AR is only responsible for the first residuals. The NAR still sounds pretty accurate, but that's also a given, since it's responsible for 7/8ths of the sound itself.
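
To spell out that split (assuming the 8-codebook EnCodec setup this implementation targets): the AR only emits the first codebook level, the NAR fills in the remaining seven, and the stacked codes are what get decoded back to audio. Shapes and values below are purely illustrative:

```
# Illustrative only: how the AR and NAR outputs combine for decoding.
import torch

T = 300                                    # code frames (~4 s at 75 frames/s)
ar_codes = torch.randint(0, 1024, (1, T))  # level 0, produced by the AR
nar_codes = torch.randint(0, 1024, (7, T)) # levels 1..7, produced by the NAR

full_codes = torch.cat([ar_codes, nar_codes], dim=0)  # [8, T] fed to the EnCodec decoder
assert full_codes.shape == (8, T)
```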

Fullsize Samples:

  • Patrick Bateman (American Psycho): [Reference](https://files.catbox.moe/xqobmw.wav) / [AR](https://files.catbox.moe/13kgws.wav) / [NAR](https://files.catbox.moe/xtaee0.wav) / [AR+NAR](https://files.catbox.moe/m1jeoz.wav)
  • Elizabeth (Persona 3): [Reference](https://files.catbox.moe/7k8ymq.wav) / [AR](https://files.catbox.moe/fy7dzf.wav) / [NAR](https://files.catbox.moe/lbl2jf.wav) / [AR+NAR](https://files.catbox.moe/1a1axg.wav)

Quartersize Samples (sounds like poop desu compared to the fullsize now):

  • Scientist (Half-Life): [Reference](https://files.catbox.moe/jitmbb.wav) / [AR](https://files.catbox.moe/qn9toi.wav) / [NAR](https://files.catbox.moe/p2cicp.wav) / [AR+NAR](https://files.catbox.moe/o16viy.wav)
  • Takaya (Persona 3): [Reference](https://files.catbox.moe/8ql95f.wav) / [AR](https://files.catbox.moe/gm1nn2.wav) / [NAR](https://files.catbox.moe/sokowd.wav) / [AR+NAR](https://files.catbox.moe/ls6ab0.wav)

The validation also seems to change a little dramatically given the input prompt fed to it, so that's a little concerning:

  • Narrator (Stanley Parable): [Reference](https://files.catbox.moe/3hyvao.wav) / [26500](https://files.catbox.moe/m3522v.wav) / [26250](https://files.catbox.moe/x1aqhe.wav) / [26000](https://files.catbox.moe/0tunv3.wav)
  • This isn't necessarily from there being that "much" of a change between these intervals; I think it's just that much up to randomness.
  • there was also a validation output where it said something [quite spicy](https://files.catbox.moe/eknfub.wav) only in the NAR, but not in the [AR+NAR](https://files.catbox.moe/p4f86b.wav) (and every other attempt couldn't say either word right) ([Reference](https://files.catbox.moe/v9z32n.wav) / [Prompt](https://files.catbox.moe/yjtuev.wav))
    • this also highlights that the validation output, while mostly consistent in forming words, doesn't quite capture the right prosody

I don't know, I kind of feel these are a bit of a regression from before I botched the model with train-other-500 (naturally, since that model's loss was pretty low / the accuracy was pretty high), but its validation output does sound better in that it's forming actual words at the very least.

The one thing I can't recall if I ever mentioned is what I prefer about VALL-E over anything else that uses the magic cope of "representing" traits of a voice: VALL-E will learn every acoustic trait of the voice it's trained against. This is evident when I'm training against old crusty voices like those from Half-Life or SHODAN from System Shock (which I actually need to find output for), while stuff like TorToiSe or Bark will utterly fail because it's not able to capture the raw acoustics of the voice. It's probably why I'm trying to make it work, since there's no other TTS that actually does this.


Transcribing the 2000 hours has begun, since whisperX's v3 (with faster-whisper) just works now. I've updated the repo with the "fixes" (dropping code) so you can use the batch size without needing an HF token now.
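
For anyone following along, this is roughly how whisperX v3 with the faster-whisper backend gets driven; a sketch from memory rather than the repo's actual transcription script, and signatures may differ between whisperX releases:

```
# Rough sketch of whisperX v3 usage; the input path is a placeholder.
import whisperx

device = "cuda"
audio_file = "audiobook.mp3"

model = whisperx.load_model("large-v2", device, compute_type="float16")  # faster-whisper backend
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)  # batching no longer needs an HF token

# the alignment pass is what produces the word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```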

Author
Owner

Training has resumed after a few days I spent to transcribe the audiobooks + re-quantize the LibriTTS_R dataset. I think I would have been a day ahead of schedule, but I had to reslice-requantize the audiobooks since my +0.02s end offset wasn't actually proper with the new whisperX. Updating to the new whisperX with the faster-whisper backend seemed really nice for the audiobooks since they're one giant file, so I was able to reap the gains of the bigger batch size. What wasn't so nice was hotfixing the web UI to play nice with them being MP3s first, and then scrounging to free up as much space as possible on the training machine's SSD. My COPIUM is that the full LibriLight dataset will be a bitch to prepare and will eat up a month just to go through it all, if I grew a wild hair and picked at it.

I think unironically, quantizing the audio through Encodec on my 4070Ti is much slower than quantizing off my 6800XT. It was chugging a little more than usual.

WhisperX emitting word-level transcriptions now might give me a wild hair to change how I want datasets prepared: instead, just have the main audio quantized and pick out slices procedurally, since all the timestamps would be there. But iunno, I haven't had any issues with the current method.
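
A rough sketch of what that procedural slicing could look like, assuming EnCodec's 24 kHz model (75 code frames per second) and whisperX-style segment timestamps; the helper names are mine, not the fork's:

```
# Quantize the whole file once, then cut utterances out of the code matrix
# using the transcription timestamps instead of re-encoding each slice.
import torch

FRAMES_PER_SECOND = 75  # EnCodec 24 kHz code frame rate

def slice_codes(codes: torch.Tensor, start_s: float, end_s: float) -> torch.Tensor:
    """codes: [n_codebooks, T] EnCodec codes for the entire source file."""
    start = int(start_s * FRAMES_PER_SECOND)
    end = int(end_s * FRAMES_PER_SECOND)
    return codes[:, start:end]

# e.g. one utterance per whisperX segment:
# utterances = [slice_codes(codes, seg["start"], seg["end"]) for seg in result["segments"]]
```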

New dataset metrics:

  • 2502406 samples (restricting to a min phoneme length 4, max phoneme length 96).
  • 8180094 seconds / 2272 hours.
  • 2868 speakers.
  • a full epoch to eat through for the quarter-sized models seems to come in at an ETA of 17 hours (rough arithmetic below).
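
Rough arithmetic behind that ETA, assuming the 64 × 16 effective batch size mentioned further down:

```
# Back-of-the-envelope check of the 17-hour epoch estimate.
samples = 2_502_406
effective_batch = 64 * 16                    # batch size x gradient accumulation
iters_per_epoch = samples / effective_batch  # ~2444 iterations
seconds_per_iter = 17 * 3600 / iters_per_epoch
print(round(iters_per_epoch), round(seconds_per_iter, 1))  # ~2444 it/epoch, ~25 s/it
```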

At it=195 already (195 * 64 * 16 = 199680 samples, LR=0.001 even though I had it set to something lower) for the quarter sized model: ![image](/attachments/46e51aae-ab8b-484e-9c62-98691cdcf111)

I'm pretty pleased with it already. However, it takes forever to initialize, so I'll probably need to rewrite / fix up the initial data loader. I also came back to it an hour later and the iteration rate had dropped to 400s/it, but I should probably just restart the machine, since it was giving me hell as I worked through the dataset preparation process.

One thing I did (finally) make note of is how the loss scaling relates to the gradient norm. Usually every other attempt would have the high LR fry the model quickly, but I guess the huge crux of the inconsistent training has been the loss scaling either saving me or biting me. I'll need to keep an eye on how it fares without an LR scheduler.

I think I should also spend some time playing with the weights I cooked up for both the quarter and full size models. I know they aren't *that* close to being perfect, but they're at a great spot to fiddle around with them and their zero-shotting (the validation doesn't seem all that degraded compared to the training set), and to also finetune them to see how they fare.


There's a relatively new TTS called Balacoon, aimed at low end devices. I tried it out on my desktop and it was faster than RT.

How was the quality?

The quality is fine. It comes with Mozilla TTS voices, but it's not tortoise-tts level intonation (think Google Assistant). However, I feel the biggest selling point is that it can produce 4 minutes of audio in <30 secs ON A CPU (in my experience).

I think some interesting applications would be using it to quickly prepare large amounts of audio for voice-to-voice conversion, or (because it imports voice models from TTS) using HQ voice models from tortoise to create a corpus, training a model on that, inserting it into an imported library, and leveraging the faster inference for longer tasks.

Author
Owner

https://github.com/descriptinc/descript-audio-codec

can do 44.1KHz at 8kbps of bandwidth

Sob. I JUST quantized everything.


If I grow a wild hair and get a hankering to, I guess I'll have to overhaul the entire data loader process, something like:

  • further pre-process everything into an .hdf5 (*yuck*), or just per-speaker JSONs (see the sketch after this list), for:
    • making the symmap ahead of time rather than compiling it at runtime (this requires reading EVERY phoneme file first, which eats up a bunch of time on large datasets); I just fried a model because I kept changing the min/max phoneme lengths and didn't store the old symmap first.
    • store the phoneme lengths so it's easier to filter against them, rather than computing them at runtime
    • store quantized audio lengths rather than having to compute them at runtime (which takes forever when I need the metric)
    • store the phonemized text properly (rather than relying on delimiting by spaces) alongside the original text (and if I'm crafty I can keep the sliced times per-phoneme, and do some snazzy slicing for generating input prompts)
    • some other buzzwords
  • muck around with descript-audio-codec to make it work
    • it alleges that it's a drop-in replacement for Encodec, but it seems it requires metadata (*yuck*!) to function, which I can't really provide when decoding audio generated from the AR/NAR.
      • I can probably hack together a workaround, but that's gross.
    • primarily need to see if I can just use 24KHz instead, since all of my non-donated-audiobook audio is at 24KHz anyways, so there's no point in upsampling (unless I was running it through voicefixer, which has issues). 24KHz is good enough for labbing.
      • the sample page has a bunch of samples provided at varying bitrates, but there doesn't seem to be an analogue
    • there's allegedly other LM-based audio codec models like one that boasts using 4 RVQ bins instead of 8, which sounds pretty spicy, but I don't know where the line between "it benefits the AR because layer 1 is more impactful" and "it benefits the NAR since it generates less bins" is drawn.
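
For the record, a minimal sketch of what that pre-processing pass could look like, assuming per-sample `.phn.txt` / `.qnt.pt` pairs on disk and h5py for the container; all paths and key names here are hypothetical, not what the fork actually does:

```python
import json
from pathlib import Path

import h5py
import torch

# hypothetical layout: ./training/data/{speaker}/{utterance}.phn.txt + .qnt.pt
DATASET_DIR = Path("./training/data")
OUT_PATH = "dataset.hdf5"

symmap = {"<s>": 1, "</s>": 2}  # built once, ahead of time, and stored with the data

with h5py.File(OUT_PATH, "w") as hf:
    for phn_path in sorted(DATASET_DIR.rglob("*.phn.txt")):
        phonemes = phn_path.read_text().split(" ")
        for p in phonemes:
            symmap.setdefault(p, len(symmap) + 1)

        qnt_path = phn_path.with_name(phn_path.name.replace(".phn.txt", ".qnt.pt"))
        codes = torch.load(qnt_path)  # assumed shape [n_rvq, T]

        group = hf.create_group(f"{phn_path.parent.name}/{qnt_path.stem}")
        group.create_dataset("text", data=[symmap[p] for p in phonemes])
        group.create_dataset("audio", data=codes.numpy(), compression="gzip")
        # lengths live in metadata so filtering never has to touch the payload
        group.attrs["phoneme_length"] = len(phonemes)
        group.attrs["code_length"] = int(codes.shape[-1])

    hf.attrs["symmap"] = json.dumps(symmap)
```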

This is all, also, hoping that it would solve my newly-emergent instability problem in training. Training is a pain in my ass now with the now-4x'd dataset: it'll either, pretty often, randomly hang and slow to a crawl per iteration (I undid my "move the batch off the GPU after a forward but before the backwards pass" cope optimization), or outright kill itself without giving an error (I check htop and my system RAM usage is "fine", but I wouldn't be surprised if it ends up triggering OOM killers).

Author
Owner

Training the quarter sized model with the new dataset has stabilized. I don't know whether it's:

  • skipping "validating" the phonemes, which requires loading every file several times and some things probably are still sitting in memory. I doubt it, but it's a thing I did change.
  • commenting out the "VRAM optimization" where I'm moving the batch out of VRAM and into main RAM after a forward pass but before the backwards pass (see the sketch after this list), since the biggest issue is OOMing during a backwards pass. This is most likely the issue: *maybe* there's a quirk in PyTorch's data loader that keeps a lot more loaded in memory for a really, really big dataset, creating an emergent, oddly high memory pressure scenario that either causes very, very slow iteration rates from shuffling between VRAM and RAM, or triggers OOM killers and kills the process.
  • reducing my batch size from 64 to 32. As this ***really*** hurts the throughput, I only did it to try to make training stable. I'm not *too* sure how much it actually saves purely on system RAM pressure, but it'll help with VRAM pressure, since there's no validation at initialization time now that I'm not pre-checking the phonemes and culling anything over an arbitrary length of 100 phonemes (which, ironically, the dataset prep for DLAS would account for in a way).
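
For clarity, a schematic of what that (now-removed) optimization looked like in a generic training step; this is a hedged sketch assuming the model returns its loss directly, not the fork's actual code:

```python
import torch

def train_step(model, batch, optimizer, offload_between_passes=False):
    """One generic training step; `offload_between_passes` is the removed
    "cope optimization" (shuffling the batch back to system RAM between the
    forward and backwards passes)."""
    batch = {k: v.to("cuda", non_blocking=True) for k, v in batch.items()}
    loss = model(**batch)

    if offload_between_passes:
        # intended to free a little VRAM before the backwards pass, but in
        # practice it churns the bus and piles onto system RAM pressure
        batch = {k: v.to("cpu") for k, v in batch.items()}

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```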

I suppose since it's fixed, I don't immediately have to work on cleaning up how the dataset is handled.

In any case, at 1000 iterations, 1024000 samples processed, epoch ~0.32, bs=32 ga=32, the model is averaging an AR loss of 3.9 and an AR accuracy of 67% (not bothering with reporting the NAR metrics since it's ***always*** backwards). Compared to the same point in time before I added in the donated audiobooks, this is quite impressive. Playing by ear from the validation output, it's semi-decent at replicating the raw acoustics, but language is still subpar; that's a given, though, since it's nowhere near enough time invested into it. I can at least sleep knowing it's not gonna crash and burn and get into a loop of "uh oh :))) the port for NCCL is already taken :))))) the OOM killer didn't actually gracefully close python :))))))))" and never resolve itself.

I'm still a bit scared that I'm forgetting something. The reported LR is still at 0.001 despite that number not showing up anywhere, since I thought I explicitly set my LR to 1.25e-4 or something. DeepSpeed >=0.9.2 has the reported gradient norm broken and reports 0.0, and I remember that the last time it reported 0.0, the model wasn't learning (despite that being my fuckup), so it's a bit *triggering*. And lastly, I feel like my understanding of the loss scaling is a bit off, since that only actually seems to apply to fp16 training and not bfloat16, but it seems to be doing some kind of loss scaling regardless. Oh well.
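
For reference, loss scaling is configured under DeepSpeed's `fp16` block, and the `bf16` block has no scaling knobs at all, which lines up with the confusion above. A hedged sketch of the relevant config pieces (the values are DeepSpeed's documented defaults, not necessarily what the fork uses):

```python
# Dynamic loss scaling only exists under the fp16 block; loss_scale=0 means
# "dynamic". The bf16 block has no scaling options, so any scale reported
# while training in bfloat16 is cosmetic / comes from elsewhere.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 32,
    "fp16": {
        "enabled": False,
        "loss_scale": 0,              # 0 = dynamic loss scaling
        "initial_scale_power": 16,    # starting scale of 2**16
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "bf16": {
        "enabled": True,              # no loss-scaling knobs here
    },
}
```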


In slightly other news, I played around with [descript-audio-codec](https://github.com/descriptinc/descript-audio-codec/) just to compare the raw output to Encodec and... I'm disappointed. While I can actually set the target sample rate with just `model.sample_rate = 24_000` to reduce the length of the output (I'm guessing the model really is multi-modal and can support arbitrary sample rates, unlike Encodec, which I think has specific models for specific sample rates), it still outputs a much larger sequence than Encodec. Given this [sample](https://vocaroo.com/1b1xD8AN2TWg), Encodec yields a sequence of 8x831 codes, while DAC yields a sequence of 9x885 codes. The extra RVQ layer is a bit of a pain in the ass, since I would have to add a small bit of logic in the NAR model (not that hard, just a bit of a pill), and the extra layer just makes it drastically larger compared to Encodec. Bummer. I was kind of thinking it'd be neat to slot out Encodec for `[shiny new toy]`, but oh well. Maybe in the future when I do actually scale this up to use full-blown 44.1KHz audio as the source, it might eke out a win in comparison to Encodec, but for now I'll shelve it in my mind.
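
For reference, the Encodec side of that comparison comes out of the standard Encodec usage below (the wav path is a placeholder), yielding codes shaped `[n_q, T]`, i.e. the 8x831 above:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps => 8 RVQ codebooks

wav, sr = torchaudio.load("sample.wav")                      # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # [1, n_q, T]
print(codes.shape[1:])  # e.g. torch.Size([8, 831]) for the clip above
```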

I am curious as to how HiFi-Codec fares, as it boasts only 4 RVQ layers over 8, but I'm sure there's some monkey paw of "uh oh, ackshually, this makes our code sequence much much larger :)))))))". Also, the fact that there seems to be only one repo with weights for it is a little goncering, and the inferencing script is quite kludgy to look at.

I suppose it's a relief that I don't have to bend over backwards to convert over to implement [epic new meme repo] and burn a week re-quantizing everything and cobbling together code.


Regardless, I'll just leave the quarter sized model to cook until it reaches a decent point. Maybe by Sunday I'll evaluate where it's at and do a progress report, and then move onto cooking a full size model and seeing how it fares with the new dataset. I just hope 2272 hours is "good enough" and the model will be very usable, especially from training on stubborn gonsumer hardware.


Babe wake up, another TTS system just dropped for mrq to look at:
https://ai.facebook.com/blog/voicebox-generative-ai-model-speech/

It's not auto-regressive and uses something called 'flow matching'!

There seems to be no weights or even source code yet but the paper seems to go in-depth enough for some guy to replicate it in [Time Frame].

To my ears it sounds very flat... but that might be because it's just audiobook'd.

Author
Owner

It being very, very general purpose for being able to do a variety of inputs + infilling seems very, very promising solely in the new model architecture (getting a bit tired of the typical transformer approach). It pretty much combines everything you could possibly need: pure text-based synthesis, reference-based synthesis, speech-to-speech style transfers of VITS/RVC/what-have-you, infilling even. But...

> There seems to be no weights or even source code yet

Yeah, it's expected that the one thing Zucc won't share is their voice synthesis, because of "muh ethics", especially with this being the holy grail of voice synthesis strictly from its capabilities. Unless there's another leak of the same caliber as the LLaMa one, this isn't ever getting into the layman's hands.

> but the paper seems to go in-depth enough for some guy to replicate it in [Time Frame].

True, true. VALL-E had two implementations spawn from it solely from the paper. Although...

  • only lackluster model weights came from the newer one, and I effectively killed a few months of training time on being stubborn.
  • both have their warts (not to discredit either; I'm NOT an expert in anything relating to the actual underlying model architecture, I'm just a decent enough programmer to try and make things work).
  • they're still pretty much based on existing model arch, while this is [brand new shiny thing].
  • I don't think anyone that is competent enough to do it is willing to actually do it because of "muh ethics".

> To my ears it sounds very flat... but that might be because it's just audiobook'd.

mmm.

I'll cope with saying it's most likely just from terrible quality audio / stilted reference clips used.

  • The pure text-to-speech synthesis sounds *fine*, and I can write off the weird oddities the same way pure random voices in TorToiSe have their quirks.
  • the narrator not being a native American-English speaker would lend some credence to the stiltedness / flatness. If anything, it also sounds like there's some post-processing added to it, like it's smoothing out the highs/lows and unintentionally narrowing the range. I don't necessarily notice it in the examples provided in the demo video.
  • all the other stuff that relies on an existing voice clip has *almost* the same acoustics as the reference clip, so it'll still sound like the mid-2000s-Source-engine-voice-chat-tier mic audio it's fed.

The [demo](https://voicebox.metademolab.com/) seems nicer than the blogpost since it actually provides the clips unimpeded. I did have a bit of a laugh when I heard the reference clip for the main zero-shot demo (the reference clip has that annoying artifact VoiceFixer leaves at the end of a clip, the loud sharp noise). Just from listening to a few of the [zero-shot examples](https://voicebox.metademolab.com/zs_tts.html), I'm ***very*** pleased, strictly because it's *close* to matching the input acoustics of the audio, the big thing I'm autistic about with VALL-E. This also means that it doesn't necessarily fall for the annoying meme of TorToiSe / Bark where it relies on some intermediary to represent traits of a voice (voice latents / semantic tokens, etc).

I'll have to snoop through the paper for Voicebox itself (the paper for Flow Matching is waaaaaaaaaay over my head to be desu). I'm not too sure how much I can borrow, as I am not an expert, but it could give me some more insight in case I missed something with fixing up the fork.

The 50k multilingual hours it mentions it being trained on makes me think I probably don't even really need to bother much with annotating for specific languages when I add in more to my dataset, but I'll have to comb over the paper to actually see if it does. It should be able to leverage the existing English audio and only really provide enough multilingual data to offer better references for accents and non-language-but-lingual traits, I suppose.

Overall, I'm very pleased with the cherrypicked demo. I wouldn't say I'm hyped for it, since I know we plebs aren't ever getting our hands on it, even with their cope that they can classify audio generated with it, but it's the step in the right direction (in my opinion) with voice synthesis.


Reading the paper:

> This paper presents Voicebox, the most versatile text-conditioned speech generative model at scale. Voicebox is trained on a text-guided speech infilling task, where the goal is to generate masked speech given its surrounding audio and text transcript. This can be considered as a guided in-context learning problem, where audio style is inferred from the audio context and textual content is specified through transcript. Voicebox does not require any audio style labels (e.g., speaker, emotion, and noise), which differentiates Voicebox from the majority of prior work where such labels are used extensively. Prior work uses labels to make the mapping between input (text and audio style) and output (speech) more deterministic to reduce underfitting [Wang et al., 2021, Popov et al., 2021]. We show that Voicebox’s text-guided speech infilling approach is much more scalable in terms of data while subsuming many common speech generative tasks.

Ahhhh, there's the rub. It's entirely a model for infilling, and every other feature is after-the-fact ("emergent"). So it's about as raw as VALL-E in the sense there's only the input text as a label, but utilizing only the NAR and new model arch to accomplish infilling. Interesting.

> It does not use any style labels, pre-trained embedders, or multilingual samples.

Alright, neat-o. I wonder what shit VALL-E X was needing in the first place to accomplish multilingual-ness outside of a ton more data. I suppose it wouldn't hurt for me to get the Japanese phonemizer back up and dump my Japanese dataset at it then with zero changes to the code (so I won't even need to bother with a language marker token after all).

> Audio x is represented as an 80-dimensional log Mel spectrogram
> Audio is represented as a 80-dimensional log Mel spectrogram and a HiFi-GAN vocoder

I'm sobbing. It was too good to be true... stuck using mel spectrograms...

Jokes aside, I don't think it's an issue at all though. The paper mentions it can very well slot it out for Encodec when it was dick comparing against VALL-E, I think.

> We adapt the HiFi-GAN V1 configuration to generate 16kHz audio from 80 dimensional log Mel spectral features sampled at 100Hz

And it's shit. At least, the demos are. I didn't think to check the actual sample rate of the demos, but it checks out: they're 16KHz. Another funny note: if you poke in with Inspect Element, you can see some of the provided outputs are tagged like `https://dl.fbaipublicfiles.com/voicebox/webpage/public/assets/orig/audios/zstts_shortlisted/valle/hyp_voc_chunk/5639-40744-0020_0.wav`. Now, I don't know if they're just calling it VALL-E, but it activates my almonds a bit. There are some others tagged like `https://dl.fbaipublicfiles.com/voicebox/webpage/public/assets/orig/audios/zstts_shortlisted/extra/hyp_voc_chunk/21_0_0.wav` instead, so who knows.

I'm sure you can easily slot out the vocoder, like you can with TorToiSe for BigVGAN, so I don't know why they elected to use their own flavor of HiFi-GAN instead. I hope any unofficial implementation that does rise up catches this. Although, I wonder if that means the training input *is* restricted to 16KHz. I'll have to cross-reference with what TorToiSe has its mel representations capped at, but it's a bit grim if that's the case. Then again, that only really matters for existing weights; I'm sure it'll be easy to slot in higher bandwidth audio for homebrewed weights.
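
For a sense of scale, the representation the paper describes (80 mel bins at a 100Hz frame rate over 16KHz audio) works out to a hop length of 160 samples; here's a hedged sketch with torchaudio, where the FFT/window sizes are my own assumptions since the paper's exact values aren't quoted here:

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000      # per the quoted HiFi-GAN V1 setup
FRAME_RATE = 100          # frames per second
HOP_LENGTH = SAMPLE_RATE // FRAME_RATE  # = 160 samples

# n_fft / win_length are assumptions; the paper only pins down n_mels and the frame rate
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    win_length=1024,
    hop_length=HOP_LENGTH,
    n_mels=80,
)

wav = torch.randn(1, SAMPLE_RATE * 3)          # stand-in for 3 seconds of 16KHz audio
log_mel = torch.log(mel(wav).clamp(min=1e-5))  # [1, 80, ~300] -> 100 frames per second
```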

Author
Owner

Progress report:
![image](/attachments/5269ead1-bad8-4476-8567-dd734ec7a73e)

  • ~~2200+~~ 3400+ hours is definitely having the loss go down faster than with the 550+ hour dataset, obviously.
    • actually this dataset is clocked at 3472 hours with the phoneme length range being between [4,192]. Oops.
    • although I don't know if this is also due to using a very high mysterious LR of 1.0e-3.
  • it's a bitch to eat through though
    • initialization takes a while. I could disable the dataset validation part where it only allows in data with phonemes between a specific range, but then I need to drop my batch size.
    • with the quarter sized model, crunching through an epoch is estimated at 26 hours.
    • with the full sized model, crunching through an epoch is estimated at 120 hours......
    • I could definitely shrink the phoneme length range down to 100 again and bump back up my batch sizes, as I only did this mostly because of the next point.
  • training is very unstable. I think I keep triggering OOM killers.
    • Training will randomly hang, or slow to a crawl, or the training process AND the web UI I forgot I left open will both terminate.
    • I actually threw 64 more GiBs of system RAM at my training machine (bumping it up to a 96GiB total), so maybe I won't actually trigger any OOM killers. But we'll see.
  • I grew a wild hair and added in support for using an HDF5-based dataset. I added a little routine that should "convert" from the old "use a bunch of defined paths and iterate through a list of quantized audios + phonemized text files" to throw it into an HDF5 file with metadata to ease things.
    • ...this actually doesn't seem to make a difference in initialization times at very large datasets. Traversing through the HDF5 dataset still takes time.
    • However, I feel that it makes the dataloaders a little faster. Only a little.
    • ...However, this has about an overhead of ~12GiB (on disk, my entire dataset is 56GiB, and the HDF5 file is 64GiB, without compression. I might be better off enabling compression, since decompression algos should be pretty quick, and your bottleneck is always going to be disk IO bandwidth).
      • I sure hope so, since I remember just gunzipping the 550+ hour dataset on disk only eked out ~3GiB.
  • The mysterious LR being set to 1.0e-3 worked for a bit, I think until iteration 4700 / epoch 1.5, then the loss spiked. It would eventually go down, and I could have kept reloading from the last good-er checkpoint to reroll the dice, but I carelessly dropped it to 1.0e-4, which is *so slow* now.
    • I also don't know if momentum has anything to factor into it. I have to use the previous LR scheduler defines (which in reality aren't all that necessary) to explicitly control the momentum, but I don't know what the momentum is by default without a scheduler. Logging reports mom=(0.9,0.999), but I'm not sure which one it's actually using (see the sketch after this list).
  • the audio seems kind of worse compared to how it was around iteration 2000 when I last ripped out the evaluation / validation outputs. Back then it would have crackles but at least sound close-ish-ish to the reference; the current output at iteration 7500 (accuracy ~76%) sounds cleaner with fewer crackles (still some, but not as bad), but its speaker matching seems worse off.
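
On the momentum / LR question: in a DeepSpeed config both live under the optimizer params, and the mom=(0.9,0.999) in the logs is almost certainly just Adam's betas rather than an SGD-style momentum. A hedged sketch (the optimizer type and values here are assumptions, not what the fork actually ships):

```python
# In DeepSpeed, the LR and the (0.9, 0.999) pair both live in the optimizer
# params; "mom" in the logs is just Adam's betas, not an SGD momentum.
ds_config = {
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1.25e-4,          # the LR I thought I set
            "betas": [0.9, 0.999],  # what gets reported as mom=(0.9,0.999)
            "eps": 1e-8,
            "weight_decay": 0.01,
        },
    },
    # no "scheduler" block => the LR stays wherever the optimizer defines it
}
```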

I'm a bit scatterbrained though, since I feel a little uneasy from how much shuffling around I have to do. I don't think I should keep at the quarter sized model; I would get better results fondling with baking a full sized model again. If training is stabilized with moar RAM, then I suppose I'll pivot back to it.


There's a new neural vocoder that might be worth checking out called 'Vocos'. It was made for bark TTS, and sounds like an improvement to bare EnCodec. The demo doesn't compare it to any other neural vocoders, but it performed very well even at 1.5 kbps. It says it reconstructs audio from EnCodec tokens, so it might be worth checking it out.

https://github.com/charactr-platform/vocos

Author
Owner

> There's a new neural vocoder that might be worth checking out called 'Vocos'. It was made for bark TTS, and sounds like an improvement to bare EnCodec. The demo doesn't compare it to any other neural vocoders, but it performed very well even at 1.5 kbps. It says it reconstructs audio from EnCodec tokens, so it might be worth checking it out.

Neato, this pretty much addresses a concern I had with being Encodec-pilled: with TorToiSe it was easy to add audio quality uplifts by slotting out the vocoder, while I imagined there wasn't going to be that easy of a gain here, since the quantized audio is pretty much frozen with the model used to encode it. If it really is that seamless to slot out an Encodec-compatible decoder, then that slight issue is solved.

> Copy-synthesis from a file: It extracts and quantizes features with EnCodec, then reconstructs them with Vocos in a single forward pass.

Interdasting. I wonder if this means it either:

  • reduces the time needed to quantize audio (my throughput seemed to have degraded to quantizing 90 samples/s).
  • encodes audio to its own codebook with reduced bandwidth (the killer of training is how big even quantized audio is). But I wouldn't put my hopes in with this.

> [The demo](https://charactr-platform.github.io/vocos/)

Christ, that's actually great. The difference at 1.5kbps between it and Encodec is night and day and makes it rather viable, but even at 3kbps I can't really hear a difference between it and the higher bandwidths. 12kbps still isn't exactly 100% as crisp as the source, but it's practically imperceptible.

I ***might*** actually consider quantizing at 1.5kbps as a theoretical 4~16x throughput increase (this would increase my batch sizes AND reduce the time it takes to even process them). Vocos at 1.5kbps is already better than Encodec at 6kbps, so regardless it's an improvement. However, that would mean I would have to requantize *everything*, and that's a bit of a pain.

> doesn't compare it to any other neural vocoders

desu I don't really need that; it lists the mel-based ones (which I guess would be nice to backport into TorToiSe), but they don't carry any weight here, and the RVQ-based Encodec alternatives have caveats that pretty much make them hard to go with (like the one I mentioned that ended up having bigger sequences, and the fabled 4-RVQ-binned one not being easy to use).

I'm very pleased with the demo, but...


It'll need to wait until *probably* Sunday, when it's crunched through one epoch. ![image](/attachments/3cbc7cea-d28e-4bd3-ad9b-51966cd2ae3b)

![image](/attachments/b2bb282b-9c1e-4b75-9854-158fb59a3d05)

The ~3400+ hour-dataset full-sized-model training run seems to be fruitful after throwing moar RAM into the system, so much so that I feel really silly for not doing it sooner. htop always reported there being enough wiggle room, so it never crossed my mind that it was a RAM issue until earlier. I still can't start an Xorg session since it might cause it to OOM in CUDA land, but if it hits 80% accuracy by epoch 1, then that's a great sign.

When it hits one epoch, then I'll do my actual evaluation with Vocos, and if it works out, I'll requantize my audio to 1.5kbps and see how much more of an uplift I can squeeze out for training.

Author
Owner

The first epoch has concluded for the fullsize model. I'm not *too* pleased, since training definitely petered off, but at least it's still going down. ![image](/attachments/64430d14-fdd4-4dd1-863b-615f69515a03)

I've crammed Vocos into my fork to be used for decoding. It was a bit tricky since it wasn't so keen on how it wanted the tensor shaped out. It works, but I haven't done much apples-to-apples testing outside of making sure it did decode.
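
For anyone following along, the shape it wants is roughly a `[n_q, T]` tensor of Encodec codes, converted to features first and then decoded with a bandwidth index. A hedged sketch based on Vocos's Encodec example, so treat the exact helper names as assumptions rather than what the fork does:

```python
import torch
from vocos import Vocos

# hedged sketch following Vocos's Encodec usage; helper names are assumptions
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

# codes from the AR+NAR, shaped [n_q, T] (8 codebooks here => 6 kbps)
codes = torch.randint(low=0, high=1024, size=(8, 200))

features = vocos.codes_to_features(codes)
bandwidth_id = torch.tensor([2])   # index into [1.5, 3.0, 6.0, 12.0] kbps
wav = vocos.decode(features, bandwidth_id=bandwidth_id)  # waveform @ 24KHz
```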

I also was sniffing around seeing how easy it would be to "switch" to 1.5kbps bandwidth for the quantized tokens (for reasons mentioned before), and it turns out (obvious in hindsight):

  • bandwidth levels correlate to more RVQ-bin levels, so 1.5kbps is 2 bins, 3kbps is 4 bins, 6kbps is 8 bins, yadda.
  • quantizing audio at a higher "level" means the lower bins are the same as for lower bandwidths. In other words, because I quantized at 6kbps / 8 RVQ bins, I can effectively reuse them for 1.5kbps / 2 bins, meaning I do not actually need to requantize my audio (phew); see the sketch after this list.
  • I can just set the NAR's `n_resp_levels` to whatever bandwidth target I want in the `Config` YAML class, and *in theory*, going down levels should work for the same model. I hope.
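
A quick sanity check of that reuse claim, assuming the quantized files store codes shaped `[n_q, T]` (the path and shapes are placeholders):

```python
import torch

N_RESP_LEVELS = 2                      # target: 1.5 kbps => first 2 RVQ bins

codes_8bin = torch.load("utterance.qnt.pt")   # placeholder path, shape [8, T] from 6 kbps
codes_2bin = codes_8bin[:N_RESP_LEVELS, :]    # RVQ is residual, so the lower bins are identical

assert codes_2bin.shape[0] == N_RESP_LEVELS
# the AR keeps using level 0; the NAR now only has to predict level 1
```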

I'll see about mucking around with reducing `n_resp_levels` to suit 1.5kbps / 2 RVQ bins for my hopeful performance gains in training (and in turn inferencing, although in reality inference speeds aren't too terrible; training against HUGE data is the issue). Training for 120 hours is a bitch and I really need to try and find gains, so I'm crossing my fingers that Vocos's claims hold up and decoding at 1.5kbps really is at parity with 6kbps base Encodec.

Author
Owner

I'm really not too sure what to make of these results. There's some things that make sense in hindsight, but there's a lot of weird shit with this new run for the quarter-sized model at 1.5kbps / 2 RVQ-bins:

  • the loss / accuracy for the NAR is much closer to the AR.
  • this, for some reason, also benefitted the training "progression" of the AR, even though it should only benefit the NAR, which now has to handle fewer RVQ levels (from 7 down to 1).
    • this might be because the input prompts are only for 2 levels for it to muck around with, rather than 8. I don't know if there's any biasing towards higher levels.
  • the de-facto iteration throughput is like, 3.5x? I could easily increase my batch size by 2x, and the actual iteration throughput is like 2x'd too (it's currently about 1.3it/s peak).
    • the throughput in reality, though, only dropped from 26 hours to crunch an epoch down to 20 hours. I could have fucked something up with keeping the dataset "apples to apples" though. I'll need to validate the dataset size again.
  • I also forgot to switch the evaluation / validation process to use Vocos, since I had it disabled when doing a last-minute-at-night-launch of the training session, so their losses might be different than the reality.

![image](/attachments/9502b134-df54-45c8-aab1-1aa0d880996c)

I once again need to check the actual raw output though. I don't want to start an Xorg session mid-epoch and breaking things, and also I don't think the AR's accuracy is high enough for decent audio at the moment. However, the NAR being better than it has been recently has some hopes.

I'm not too sure what would be the best course of action, as this test pretty much invalidated the full-sized 1 epoch training I just did that burnt 120+ hours of training or so. I'm very sure this is the part where I usually say a bunch of shit that either I forget or are just wrong, or talk about things no one really cares about, so I don't think I'll make more comments about it until the training continues with the quarter sized and if Vocos will help bolster things. I'm kinda having difficulties piecing my findings here into something that makes sense, even for me, since I don't even really remember my expectations of it.


Next time you start an Xorg session, can you post some example audio? The last audio from June 6th didn't have the ~3400+ hour dataset, nor did it have Vocos, and I'm curious as to how much of an effect they've had on audio quality.

Fullsize Samples:

  • Patrick Bateman (American Psycho): [Reference](https://files.catbox.moe/xqobmw.wav) / [AR](https://files.catbox.moe/13kgws.wav) / [NAR](https://files.catbox.moe/xtaee0.wav) / [AR+NAR](https://files.catbox.moe/m1jeoz.wav)
  • Elizabeth (Persona 3): [Reference](https://files.catbox.moe/7k8ymq.wav) / [AR](https://files.catbox.moe/fy7dzf.wav) / [NAR](https://files.catbox.moe/lbl2jf.wav) / [AR+NAR](https://files.catbox.moe/1a1axg.wav)

Author
Owner

Oh yeah, I'll see what I can grab. For sure I'll grab the latest examples to test, and I'll load back the old models and do an eval to spit out audio with Vocos.

I've had the quarter-sized model at RVQ-level 2 baking for the week. I kept wanting to go back to the full-sized model, but there's something about the initial few hours of training the full size again with RVQ-level 2 that makes me rather just let the quarter-sized model bake for a while.


Also, I might go back to fiddling with Bark again in the UI. When I was trying to figure out the tensor shape Vocos accepted with its provided Bark example, it actually outputted decent, usable audio for me, unlike when I tried it ages ago, when it gave me unusable garbage even with the base repo. Just something to note.

Author
Owner

Alrighty, I've eval'd with Vocos. I've labeled them to make them clear so I don't have to keep referring to them by their dataset size + model size + RVQ levels:

  • model A: the old weights on the ~550+ hour dataset at 8 RVQ bins (reported AR accuracy ~85% / NAR accuracy ~58%)
  • model B: the ~3400+ hour dataset's full-sized weights at 8 RVQ bins (reported AR accuracy ~76% / NAR accuracy ~42%)
  • model B/4: the ~3400+ hour dataset's quarter-sized at 8 RVQ bins (reported AR accuracy ~75% / NAR accuracy ~47%)
  • model C: the ~3400+ hour dataset's quarter-sized weights at 2 RVQ bins (reported AR accuracy ~75% / NAR accuracy ~65%)

and... my god are the B models ass.

  • I suppose I jumped to conclusions too fast with thinking bigger dataset = faster training because the time it took to reach 72% accuracy was much shorter with the bigger dataset, but I'm still at the mercy of having to throw more training time at it to make the accuracy go up (loss go down).
    • I wonder if it's better to train one epoch on the full ~3400 hour dataset (or until 72% accuracy), and then revert to a smaller dataset and continue training off of that (and the "last" epoch back to the full dataset, for safety).
  • ...however, the same problem occurs: the AR is consistently terrible, ~~and the NAR is suspiciously good, despite its training metrics saying otherwise (the evaluation/validation aural-loss reports the NAR's loss to be <1.0). I'm not really sure why the AR is used at this point, but I'll see if it ever improves or not, despite feeling like it's deadweight.~~ Haha...
    • A bad AR will make the AR+NAR output sound worse, rather than the raw NAR output sounding great.
    • I'll need to read the paper on why the AR is used; it just seems like it's better to go with the NAR itself, but who knows.

I can't be assed to cherry pick and manually label and upload, so have the entire evaluation / validation output for each model [here](https://files.catbox.moe/eir2w2.7z) (MP3 to make it fit under 200MiB, but the hit to audio quality shouldn't be perceptible).


Haha, I remember why the NAR isn't solely used. I'm so fucking stupid for forgetting it because I've specifically noted this before in the past:

  • for training, the reference audio is fed into the NAR as the first RVQ-bin layer, which is why it's suspiciously good. Other models with "lesser" NARs will still sound good because the primary RVQ-bin layer is derived from the reference audio.
    • this also explains why the AR on its own always sounds like ass, because it's always just the first layer, missing the remaining RVQ bin layers, while the NAR eval/val output will always be the full clip.
  • it's impossible to actually use the NAR as it is, since it relies on the AR to generate the remaining residuals.
    • it's also impossible to do the same with the NAR eval/val output but for the AR, such that it can borrow the remaining RVQ bin layers from the source, because the durations will always mismatch.
  • I could rework it to be able to generate layer 1 out of thin air, but the paper explains why the AR is used for layer one (it's a good way to "figure out" the duration; if I remember right, Zucc's Voicebox has a separate model just for inferencing the duration of an audio clip, or something like that). The AR probably is better anyways, at least going by the raw loss/accuracy, so I shouldn't really sweat it. A rough sketch of the AR → NAR hand-off is below.
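
As a visual aid, here's a minimal sketch of that AR → NAR hand-off: the AR spits out RVQ level 1 token-by-token (which is also what decides the duration), and the NAR then fills in the remaining levels in parallel. The `ar_step` / `nar_level` callables and the stop token value are hypothetical stand-ins, not this fork's actual API.

```python
# Minimal sketch of the AR -> NAR hand-off (hypothetical stand-ins, not this fork's API).
import torch

N_LEVELS = 8       # total RVQ levels for the target EnCodec bandwidth
STOP_TOKEN = 1024  # assumed stop token appended to the AR's codebook

def ar_step(phonemes, prompt_codes, generated):
    """Hypothetical AR: logits over the next first-level code (plus the stop token)."""
    return torch.randn(STOP_TOKEN + 1)  # stand-in for the real model

def nar_level(phonemes, prompt_codes, codes_so_far, level):
    """Hypothetical NAR: a whole residual level predicted in parallel."""
    return torch.randint(0, 1024, (codes_so_far.shape[-1],))  # stand-in

def synthesize(phonemes, prompt_codes, max_len=75 * 12):
    # 1) The AR generates RVQ level 1 token-by-token; emitting the stop token
    #    is what decides the duration of the output.
    first_level = []
    for _ in range(max_len):
        token = ar_step(phonemes, prompt_codes, first_level).argmax().item()
        if token == STOP_TOKEN:
            break
        first_level.append(token)
    codes = torch.tensor(first_level, dtype=torch.long).unsqueeze(0)  # [1, T]

    # 2) The NAR fills in levels 2..N, each conditioned on the levels below it,
    #    which is why it can't be used on its own: something has to hand it level 1.
    for level in range(1, N_LEVELS):
        next_level = nar_level(phonemes, prompt_codes, codes, level)
        codes = torch.cat([codes, next_level.unsqueeze(0)], dim=0)
    return codes  # [N_LEVELS, T], ready to be decoded by EnCodec / Vocos
```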

I suppose the best way to train it would be something like:

  • use a very large dataset to train the models for the first epoch(s)
    • this is moreso to prioritize it "learning" how to "speak", since I remember my much earlier tests struggling to produce actual speech.
    • I feel in my tests with the ~3400+ hour dataset, it'll start to hit diminishing returns at AR loss=3.7 / AR accuracy=72%, so that'll be the good "cutoff" for this phase.
  • afterwards, prioritize having more speakers over having more data (so in my case, I would cut down the data from the audiobooks sent to me, since those prioritize lots of speaking hours over varied speakers).
    • this then should ensure it'll "learn" how to clone.
    • also, the max phoneme length cutoff can be reduced to also have the forward pass crunch through much faster with shorter sequences, and this should slightly reduce the VRAM requirement for the training data.
  • for the "final" epoch, do the full dataset again just for safety.
    • this should also fix up any problems from training on shorter clips, such as the attention layers getting mucked up.

I do wonder, though, when it'll be a good time to finetune it. It should probably fix the issue of "bad" clonability, since my goal anyways was just to use a decent base to finetune.


Thanks for throwing me a bone, it's very interesting comparing the validation output of the different models! Despite the accuracy being lower than model A, the audio quality of the validation output for the AR for model C seems remarkably clear (but inconsistently so). I wonder if this is because of vocos? On the other hand, model C's AR+NAR sounded worse than A's.

A bad AR will make the AR+NAR output sound worse, rather than the raw NAR output sounding great.

Sounds like it may be the AR then, since the NAR validation output on model C sounds comparable in quality to model A to my ear, with less garbling than either of the B models, but I haven't even read the VALL-E paper so take my opinion for what it's worth.

I do wonder, though, when it'll be a good time to finetune it. It should probably fix the issue of "bad" clonability, since my goal anyways was just to use a decent base to finetune.

I also want to remind you that me and many others would help fund cloud compute if a full general model is ever ready to train. IIRC, the other implementation trained for 4 days on an 8xA100 cluster on LibriTTS for 100 epochs, and at some point compute is going to be the bottleneck to getting a model properly trained. But, it sounds like you're still in the experimentation phase and have some kinks to work out.
-Cheers!

Author
Owner

Despite the accuracy being lower than model A, the audio quality of the validation output for the AR for model C seems remarkably clear (but inconsistently so). I wonder if this is because of vocos?

I have a theory, but it's pretty much conjecture: since model C is using only 2 RVQ bins (Encodec 1.5kbps) instead of 8 (Encodec 6kbps), there's less for the AR to have to attend to with the input prompt, so it can have "better" output from the AR. This would also explain how it seemed to have been training much, much better in a shorter amount of iterations even compared to the fullsized model (which I still need to get around to trying).

since the NAR validation output on model C sounds comparable in quality to model A to my ear

This could also have to do with there being fewer residuals to muck around with that could be wrong enough to add in some noise; Vocos will make up for the lack of additional residuals, rather than being given worse residuals. I suppose I could re-eval model A and snip off RVQ bins 3-8 to "force" it to be at parity with model C, but I think it's a little too much work for some extremely niche comparison.

I also want to remind you that me and many others would help fund cloud compute if a full general model is ever ready to train. IIRC, the other implementation trained for 4 days on an 8xA100 cluster on LibriTTS for 100 epochs, and at some point compute is going to be the bottleneck to getting a model properly trained. But, it sounds like you're still in the experimentation phase and have some kinks to work out.

A good part of it is that last line: it's just easier to keep it local, because doing this on a rental was always CBT.

  • P*p*rsp*c*'s model is the most "convenient" (ignoring the pricing) but will bend me over on a moment's notice if I'm being """abusive""" with the free GPUs (I'm not sure if it's specifically the A100-80G or even the A6000s, but I don't want to try a fourth time).
  • runpod would be so much nicer if I could just provide a Dockerfile and not have to deal with uploading docker images and the 100+GiB of bloat just to make an image to push.
    • also, getting everything set up for training is a drag: installing a specific nvcc package, copying over headers from one CUDA version to another, some other draconian magic I forgot already, all while being charged for getting training up.
  • I'm probably just schizoing out over this, but I feel the moment I do accept any assistance (monetary or compute) is when a can of worms is opened, from being "obligated" to keep training this since now it's not strictly from my resources, to some other FUD that I can't quite recall concisely.
    • I still feel rather guilty I did kinda burn a few weeks (months?) with the first run that I kept reusing the weights when I added more to the dataset.
  • the core reason, though, is that distributed training is still broken, and I really don't want to rip out all of my hair trying to figure out why it's broken and fix it.
    • from what I remember, this was always a problem, so it's nothing I broke, since I remember it behaving the same when I was initially training this with my 2x6800XTs.
    • from what I remember, it would have all GPUs active, but I think each batch is the same across all processes? Something like that.
    • I wouldn't be surprised if it was an issue with DeepSpeed. I keep getting an itch to scrap it for PyTorch (or Lightning) + BitsAndBytes, or Accelerate, but I think I'm too married to DeepSpeed's compression training method and ZeRO optimizations (because disabling any of those will throw CUDA OOMs).
Author
Owner

God I really need a better place to keep track of "news" outside of this issues thread, but people are reading this somehow, so I suppose it's fine.

Bark

I fixed Bark support with the web UI.

  • I'm not sure what fixed it, or what was breaking it, but it works now. Use the random voice to use its included voices.
  • voice cloning with it """""works""""", but it's not very good, and it's still kludgy, and I guess I had it rely on cloning bark under ./modules/ to save the .npz file associated with a voice.
    • from what I can tell now, voice cloning with Bark seems to rely on [gitmylo/bark-voice-cloning-HuBERT-quantizer](https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer), which desu is a bit kludgy to work with, and I feel there's something I'm missing with the documentation.
  • it checks for Vocos, so if it's around, it'll use that to decode.
  • If I really want to support Bark, I suppose I'll need to give it some more love, but I imagine the """scene""" already has a web UI using Bark, so it's low priority since I imagine there's something better tailored for it.

VALL-E

I grew a wild hair and pivoted to training a fullsize model again, but with the 2 RVQ bins instead of the 8. I've done some tweaking with DeepSpeed ZeRO's configuration and upped the LR some more to hopefully get it to train "faster", so the throughput should reduce that ~140+ hour ETA for an epoch to... ~120 hours. TQDM's ETA/iteration rate meter varies pretty wildly, despite it being "stable" and using the past X iterations, so I'm not sure how much to believe it.
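
For context, these knobs live in the DeepSpeed config; the dict below is purely illustrative (placeholder values, not this repo's actual settings), just to show where the ZeRO stage, LR, and batch sizing get set.

```python
# Illustrative DeepSpeed config (placeholder values, not this repo's actual settings).
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2.5e-4},  # the "upped" LR here is a made-up number
    },
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,            # the ZeRO stage trades VRAM for speed and complexity
        "overlap_comm": True,
    },
}
# Something like deepspeed.initialize(model=model, config=ds_config, ...) consumes
# this; the point is just where these knobs live, not what to set them to.
```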

I hate that I keep pivoting between the two, since, in theory, the quarter sized model is much better for how much more data I can push through + it being smaller means faster inference speeds when it does "realize", but I'm still curious as to how a fullsize will fare, even with how slow it is with inference.

I feel a little bit uneasy. I'm not sure if it's because I feel like I wasted most of my weekend sleeping, or because I'm forgetting another crucial detail that'll taint the training (like how the paper mentions the AR is fed long input prompts, while I'm being lazy and using a max of 4.5 seconds for both the AR and NAR), but I think at this point I need to stop stressing over "wasting time", since this is already a multi-month-long endeavor of trial and error.


Hello,

Just wanted to chime in and say this discussion has been a gold mine. I've spent the last hour poring through all the updates and discussions.

I've been working on my own implementation of Vall-E for some time now. I initially started off implementing something like the partial-delayed-pattern defined in the [MusicGen paper](https://arxiv.org/pdf/2306.05284.pdf). I wanted to take a stab at implementing a single AR model that can predict all the RVQ channels. This didn't really bear any fruit, so I decided to work on getting Vall-E working.

I started on my own full implementation since I wasn't too happy with any of the existing implementations (the codebases themselves were messy and lacked things like KV-caching). I still haven't managed to get results, as I'm quashing one bug at a time on my own since it's from scratch.

But after reading your progress report here, I'm motivated to see my progress through.

In any case, I have access to a good chunk of compute (up to 2xA100), so on the off chance you're being bottlenecked by compute, feel free to let me know. Since this repository is very active, let me know if I can help out in any way.

Edit: I just noticed your comment about not accepting outside compute. Will leave my comment just as a +1 on compute in the future.

Author
Owner

Just wanted to chime in and say this discussion has been a gold mine. I've spent the last hour poring through all the updates and discussions.
But after reading your progress report here, I'm motivated to see my progress through.

Glad it was of some use.

I've been working on my own implementation of Vall-E for some time now. I initially started off implementing something like the partial-delayed-pattern defined in the MusicGen paper

Funny; sometimes in my idle thoughts I think about how I could apply the general structure of VALL-E to repurpose it for music gen, but I imagine if I ever grow a very wild hair and go about it, that paper would be of use. I just imagine it being quite complex in terms of "mapping" a song to an input, and even just getting the data for it.

I wanted to take a stab at implementing a single AR model that can predict all the RVQ channels. This didn't really bear any fruit, so I decided to work on getting Vall-E working.

Yeah; the more I try and think about "how can I get away from this AR+NAR spaghetti", I remember a detail that reminds me there really isn't an alternative to being stuck with the synergy of the two, since other VALL-E-inspired neural voice synthesis seems to have some caveat when doing away with it (I can't quite recall Bark's stack, but Voicebox seems to be purely NAR, and it still needs a model for the duration of an audio clip).

Now, that paper mentions interleaving the residuals into one dimension, but without the clever way they're doing it, I'm not too sure what improvements that would offer. I suppose if I grow a wild hair I can see what it does with just the AR, since the model (and its modules) shouldn't require any tweaking, just an interleaver/deinterleaver routine. Although, I'm quite curious now if it would offer any improvements.
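
For the naive (non-delayed) version of that, the interleaver/deinterleaver really is just a reshape; a quick sketch, assuming the codes come in as an `[n_q, T]` grid:

```python
# Naive interleaving sketch (not MusicGen's delayed pattern): flatten an
# [n_q, T] code grid into one AR-friendly stream and recover it afterwards.
import torch

def interleave(codes: torch.Tensor) -> torch.Tensor:
    # [n_q, T] -> [T * n_q], ordered (t0 levels 1..n, t1 levels 1..n, ...)
    return codes.t().reshape(-1)

def deinterleave(stream: torch.Tensor, n_q: int) -> torch.Tensor:
    # [T * n_q] -> [n_q, T]
    return stream.reshape(-1, n_q).t()

codes = torch.randint(0, 1024, (8, 75))  # 8 RVQ levels, ~1s of EnCodec frames
assert torch.equal(deinterleave(interleave(codes), 8), codes)
```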

I started on my own full implementation since I wasn't too happy with any of the existing implementations (the codebases themselves were messy and lacked things like KV-caching). I still haven't managed to get results, as I'm quashing one bug at a time on my own since it's from scratch.

Yeah; I'm still not too happy with the VALL-E implementation I forked, since all my modifications are a bit spaghetti'd into the trainer, and I can't bring myself to spend a weekend to remake everything to train the model from square one with DeepSpeed (despite my qualms with DeepSpeed, I'm kinda married to it for ZeRO and the quantizers). I still have a gut feeling that there's some bug lurking around that I'll probably only stumble upon when I do rewrite the training script.

In any case, I have access to a good chunk of compute (up to 2xA100), so on the off chance you're being bottlenecked by compute, feel free to let me know. Since this repository is very active, let me know if I can help out in any way.
Edit: I just noticed your comment about not accepting outside compute. Will leave my comment just as a +1 on compute in the future.

mmm, I'll keep it in mind. I have an idea to work around that pesky distributed training bug in dual-GPU systems by having one GPU be responsible for the AR and another for the NAR, but I need to actually jerryrig the training script to allow that, and cram my 6800XTs back into my training system (which, the last time I put one back in, was a huge ordeal).


In other news, the fullsize model at 2 RVQ bins seems to be very promising*

![image](/attachments/911330e5-57a8-4484-9d47-6e6bae6c1f46)

It's only at half an epoch after I think two days(?), but it seems to be going down beyond that pesky floor of around AR loss=3.6 or so, and the accuracy is hovering around 77% so far (with a peak of 80%), while the NAR's accuracy is around 62% (with a peak of like 68%). I checked some samples at it=2500 earlier this morning, and it's decent speech with somewhat decent cloning, better than the quarter-sized run. It's just a bit of a pain, since the training process restarted twice, so it's not going to be a clean epoch (I really need to get around to storing the dataloader seed and the index it was at in the checkpoint; see the sketch below).
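
On the dataloader-resume note, a sketch of what storing the sampler seed and position in the checkpoint could look like (hypothetical helpers, not the trainer's actual code):

```python
# Sketch of checkpointing the dataloader seed/index (hypothetical helpers).
import torch

def save_checkpoint(path, model, optimizer, sampler_seed, samples_seen):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "sampler_seed": sampler_seed,   # seed used to shuffle this epoch
        "samples_seen": samples_seen,   # how far into the shuffled order we got
    }, path)

def resume_order(dataset_len, sampler_seed, samples_seen):
    # Rebuild the same shuffled order and skip what was already consumed, so a
    # restart mid-epoch doesn't re-feed (or skip) chunks of the dataset.
    g = torch.Generator().manual_seed(sampler_seed)
    order = torch.randperm(dataset_len, generator=g)
    return order[samples_seen:]
```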

...however, the actual quality of the audio seems pretty shit.

  • On one hand, it could be that it stopped using Vocos, since the reference audio sounds terrible too.
  • On the other hand, it could also just be that 500 sampling steps is too low, since the default argument for it in the AR's code is 1000.

If it really is that my ears suddenly hate 2 RVQ bins, then at the absolute worst, I can compromise by using a NAR fit for higher RVQ bins, and thus increasing quality, and this could even be a quarter sized model too. I suppose this is another victory for the AR+NAR stack, since using one model wouldn't have this advantage.

Aside from that, I have my hopes for once.


And I almost had a scare. I was looking at an old copy of my fork, trying to figure out where the AR actually cares about more RVQ bin levels, and it does, for the input prompt, [here](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/vall_e/base.py#L318). The scare came because that old copy showed it still hard set to 8, but the canonical version has it right, so all is good.

If I wanted to, I suppose I could decouple this from the outputted residual levels, so I could either:

  • have a NAR that would only care about the first two bins of the input prompt, and still output the full eight, for quality purposes.
  • have an AR or NAR that would care for the full eight bins on the input prompt, for accuracy purposes, but have the NAR target a bandwidth different from its input prompts.

Although, I'm not too sure how much of an improvement either would actually be.
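
Mechanically, the decoupling would just be two separate knobs; a tiny sketch with hypothetical names (`PROMPT_LEVELS` / `OUTPUT_LEVELS` are not real options in the fork):

```python
# Hypothetical knobs: how many RVQ levels the model sees in the input prompt
# vs. how many it is trained to output.
import torch

PROMPT_LEVELS = 2   # e.g. a NAR that only attends to the first two bins of the prompt
OUTPUT_LEVELS = 8   # ...while still being trained to emit the full eight

def prepare_batch(prompt_codes: torch.Tensor, target_codes: torch.Tensor):
    # prompt_codes / target_codes: [n_levels, T] EnCodec code tensors
    prompt = prompt_codes[:PROMPT_LEVELS]   # what the model conditions on
    target = target_codes[:OUTPUT_LEVELS]   # what the loss is computed against
    return prompt, target

prompt, target = prepare_batch(torch.randint(0, 1024, (8, 300)),
                               torch.randint(0, 1024, (8, 450)))
```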

Author
Owner

mmm, this isn't necessarily related to the model, but moreso to EnCodec in general: I probably should have done more tests on EnCodec (and Vocos), as I found a teensy little oversight. Voices that are "low quality" (robotic ones) tend to have noticeable issues with EnCodec, I'm pretty sure because there's insufficient data to resolve the nuances in the audio. Obviously, increasing the amount of residual layers "solves" most issues, but it's something I think is a little overkill for a niche that probably only I would care for.

Below are all encoded with EnCodec, and then decoded with Vocos:

  • GLaDOS: [Reference](https://files.catbox.moe/re4p89.ogg) / [2 layers](https://files.catbox.moe/i6d5lo.ogg) / [4 layers](https://files.catbox.moe/9uaqyz.ogg) / [8 layers](https://files.catbox.moe/mv6f01.ogg) / [16 layers](https://files.catbox.moe/tj96pm.ogg)
    • adding more residual levels makes things sound less harsh. 2 RVQs makes everything rather harsh, while it eases out at 16 RVQs, notably "button"
  • SHODAN (a): [Reference](https://files.catbox.moe/zrq8nb.ogg) / [2 layers](https://files.catbox.moe/j5x0f8.ogg) / [4 layers](https://files.catbox.moe/pt1khc.ogg) / [8 layers](https://files.catbox.moe/84jnio.ogg) / [16 layers](https://files.catbox.moe/umeljm.ogg)
    • the noise in "hacker" stops sounding so terrible as the residual levels increase (that god awful noise in terrible AR-only output), while aspects of SHODAN's acoustics are very muted at 16 RVQs, and pretty much removed at 8.
  • SHODAN (b): [Reference](https://files.catbox.moe/2w5z5j.ogg) / [2 layers](https://files.catbox.moe/1vlgdi.ogg) / [4 layers](https://files.catbox.moe/3w8jsw.ogg) / [8 layers](https://files.catbox.moe/i9m8vf.ogg) / [16 layers](https://files.catbox.moe/ukzxna.ogg)
    • I think this is a very good example of how well (or how un-well) EnCodec preserves all the nuances of the acoustics, as they're barely present even at 16 RVQs, and practically removed at 8. The "processing" effect in the background gets eaten away and treated as a buzz, while the warbling in the stutter of "servant" is barely there at 16, and just a stutter at 8. Bummer.
  • HEV: [Reference](https://files.catbox.moe/rlyeyh.ogg) / [2 layers](https://files.catbox.moe/77vhes.ogg) / [4 layers](https://files.catbox.moe/zofow2.ogg) / [8 layers](https://files.catbox.moe/tzp0o3.ogg) / [16 layers](https://files.catbox.moe/7kfuim.ogg)
    • Surprisingly, I can't tell any discrete issues outside of all re-encoded audio losing a very small nuance in the audio, but you have to really be looking out for it to notice.
  • James + Mary SH2: [Reference](https://files.catbox.moe/91woh1.ogg) / [2 layers](https://files.catbox.moe/9lqrk2.ogg) / [4 layers](https://files.catbox.moe/9bk79b.ogg) / [8 layers](https://files.catbox.moe/deffzs.ogg) / [16 layers](https://files.catbox.moe/oyuysu.ogg)
    • same as the HEV sample, there's functionally no difference between the levels, but they're all missing a very small nuance in the "Lynchian" very slight reverb effect that's applied to every line in Silent Hill 2. Again, you'll really need to have your ear tuned to notice without hearing the reference.

This multi-hour endeavor caught my attention when I was trying to dig for a specific speaker amongst all of my outputted audio and noticed the reference audio exported in the evaluation output sounding very terrible. I also had some other notes, but those issues actually were because I had remuxed the outputs to MP3s (blame the rotational velocidensity emergent in MP3s), so I remuxed them to OGGs instead (because I needed a format that Chromium wouldn't automatically download, and would instead play in an HTML5 audio element without downloading).

In hindsight, though, it seems it's just a lot of nitpicking for a very niche application, since EnCodec works fine for normal human speech and not voices that are processed in unnatural ways. 2 RVQ bins should inherently be fine with Vocos, and if I really need every aspect of SHODAN's acoustic nuances preserved, then fuck me I guess, I'll just need to bake a NAR that outputs more residual levels.
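
For reference, the round trip behind the comparisons above looks roughly like the snippet below. It follows the public EnCodec and Vocos examples; the checkpoint name and the bandwidth-to-RVQ-level mapping are my assumptions, not anything pulled from this repo.

```python
# Rough EnCodec-encode -> Vocos-decode round trip (based on the public examples;
# checkpoint names and bandwidth mapping are assumptions, not this repo's code).
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from vocos import Vocos

encodec = EncodecModel.encodec_model_24khz()
encodec.set_target_bandwidth(1.5)  # 1.5 kbps ~= 2 RVQ levels; 6.0 ~= 8; 12.0 ~= 16
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

wav, sr = torchaudio.load("reference.wav")
wav = convert_audio(wav, sr, encodec.sample_rate, encodec.channels).unsqueeze(0)

with torch.no_grad():
    frames = encodec.encode(wav)
    codes = torch.cat([code for code, _ in frames], dim=-1)   # [1, n_q, T]
    features = vocos.codes_to_features(codes.squeeze(0))
    # bandwidth_id indexes [1.5, 3.0, 6.0, 12.0] kbps for the EnCodec-based Vocos head
    audio = vocos.decode(features, bandwidth_id=torch.tensor([0]))

torchaudio.save("decoded_2rvq.wav", audio, 24000)
```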

Author
Owner

Weekly evaluation time:

![image](/attachments/df87da58-1fcd-4c41-86db-0ea777f29032)

Naturally, reducing the amount of RVQ bins to attend to makes it much better for the model. In my head I've been trying to figure out if it's because:

  • the attention can narrow in better on the bins that matter (the first two) rather than "find" meaning in the later bins that don't actually matter. For the AR, this could greatly harm its output, since why should it really care about the later residual levels when it's only outputting the first one. I suppose if I had a spare machine, I could verify if this is true by having the AR only receive the first residual level from the input prompt.
    • ...but I think at minimum 2 levels are needed. Given the pure AR output, there's barely enough information as-is in the raw output.
  • for the NAR specifically, when computing the loss from the logits, any deviations in the higher levels (the ones that matter less and less to the final audio) will be treated equally to the lower levels (the ones that matter more).
    • I'll also need to re-audit the model code in more depth, but I wonder if the AR is only comparing against the first RVQ layer of the reference audio. The NAR does, but I can't quite see it also apply for the AR.

I'll probably need to figure out if the lifeiteng/vall-e implementation does have any fancy witchcraft that does de-emphasize the higher residual levels, or if the VALL-E paper does mention any mechanism for doing so, but I doubt it. I imagine, like every nut in the ML space, the paradigm of throwing moar compute at the problem bandaided a problem no one really noticed. But again, I am in no way an expert at anything related to machine learning or whatever you call it. I'm just a programmer.
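
If de-emphasizing the higher levels ever does seem worth trying, the crude version is just a per-level weight on the NAR's cross-entropy. A sketch of that idea (purely illustrative; not something the paper, this fork, or lifeiteng's implementation is confirmed to do):

```python
# Illustrative per-level loss weighting: make level 1 matter most, level 8 least.
import torch
import torch.nn.functional as F

def weighted_nar_loss(logits, targets, decay=0.75):
    """
    logits:  [n_levels, T, vocab] per-level predictions
    targets: [n_levels, T]        ground-truth EnCodec codes
    """
    n_levels = logits.shape[0]
    weights = decay ** torch.arange(n_levels, dtype=torch.float32)  # 1, 0.75, 0.56, ...
    losses = torch.stack([
        F.cross_entropy(logits[l], targets[l]) for l in range(n_levels)
    ])
    return (weights * losses).sum() / weights.sum()

# quick smoke test with random tensors
logits = torch.randn(8, 100, 1024)
targets = torch.randint(0, 1024, (8, 100))
print(weighted_nar_loss(logits, targets))
```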

The outputs are... mmm.

  • the outputs are mostly fine. Like, the 78% accuracy it averages around is fine. It consistently at least speaks English, but it's not uncommon for a slight lingual error somewhere. There's a few outputs that have a crackle or a pop that poorly trained checkpoints will spit out, but not many, thankfully; when there are some, though, they're very distracting.
  • it definitely favors speakers from audiobook sources. I'm not sure if there was a regression that made it outright disfavor the non-audiobook speakers (the ones I've sourced myself), but it makes a bit of sense, because a huge majority of the dataset is from audiobooks. I imagine I could always train a few epochs on the pre-LibriTTS dataset, but who knows.
  • despite it favoring the audiobook speakers, clonability of them is a bit inconsistent. Some speakers will clone rather closely, some not so much. The validation dataset (reminder: this is data the model does not train on, fresh data) is pretty terrible for cloning. I suppose the model at the very least can be used as a normal neural voice synthesizer, and it's still a bit shocking hearing it speak English fine.

Regardless, here they are in their non-cherrypicked form from the last 1250 iterations: [here](https://files.catbox.moe/mn2dnt.7z)

I think when the model is actually trained on a guaranteed full epoch (rather than an epoch's worth of data), then:

  • I'll probably go about seeing if finetuning for a specific voice will prove fruitful. I just worry about doing it when it's not baked enough. I imagine if it works, then the model should be fine enough to actually finally release, with the huge caveat that it needs to be finetuned, as pure zero-shot will need more time (and probably more speakers).
  • I'll see how the AR pairs with a NAR that outputs the full remaining RVQ bins, since I feel like the audio output at 2 RVQ bins isn't quite consistently good. The audiobooks sound quite narrow, despite my cherrypicking in the previous post sounding fine. The plus side of the AR+NAR being split is that I can do this without needing to completely retrain from scratch, and I do still have the previous NAR.
    • however, I'm not sure if I should also bother with a NAR that only attends to 2 RVQ bins from the input prompt, but outputs and functions for the remaining RVQ bins. Although, now that I think about it, I wonder if the NAR does actually have an inherent mechanism to attend to specifically one level given the embeddings do bind to a specific level.

I think the crux of the matter that's been nagging me for the past two days is the (possibly) unbiased attending to levels that don't really matter all that much, and the losses computed on those same levels dragging down the rest of the model.


Out of interest, do you eventually plan to release a stable model once you've got it working to a good level?

Author
Owner

> do you eventually plan to release a stable model once you've got it working to a good level?

Yeah.

My primary fear is that it'll just end up not performing as nicely as TorToiSe, either as a zero-shot model or as a model for finetuning. There's some decent output from the eval / val output, but not consistent enough for me to be confident in it at the moment.


I'm waiting until tomorrow to do my weekly evaluation on how the model is shaping up to be. I have it training for a few days on a dataset with the donated audiobooks pruned to try and bolster the other 2439 speakers (the 572 hours one). If the model isn't fucked, then I suppose I can move onto trying to finetune it and seeing how it fares. I honestly have zero idea on how it'll go.

But for sure, if I need a zero-shot model, I need both more speakers, and to modify my training strategy.

Author
Owner

mmm. I did the daily evaluation listen-through, and this was quite a gem to hear (ironic, given what's said):

* G-Man (Half Life): [reference](https://vocaroo.com/19c0CIz8eKDb) / [output](https://vocaroo.com/17XNKwoqpLbF) / [prompt](https://vocaroo.com/17UwJLe6UuwL)

Given a seemingly utterly useless prompt it selected and trimmed randomly, it assumed it was actually the HEV suit and cloned that voice. I think I should be a bit concerned, but I can't really discern what the implications of this are right now. Well, at the very least:

* this definitely means it *can* associate the acoustics of an input prompt with a speaker of similar acoustics and reference it later, which is a net good for finetuning.

I am throwing some money and compute at this to try to reproduce the vall-e paper with the full 50k hour dataset. Any sense on whether this repo will achieve parity with that? (or do worse, or even outperform?)

The full model outputs from the original paper seem a little sus in places but great in other places. Do you think it will outperform open source repos like tortoise and piper?

Author
Owner

I swear the moment I started typing this, my noggin decided to really not want to express my thoughts, so bear with me if it sounds rough.

> Any sense on whether this repo will achieve parity with that? (or do worse, or even outperform?)

I'm pretty confident that [my fork](https://git.ecker.tech/mrq/vall-e) should, at the very least, perform at parity with the original VALL-E's samples. The only hindrances with it right now are:

* training time / compute.
  - my downfall is being so stubborn and coping with my 4070Ti.
* a sufficiently large / ""diverse"" enough dataset.
  - a dataset at parity with TorToiSe's training set (~50K hours, if I recall) or LibriLight's full 60K hours should definitely yield a competent enough zero-shot model. The [lifeiteng/vall-e](https://github.com/lifeiteng/vall-e/) homebrewed model fails at zero-shot because it was only trained on ~550 hours / ~2000 speakers.
* distributed training, which I still think is broken, but I'll need to dedicate a day to actually ensure it works.
  - and I think I need to spend a day, as well, to clean up things to make it user-friendly to train and inference with. It's usable right now, but there are some *eccentricities* with using it that I may or may not have accepted as normal.

I don't think my fork / the base-I-forked-from's model code deviates all that much from the paper. I'd say the only intentional deviation is me training my model at EnCodec's 1.5kbps rather than the 6kbps the paper uses, but Vocos helps supplement the loss in inherent quality.
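
For reference, a minimal sketch of quantizing audio at that bandwidth with the `encodec` package (at 1.5 kbps the 24 kHz model yields 2 codebooks); the file path is just a placeholder:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)   # 1.5 kbps -> 2 RVQ codebooks

wav, sr = torchaudio.load("utterance.wav")                        # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)   # resample to 24 kHz mono

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))                       # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)                 # [1, 2, timesteps]
print(codes.shape)
```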

> Do you think it will outperform open source repos like tortoise and piper?

Definitely. I've been very pleased with my results, when they do crop up. Out of all the outputs I've listened to during training, I don't think I ever really had any of the issues that cropped up with TorToiSe.

Although, I could just be dickriding VALL-E. I just *really* like how elegant and robust the stack is, and that it doesn't need any intermediary to represent a voice (like TorToiSe's conditioning latents or Bark's whatever-it's-called). It Just Works.

Author
Owner

Thank fucking god, finetuning is quite a success for how little time I did run the training on, although:

* I only ran the training script for, like, an hour.
* I did zero tweaks to the hyperparameters. To reiterate:
  - I kept the LR at the same 2.5e-4 I used for training.
  - the same batch size and gradient accumulation factors were used, to where it would be a few epochs' worth of data before the parameters are updated.
* the optimizer states are reused from training, although I don't think it matters much.
* this kinda shows how "meh", for lack of a better term, using 2 RVQs / 1.5kbps is. I could *probably* either add in another NAR model for levels 3+, or see how the mixed quantization levels approach I mentioned would fare. Although, it could just be because I lazily converted to OGGs. I'll need to double check that Vocos was used again.
* there are some ***slight*** issues with the speech in general, but I imagine a better base would help, or properly finetuning it, rather than my brutish approach.

But here's the quick test output from it: [evaluation / validation output](https://files.catbox.moe/kk7kav.7z). The impressive portion is the validation output, where it pretty much just uses the transcript from the validation dataset, and I ***suppose*** it's the same as TorToiSe, where you can pretty much """mix""" a voice of a given input prompt with the voice you finetuned on, since there's some variance in the output.

I'll try and finetune on some other voices, and if that works, I'll have to try and fix the web UI to get inference working again, since I think I broke it when I added Vocos. And then most likely I can release my weights to finetune from.


I also had some notes from my evaluation I was going to post yesterday (or the other day) but elected to put off for my weekly evaluation; I'm a bit fried on 4 or 5 hours of sleep, though, and seeing that finetuning is pretty possible made me forget what I was going to report on anyways.

Author
Owner

Another test with a partial finetune of GLaDOS (iteration 250, batch size 8). Some validation outputs I found quite interesting:

* [output](https://vocaroo.com/18cMfxHFR49p) / [reference](https://vocaroo.com/15LtpDCLS0yv)
* [output](https://vocaroo.com/17mD2RoyQWya) / [reference](https://vocaroo.com/1htt9OTTJK83)

With a little bit of finetuning, a lot of voices it would receive as an input prompt will carry over additional traits of the finetuned voice. My ear isn't quite tuned to take note if the acoustics themselves changed too, but the general speech of each voice changes to the target (GLaDOS). And this isn't even a full finetune yet.

I used a much lower LR of 1.0e-5 and a gradient accumulation of, I think, 24, just so it would finetune a little nicer.

Figured I'd share it before I go fuck off again with more finetune tests, but I'm very, very pleased even with a ***barely adequate*** base.

I tried a finetune with SHODAN but I didn't get favorable results. I'll have to try her again with less aggressive hyperparameters.


Have you messed with Mangio's RVC fork? https://github.com/Mangio621/Mangio-RVC-Fork

I notice that if I run output from here through a model trained on a similar dataset, it improves the overall quality even more and makes the voice more consistent throughout the speech, using the Harvest model. It also allows for dynamic pitch modification on audio input.

Author
Owner

> Have you messed with Mangio's RVC fork?

I've actually thought about running it through RVC to see how things get cleaned up. The output (finetuned or not) is *fine*, but the actual audio quality is a bit lacking and there are occasional issues in the actual speech here and there, so I imagine running it through RVC would help clean up things a lot. If it works out, I suppose it'll end up getting added to the web UI anyhow, and it can be used for TorToiSe in lieu of VoiceFixer (which I would like to replace, since for however long I've had it in the stack, it would consistently have some crackle at the end).

It would be a nice way to try and bridge the gap between "fine" enough output from my VALL-E model and "good enough to use", as I worry manually training the model some more would take an astronomical amount of more time (or data).


> > Have you messed with Mangio's RVC fork?
>
> I've actually thought about running it through RVC to see how things get cleaned up. The output (finetuned or not) is *fine*, but the actual audio quality is a bit lacking and there are occasional issues in the actual speech here and there, so I imagine running it through RVC would help clean up things a lot. If it works out, I suppose it'll end up getting added to the web UI anyhow, and it can be used for TorToiSe in lieu of VoiceFixer (which I would like to replace, since for however long I've had it in the stack, it would consistently have some crackle at the end).
>
> It would be a nice way to try and bridge the gap between "fine" enough output from my VALL-E model and "good enough to use", as I worry manually training the model some more would take an astronomical amount of more time (or data).

What I've found more specifically is that I can skate by with faster output from here (lower samples and lower iterations) because RVC seems to "boil down" the input audio and then reapply its own latents to it. If the input audio is already in the ballpark, then it will come out nicer.

How do I know this? I have tortoise trained on one dataset and RVC trained on a different dataset from 20 years in the future (same speaker). Despite the sound difference due to age, it can still blend very, very well on a different dataset's output because the speaker is the same. I've tried likewise using the same dataset for both, and of course it sounds good as well, but I just prefer the voice from the two datasets blended in my case.

I definitely can understand the challenge of trying to train two models... RVC takes a couple hours in my experience for 200ish epochs. That said, it's mandatory for me now because the quality is just night-and-day better as a final polish. Oh, and I also normalize the audio volume in between.

Author
Owner

Oh right, I forgot to do my weekend assessment. I swear the moment I started playing around with finetunes, I had other endeavors I needed to attend to, and my brain just stopped wanting to keep up with lengthy posts and went mush, but I did get enough out of the tests to use for my weekly assessment.

1. To reiterate, I am reminded of how, in the land of LLaMA, models are reported in terms of token count, and how the first epoch is what really matters (*maybe* two, maybe). I really do not think I should keep trying to let things train and "see how it goes". It's cope. From the couple of days I had it train on the non-audiobook/reduced dataset, it doesn't seem to really get any better, and if anything I feel like it kinda got worse. I only have it training on the reduced dataset again because I couldn't really attend to the model the past few days, so I might as well just use the time to see how it'd progress. It looks *fine*, but I'm not expecting the output to get better.
2. Finetuning *works*, and I'm thrilled it does. Even in my partial tests where I didn't really do things right, I'm a bit amazed that it mostly just works. It can't replace TorToiSe outright, right now (for reasons I'll get into later). I honestly forgot the specifics of finetuning, but the thoughts I gave before still apply and haven't changed.
3. I don't think I *should* release the weights as they currently are, as there are some inherent problems with the model (though they can be fixed within the dataset itself). They cropped up when I was using the web UI to generate outputs, both with the "base" model and the finetuned model, and it's mostly a mix of:
   * phonemizer: the current phonemizer is kinda shit. I guess I don't have punctuation preserved, and trying to "pad" with spaces instead of commas doesn't work. I never liked my actual phonemizer approach anyway, so it gives me an excuse to axe it. I'm pretty sure any pauses are incidental from what it "learned" from existing phoneme sequences. I think the issue lies in the fact I'm phonemizing normalized text, which erases punctuation.
   * text length: I suppose this suffers the same problem as TorToiSe, or rather the issue of attention layers, but the length of the segments trained on will definitely determine the maximum length of your output. Exceeding it is bad news. This can be remedied by training on data with longer text / utterances, and in code you can adjust the hard limits, as they're re-using TorToiSe's hard limits. Also, ensure your training YAML has a sufficiently large `max_phoneme_length` value or whatever.
   * padding around the audio: the outputted audio leaves very little spacing around the output, so in the web UI, when it combines lines, it sounds unnatural and dogshit. A fix within the dataset is to use less tight slices, but I think in code I can pad it with tokens.
   * inferencing: it's kind of a pain, since the "easy" way requires leveraging a voice that was already processed in the `Train > Prepare Dataset` tab, like I have to do with Bark, but it's just to grab input prompts. TorToiSe at least has the beauty of "averaging" out all utterances in the latents. However, I do have ideas to make it work "better" by using the transcription to pick the closest utterance to what you're trying to output using embedding comparisons (see the sketch after this list).
   * output quality (speech): the output doesn't sound very good in practice. Zero-shot sounds terrible, and it heavily favors the voices from the LibriTTS dataset (or audiobook-y speakers), where it sounds much better than with the voices I've sourced, but it's still not that impressive. Mitsuru sounds *fine*, the Half-Life tram speaker voice doesn't work, GLaDOS is outright wrong, but SHODAN somehow has some semblance to her, if you ignore the terrible noise it tries to replicate. The speech is typically stilted; it sounds like an alien trying to mimic speech sometimes, where it's better with audiobook-y speakers but falls apart with voices it's not too familiar with. The finetunes helped a lot in that regard, but as a zero-shot model, the one I baked isn't very good.
   * output quality (raw waveform): it's... serviceable at best. I guess I fell for the meme that 2 RVQ bins is good enough. However, because the additional RVQ bins lie in the NAR, I *could* always chain NARs together: have a strong one for the 2nd layer, and then another one that handles the rest. I could look into reusing the one I have for 8 RVQ bins.
4. On the flipside, inferencing is remarkably fast, at least on adequately lengthed text. The AR will get slower the longer the text is (naturally), but still, it's fast and easy. I can also leverage batching by combining multiple lines and inferencing at once, but even without it, I was able to chew through a ton of lines quickly. Inference speed at least isn't tied to the actual model quality itself, so that's a plus.
5. I think I alluded to having to re-evaluate my training process, because I kept entertaining the idea of, when a speaker gets sampled, doing it with replacement, so that the LibriTTS and audiobook speakers don't entirely overpower the speakers I've sourced myself. I don't think it actually matters. I mentioned earlier that what seems to really matter is new data, not repeating the existing data with slight variations through changing the input prompt. Instead, I think the "paradigm shift in training" is pretty much fixing the issues I've raised above in the dataset itself. What good is a fancy new way to train if the data it's trained on is bunk.
6. I'm going to need to source LibriLight, probably the full 60K hours. The model currently is pathetic when it comes to zero-shot, and a strong base will lead to an even stronger finetune; or, in other words, finetuning does help a lot, but it can only do so much currently.
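
On that inferencing point: a rough sketch of what picking the closest utterance by embedding comparison could look like. This isn't wired into the web UI; the embedding model choice and the `transcriptions` mapping are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

def pick_closest_utterance(target_text, transcriptions):
    """
    transcriptions: dict mapping an utterance's audio path to its transcribed text.
    Returns the path whose transcription is most similar to the text being synthesized,
    so that utterance can be used as the input prompt.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice
    target = model.encode(target_text, convert_to_tensor=True)
    paths = list(transcriptions.keys())
    corpus = model.encode([transcriptions[p] for p in paths], convert_to_tensor=True)
    scores = util.cos_sim(target, corpus)[0]          # cosine similarity against every utterance
    return paths[int(scores.argmax())]
```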

I'm a bit bummed, since I'm definitely going to have to retrain again, but it only took a few days at least for one epoch (the fullsized model didn't seem to benefit in the overall time it took to train one epoch). But hey, it's still a learning process.

What I'm not looking forward to is processing LibriLight. Processing the 2000+ hours of audiobooks I was donated ***barely*** fit on my SSD when processing the full thing, and the process took quite a while if I remember right. What I'm also not looking forward to is trying to nail down my tweaks to the dataset creation process. I think I can get away with re-phonemizing the text, since I still have my transcriptions, but I don't know. It's kind of daunting.

I'm also hindered a bit with doing anything outside of a terminal on my training system, as I made the grave mistake of `pacman -Syu` and now I can't get Chrome Remote Desktop working again, and the fixes of the past won't work now. I refuse to settle for VNC.


***However***, [RetNet](https://github.com/microsoft/unilm/tree/master/retnet) seems like a new shiny toy that'll come around soon, and it seems there's an [implementation](https://github.com/Ronsor/nanoretnet/tree/master) for it too. I am not looking forward to trying to cram it in place, as the actual architecture of a model is my weakest point (again, I am not an expert, I'm just a programmer). However, I'm kinda certain that it can sort of drop in place for TorToiSe, so that sounds like a distraction I can indulge in.

Author
Owner

Don't know how I missed this, I guess it was submitted in the middle of my writeup.

> What I've found more specifically is that I can skate by with faster output from here (lower samples and lower iterations) because RVC seems to "boil down" the input audio and then reapply its own latents to it. If the input audio is already in the ballpark, then it will come out nicer.

Ah right, that reminds me, I need to check for certain whether the `max_steps` passed into the AR actually sets the length of its output or if it's just a quality knob, although I'm very sure it's the former, since there's no sampling done.

The thing with TorToiSe, and I don't think I ever caught on until much, much later, is that the "sample count" for its AR are technically samples insofar as you're picking the best out of the bunch, but inherently aren't what boosts quality. What the TorToiSe stack does is generate a bunch of potential candidates (which is where it takes the most time and VRAM), and then the "best" of the bunch gets compared through the CLVP/CVVP. I still think it's an interesting approach, but it's still a cope approach.

* the sort of catch is that, for finetunes, this approach doesn't seem to make much of a difference, because finetunes make the variance between choices smaller. I always found finetunes to never make a difference when increasing the AR sample count.
* sort of incidentally, I think my VALL-E tests the other day kind of also have this issue, where the base model has a ton of variance between samples, but finetuning narrows down said variance.
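
For what it's worth, the best-of-N-plus-re-ranking idea described above boils down to something like the following; `generate_candidates` and `clvp_score` are stand-in names here, not actual TorToiSe entry points:

```python
import torch

def pick_best_candidate(text, voice_latents, generate_candidates, clvp_score, num_samples=16):
    """Generate `num_samples` AR candidates and keep the one the scorer likes best.
    `generate_candidates` and `clvp_score` stand in for the AR sampler and the
    CLVP text/speech scorer; the real stack does this internally."""
    candidates = generate_candidates(text, voice_latents, n=num_samples)  # list of candidate outputs
    scores = torch.tensor([clvp_score(text, c) for c in candidates])      # higher = better text/audio match
    return candidates[scores.argmax().item()]
```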

> I definitely can understand the challenge of trying to train two models... RVC takes a couple hours in my experience for 200ish epochs.

That doesn't sound too bad. I didn't take exact notes on how long I was running finetunes on the weights at the time, but it felt like a mix between "wow, my iteration rate is definitely faster than TorToiSe" and the reality of "my god this is actually going to take a bit, since the losses don't seem to go down as far as they do when finetuning TorToiSe".

But yeah, having to tote an additional model to finetune and support is a bit daunting to try and implement into the web UI. I still feel guilty about having VALL-E training """supported""" through it when I have never actually used it that way, since it's just better to train from the command line instead.

But I'll keep it in my mind at the end of my next training run, hoping that there aren't any more tweaks needed.

> Oh, and I also normalize the audio volume in between.

Fuck, that's right. The one thing I forgot to do with my training dataset is normalize the audio volume. Another thing on my to-do list, I suppose.
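
If I do get around to it, it'd probably be something as simple as a peak-normalization pass with torchaudio (the target peak here is arbitrary):

```python
import torchaudio

def peak_normalize(path, out_path, peak=0.95):
    """Simple peak normalization so every clip sits at roughly the same volume."""
    wav, sr = torchaudio.load(path)
    wav = wav * (peak / wav.abs().max().clamp(min=1e-8))   # scale loudest sample to `peak`
    torchaudio.save(out_path, wav, sr)
```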


And a sort of addendum note to my last report. I was mulling over it, and I don't quite understand why my inference output sounded that gimped on the base model (after training it a bit with the non-audiobook dataset), yet all the finetunes didn't seem to have that glaring issue (or rather, what I remember of it). All of the validation output sounded fine (semi-especially for the finetune tests), but I think the last I checked of it was before I pivoted to a reduced dataset. So I might actually need to do my tests again from the base model before that, and if it's better, then I really did fuck it up with training on the reduced dataset. At least that checkpoint is still around.


Thanks, mrq, as always! Reading your writeups is always an interesting bit of my day. I don't have as many hobbyist experts around, and it's nice to read something with that level of passion... even though you're much farther along than me!

I'm also very excited, as I just noticed that RVC has a new pitch extraction model called rmvpe... I can't find much info on its technical specifics, but it is MUCH faster for the voice conversion!!! As in, 10 minutes of audio converted to the target speaker in under 1 minute of processing time. Faster than real time!


Can vouch for the Tortoise TTS to RVC rmvpe pipeline. It gives results on par with, and sometimes even better than, 11labs; sounds absolutely amazing.

Author
Owner

Alright, I'll see about adding it into the web UI if I get a moment over the weekend. I think it should be easy to slot in something, at least, if I don't spaghetti over the web UI code again. I really need to rewrite it, but that's another day.


I got around to listening to the evaluation / validation output (for real this time) from the model while I was training on the reduced dataset, and after I pivoted back to the full dataset, and... my god, was that a mistake. It's pretty sad hearing the general quality degrade over time, despite the loss / accuracy being about the same. I guess that strat won't work very well, although I don't know if it just means additional epochs will eventually degrade it (unless I train at a smaller LR).

But here's the "final" graph for the past two-ish weeks. You can see the points where I pivoted between the two, where the loss shifts a small amount. But I suppose I'll have to shelve this model, as it's still inherently flawed: ![image](/attachments/c60c295d-1649-4e51-99d0-258d4c6c9ca1)

I did fix a few issues on the inferencing side within half an hour of fucking with it again:

* I normalized my input text, and it seemed to help a bit, although I need to go back and re-phonemize my training transcription to include punctuation (sans periods). I think commas would help the model figure out pauses, while question marks and exclamation marks will hint at appropriate inflections.
  - This also seemed to help "fix" whatever general speech issues I was having. I wonder why, specifically.
  - I also don't seem to really have much issue with "long" sentences. I also wonder why.
* I figured out the "inconsistent voices between lines" issue when stitching. In the web UI's `generate_valle` function, I don't have it reuse the same input prompt, so any voice it's already not so familiar with will vary greatly if the reference voice clips it pulls from vary enough. I also lied: you don't actually need to process your voice and transcribe, it handles `./voices/{voice}/` fine.
* "samples" mapping to step count might as well be removed. Step count is what determines the length of the sequence the AR generates, so make it any shorter than it should be and the clip gets truncated too early. The AR temperature has very little room to play around with too, so I should just fix it to 1.0 and have the slider only determine the NAR's. So, reducing the step count is a bad thing, and I should just see about having it be a `while True:` instead in the appropriate model files (see the sketch after this list).

I did a "lingual test" with some Harvard sentences on the Mitsuru P3 finetune, since an adequate finetune seems to help a lot, and it's not that bad (the audio quality leaves a lot to be desired, however, but that's just from being 2 RVQ bins, I'm sure):

* https://vocaroo.com/11v98eckUpVh
* https://vocaroo.com/1oR8YVowSujr

Anywho, I think I need to:

* figure out a better way to normalize text while keeping punctuation intact. Whisper's English normalizer will eat them, while the "basic" one might not be very good.
  - re-phonemize everything. This should be easy, as my transcriptions are still around.
* figure out an elegant way to pad the reference audio. I think I'll just run a silent waveform through EnCodec and stick each half on each end to "pad" it (see the sketch after this list). I can either pad it within the HDF5 dataset, or at sample time in the dataloader.
* finally get around to including LibriLight. The extra speakers will ***really*** help bolster the zero-shot capabilities, and the time it takes to train a full epoch will probably help bolster the model and not have it be kinda crunchy at times.
* play around with slotting out NARs and see how it affects the output quality. If it seems that a lesser-trained NAR doesn't matter all that much, then I might be able to only train the AR and not worry so much about another NAR. Also, play around with multi-NARing.
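
A minimal sketch of that silence-padding idea, assuming the stored codes are `[num_quantizers, timesteps]` EnCodec codes at the same 1.5 kbps bandwidth; none of this is in the repo yet:

```python
import torch
from encodec import EncodecModel

def pad_codes_with_silence(codes, pad_seconds=0.25):
    """
    codes: [num_quantizers, timesteps] EnCodec codes for one utterance.
    Encode a short silent waveform and stick half of its codes on each end,
    so stitched lines get a little breathing room.
    """
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(1.5)
    silence = torch.zeros(1, 1, int(model.sample_rate * pad_seconds))    # [batch, channels, samples]
    with torch.no_grad():
        frames = model.encode(silence)
    silent_codes = torch.cat([c for c, _ in frames], dim=-1)[0]          # [num_quantizers, T_pad]
    half = silent_codes.shape[-1] // 2
    return torch.cat([silent_codes[:, :half], codes, silent_codes[:, half:]], dim=-1)
```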

I don't know if I can get away with resume training with the above changes, and see how well the model adapts (maybe, it seemed to be fine with finetuning), or if I should listen to my gut and stop trying to concat onto the weights.

Although, I just realized I might not get much free time the next week, but I'll see what I can do while waiting for LibriLight to transcribe, since I should get to doing that in the background.

Author
Owner

Off to a good start I feel with the new dataset.

* disabled normalizing the text in the dataset processing routine where it phonemizes (something like `normalize = False` somewhere; I don't have a line number because my copy hasn't been committed to the repo in a while). I don't even remember why I had the text normalized, since everything is inherently normalized through Whisper anyhow (see the sketch after this list).
* downloaded [LibriLight](https://github.com/facebookresearch/libri-light/blob/main/data_preparation/README.md)'s `medium` (as I already had `small`).
* did some pre-processing to format the filenames under a similar LibriTTS style of `${speaker_id}_${book_id}`, so that I can have another script eventually prune the duplicates in LibriLight (since the LibriTTS copies are higher quality).
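
For context, the gist of keeping punctuation when phonemizing with the `phonemizer` package looks something like this; the separator choice here is just illustrative, not what the repo uses:

```python
from phonemizer import phonemize
from phonemizer.separator import Separator

text = "Hello there! How are you, today?"

# Phonemize to IPA with espeak-ng, keeping punctuation instead of normalizing it away.
ipa = phonemize(
    text,
    language="en-us",
    backend="espeak",
    preserve_punctuation=True,   # keep , ! ? so the model can learn pauses/inflection
    strip=True,
    separator=Separator(phone=" ", word=" | "),
)
print(ipa)
```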

And then I realized two core issues.

* I'm going to ruin my remaining disk space if I convert the FLACs into WAVs. I can easily load them from FLACs, no problem. But AIVC will save a preprocessed copy from `./voices/{voice}/` into `./training/{voice}/audio/` as PCM_S 16-bit WAV at 24K, and then the slices (because TorToiSe/DLAS requires it this way). I was *barely* able to make the donated audiobooks work with some nasty kludge in my code, but I don't think I can do that here.
  - ~~I think my solution is to simply load the FLAC (torchaudio under the `soundfile` backend loads it fine, it just can't save it), and do the slicing and quantizing in memory, rather than slice to disk and load those slices to quantize.~~ Implemented having it load directly from `./voices/{voice}/`, do the resampling and slicing if necessary, and then quantize the audio to disk for VALL-E backends (see the sketch after this list).
* Training time. The ~3400 hour dataset took about 4 to 5 days to process an epoch, which I didn't necessarily mind (compared to the other implementation's homebrewed model needing 8xA100s for 4 days for 100 epochs to get to where it was, I have a ***huge*** uplift with my single 4070Ti). But throwing in LibriLight-6K would 2.6x my dataset, so I'd say I have an ETA of two weeks to crunch out an epoch. Kinda making it pretty hard to keep coping about my stubbornness, but oh well.
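
Roughly, the in-memory path amounts to something like this; the backend call and the slicing bounds are illustrative (newer torchaudio versions select the soundfile backend differently):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

torchaudio.set_audio_backend("soundfile")   # soundfile reads FLAC fine

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)

def quantize_flac(path, start_sec, end_sec):
    """Load a FLAC, slice it in memory, and quantize straight to EnCodec codes,
    skipping the intermediate 16-bit WAV copies on disk."""
    wav, sr = torchaudio.load(path)
    wav = wav[:, int(start_sec * sr):int(end_sec * sr)]               # slice in memory
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)   # resample to 24 kHz mono
    with torch.no_grad():
        frames = model.encode(wav.unsqueeze(0))
    return torch.cat([codes for codes, _ in frames], dim=-1)[0]       # [num_quantizers, timesteps]
```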

I have the transcription process running for LibriLight-6K while I end up doing nothing from choice paralysis. I *might* fuck around with RVC on my actual desktop, since I can't really touch the web UI on my GPU slave right now; the "new" way I'm transcribing/processing is to just call `cli.py` on every voice, since doing a bulk transcribe/process will eventually make the process hang up and die (although maybe more system RAM would fix it, but better safe than sorry now).

Author
Owner

Fug, I didn't get a chance to play around with RVC. Whatever time I did have was spent between other things: getting the LibriLight-6K hours transcribed and processed, everything re-phonemized, and the bounds for a slice to be processed increased (instead of using TorToiSe/DLAS's text and duration lengths, I'll just have them determined in the YAML at train initialization time).
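
Determining the bounds at train initialization time presumably amounts to filtering the dataset against limits read from the YAML instead of baking them in at transcription time; a rough sketch under assumed key names (none of these fields are the fork's actual schema):

```python
# Rough sketch: prune samples at dataloader init using limits from the training YAML,
# rather than hard-coding TorToiSe/DLAS-style text/duration caps at transcription time.
# The YAML key names are illustrative assumptions, not the fork's actual schema.
import yaml

def load_filtered_entries(metadata, yaml_path):
    # metadata: list of dicts like {"path": ..., "duration": seconds, "phonemes": [...]}
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f)

    min_dur = cfg.get("min_duration", 0.0)        # seconds
    max_dur = cfg.get("max_duration", 30.0)       # seconds
    max_phones = cfg.get("max_phoneme_length", 256)

    kept = []
    for entry in metadata:
        if not (min_dur <= entry["duration"] <= max_dur):
            continue
        if len(entry["phonemes"]) > max_phones:
            continue
        kept.append(entry)
    return kept
```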

Transcribing LibriLight-6K went fine, a little too fine. I only realized, after the fact, while cleaning up for disk space, that I fucked up my pre-processing of the LibriLight-6K dataset and neglected that a lot of book folders had more than one piece of audio, so when I ran the script to rename them to `${speaker_id}_${book_id}`, it would overwrite things. That would explain how transcribing it all took only a day and a half, compared to the donated audiobooks taking a few days to process through.
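
The overwrite came from every file in a book folder mapping to the same `${speaker_id}_${book_id}` name; a sketch of the collision-free version, assuming the usual unpacked LibriLight layout of `{speaker_id}/{book_id}/*.flac` (paths and naming here are illustrative):

```python
# Sketch of renaming LibriLight files to a LibriTTS-ish scheme without clobbering
# books that contain more than one audio file. Assumes an unpacked layout of
# ./librilight/{speaker_id}/{book_id}/*.flac; adjust paths to taste.
from pathlib import Path

SRC = Path("./librilight")
DST = Path("./voices")
DST.mkdir(exist_ok=True)

for speaker_dir in (p for p in SRC.iterdir() if p.is_dir()):
    for book_dir in (p for p in speaker_dir.iterdir() if p.is_dir()):
        # enumerate() gives each file in the book a unique suffix, so a book with
        # multiple flacs no longer overwrites itself under one shared name
        for idx, flac in enumerate(sorted(book_dir.glob("*.flac"))):
            target = DST / f"{speaker_dir.name}_{book_dir.name}_{idx:02d}.flac"
            if not target.exists():
                flac.rename(target)  # or shutil.copy2 if the originals should stay
```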

Regardless, this partial LibriLight-6K (without pruning for text and duration lengths) brings the dataset to:

  • a measly 3605 speakers ( +1166 speakers )
  • 3794963 samples
  • 4753 hours ( +1281 hours )
  • a very inconsistent ETA

I'm not too sure if I should bother going back and re-processing LibriLight-6K (from scratch, or at least intelligently picking out the folders that did have multiple files in them), or just suck it up, as this next training "test" is mostly about the better phonemizing method (not phonemizing normalized, punctuation-stripped text), plus having some more speakers to play with for zero-shotting. But oh well, I'll see how it goes.

Author
Owner

mmm... maybe I was a little too hasty to get training back up again. Not only did I have the partial LibriLight-6K, I also forgot I wanted to:

  • re-enable Vocos.
    • I'll cry about it, it's just for the evaluation / validation audio.
  • actually have the state of the dataloader saved, or at least store the seed and the current iteration when checkpointing, so I don't feel bad for killing the trainer and having "partial" epochs (a rough sketch of the idea is after this list).
    • I'll just keep crying about it.
  • wrap the evaluation / validation in a try/catch block, because I noticed in the previous training run and during inferencing that it would throw an exception a very small amount of the time; nowhere near often enough to be consistent.
    • I'll just cry about it when it happens.
  • pad the reference audio in the dataloader, but in theory I could always just pad when combining lines at inference time.
    • I think when preparing LibriLight-6K I did modify my slice times to be a little better about not slicing too tight, so that portion might be fine.
    • I honestly don't think this is a big deal, as it's only an issue when combining utterances.
  • fuck around with RetNet.
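
On the dataloader point above, the idea boils down to checkpointing the shuffle seed plus how many samples were consumed, then fast-forwarding on resume; a minimal sketch of the idea, not the fork's actual trainer code:

```python
# Minimal sketch of making "partial" epochs resumable: checkpoint the shuffle seed
# and the number of samples already consumed, then skip that many on resume.
# Illustrative only, not the fork's actual checkpointing code.
import random

class ResumableSampler:
    def __init__(self, dataset_len, seed=0, consumed=0):
        self.dataset_len = dataset_len
        self.seed = seed
        self.consumed = consumed  # samples already seen this epoch

    def __iter__(self):
        order = list(range(self.dataset_len))
        random.Random(self.seed).shuffle(order)   # deterministic given the seed
        for idx in order[self.consumed:]:         # fast-forward past what's already done
            self.consumed += 1
            yield idx

    def state_dict(self):
        return {"seed": self.seed, "consumed": self.consumed}

    def load_state_dict(self, state):
        self.seed, self.consumed = state["seed"], state["consumed"]
```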

As for the latter, I'm (finally) dipping my toes into the intricacies of the model and, unless I'm neglecting something, it seems all I really need to do is slot out [this "Block"](https://git.ecker.tech/mrq/vall-e/src/commit/47076c11df1a3aa2478b0ab58d1ced5ccea0fee9/vall_e/vall_e/base.py#L206) (which just looks like the transformer-y bits) and supplant it with a [RetNet](https://github.com/Jamie-Stirling/RetNet/blob/main/src/retnet.py#L6) (this implementation looks rather clean and not [boilerplated to high hell with faux-DeepSpeed isms](https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/retnet.py#L199) like the official M$ implementation). The beauty of the [Jamie-Stirling/RetNet](https://github.com/Jamie-Stirling/RetNet/) implementation is that, just like how the original [enhuiz/vall-e](https://github.com/enhuiz/vall-e) implemented his own transformer-block stuff, I can easily custom-tailor it to use shit like AdaLN and whatnot.
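
Structurally, the swap amounts to keeping the pre-norm residual wrapper and changing what sits inside it; a very rough sketch of the shape of it (`retention_layer` is a stand-in for an actual MultiScaleRetention-style module, not code from either repo):

```python
# Very rough sketch of "slot out the Block": keep the pre-norm residual structure,
# swap the attention sublayer for a retention-style module. `retention_layer` is a
# placeholder for a real retention implementation, not code from either repo.
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads, use_retention=False, retention_layer=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        if use_retention:
            assert retention_layer is not None, "pass in a retention module"
            self.mixer = retention_layer          # e.g. a MultiScaleRetention-like module
        else:
            self.mixer = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.use_retention = use_retention

    def forward(self, x, attn_mask=None):
        h = self.norm1(x)
        if self.use_retention:
            h = self.mixer(h)                     # retention builds its own decay mask
        else:
            h, _ = self.mixer(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + h                                 # pre-norm residual around the mixer
        x = x + self.ff(self.norm2(x))            # pre-norm residual around the feed-forward
        return x
```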

Unfortunately, I'm kinda gonna be stuck with training the model for the next few days; I might as well see it through and see if my adjustments mattered. On the plus side, 5% of the way through the epoch (and 10 hours in), it's already at AR accuracy=~66% and NAR accuracy=~44%. I don't know if this is because I accidentally upped the LR from 2.5e-4 to 3.25e-4 so it's training faster, or because actually adding in punctuation helps, but I'll take what I can get.

Author
Owner

For real, this should be my last update for a while unless something catastrophic happens (I doubt it), but I figured this should be its own block, as this is more about integrating RetNet.

I bit the bullet and spent some time cramming the previously mentioned RetNet implementation into my fork. It... works. Everything seems to be in order, but it's missing some of the base tech like:

  • PreNormResiduals
    • I did add back in AdaLN instead of plain LayerNorm for the NAR, so I suppose that's fine.
  • SinusoidalEmbedding
    • I just have to cross my fingers and hope that XPOS can cover the load.
  • Attention masking and whatever special sauce is in the attention implementation
    • retention seems to make its own masks, so shrug I guess.

I could preserve the PreNormResiduals and SinusoidalEmbedding by replacing just the Attention portion with a MultiScaleRetention instead, but there was some argument chicanery when I first tried it (desu though, I didn't have a good grasp on it yet).

For any weirdo interested in cramming RetNet into some other project similar to this VALL-E implementation, I did have to make some slight modifications to the RetNet implementation:

  • you'll need to cast `retention._get_D`'s outputted tensor to the right dtype and device.
  • there's no checkpointing of the retentions during the forward pass, so you will OOM easily (the general fix is sketched after this list).
  • passing the other variables from the original implementation (causal => recurrent (unused), dropout, norm_type, quant levels for NAR's AdaLN).
  • using GELU instead of the SiLU.
  • adding dropout.
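
On the checkpointing point above, the general fix is to wrap each retention block with torch.utils.checkpoint during training so activations get recomputed on backward instead of being held in VRAM; a rough sketch of the pattern (the block stack here is illustrative, not the fork's actual module):

```python
# General pattern for activation-checkpointing a stack of blocks in the forward pass:
# recompute each block's activations on backward instead of keeping them resident,
# trading compute for VRAM. The block stack below is illustrative.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            if self.training and torch.is_grad_enabled():
                # use_reentrant=False is the recommended mode on recent PyTorch
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

# usage with throwaway blocks, just to show the wrapper in action:
stack = CheckpointedStack([nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(12)])
out = stack(torch.randn(2, 64, 512))
```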

The other benefit seems to be that it has significantly recovered some VRAM for training (down from a tight ~11.5GiB to a nice ~8GiB with the fullsized model, so I can now up my batch size again).

For any poor soul training a model at home, you can enable RetNet by setting `use_retnet: True` in the training YAML, although I hope you aren't, since I did modify the tokenizer map with my new dataset's punctuation.


Thanks for your great work! Is it possible to train a vall-e model with a single RTX 3060 (12GB)? I don't care how long it will take for training.


You're not just a programmer, you're a genius. And we're very fortunate to have someone so open and engaging as you. You have so much enthusiasm and put so much of your energy into this project - it's amazing. Hopefully once you have mastered cloning voices, you can chill a bit.

Author
Owner

I genuinely had to do a double take when I woke back up and saw the AR already this far after just 6 hours; I imagined it would take an astronomically longer time: ![image](/attachments/96aeeb12-c259-441c-832e-e0158e1f5f84)

Although... the NAR seems a little lacking. I wonder if the included AdaLN is actually bunk, and that's why it's been the one to suffer a bit in terms of training this entire time. There was a comment in the AdaLN function from the original dev of the implementation mentioning something like that, but I didn't expect it to mean it's currently wrong.

I might pivot to the quarter sized model with how fast it trained, since I think training the rest of this epoch would be rather silly.


~~And catastrophe struck. Despite testing it with the little mini trainers for each model, the evaluation / validation process and inferencing broke. I'll have to see what I can do about it, but I have pivoted to the quarter sized model to train without AdaLN for the NAR and see how that goes.~~

I think I fixed inferencing? Not sure what happened. I also had a few little gremlins, but I doubt they caused it. Still going to try a quarter sized model without AdaLN for the NAR instead. The output from the test run, though, was quite dreadful, I imagine from the NAR being bad.

Also, I don't know if it's just me being forgetful, but the inferencing times seem... quite dreadful. It seems fine during evaluation / validation in a batch, but just generating one line with the RetNet feels worse than I remember it being with the normal Transformer.


> Is it possible to train a vall-e model with a single RTX 3060 (12GB)? I don't care how long it will take for training.

Mhm. My misfortuned 4070Ti has 12GiB, and it's been able to handle it fine; not as much as I'd like, but it definitely works. I'm not sure about the speed difference between Ada and Ampere, but I can't imagine it being that much of an issue.

> And we're very fortunate to have someone so open and engaging as you. You have so much enthusiasm and put so much of your energy into this project - it's amazing. Hopefully once you have mastered cloning voices, you can chill a bit.

Nah, I still need to clean up (rewrite) AIVC's code from how much of a mess it ended up as, and probably also rewrite my VALL-E fork, as there are a lot of bits I ended up just not really needing.


Just wanted to say, I love what you're doing and your detailed updates. I wish I could do something similar, but I have my day job which gets in the way.
How are you able to juggle this with work and other responsibilities?

Author
Owner

Had to do some needed cleanup with the config class and YAML, so I did that, and I can now easily pivot between models with different RVQ bin targets and whether they use RetNet or not. I also found that my gut feeling of just slotting out the Attention for Retention would have been the easiest approach, since it looks like the original "Transformer" bits do relatively the same thing as the RetNet (push through the *tention, then the feed-forwards, each with their own layer norms). And this solved my issues with inferencing / evaluation / validation output inconsistently failing.
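
To illustrate what "pivoting" means on the config side, it ends up being roughly a set of per-model knobs like these; the field names below are guesses for illustration, not the fork's actual schema:

```python
# Illustrative-only sketch of a per-model config that can pivot between RVQ bin
# targets and transformer vs. RetNet backbones; these field names are assumptions,
# not the fork's actual config class.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str             # "ar" or "nar"
    rvq_bins: int         # how many RVQ levels this model governs
    dim: int = 1024
    heads: int = 16
    layers: int = 12
    use_retnet: bool = False

models = [
    ModelConfig(name="ar",  rvq_bins=1, use_retnet=True),
    ModelConfig(name="nar", rvq_bins=7, use_retnet=True),
]
```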

The quarter sized model is also blasting away really fast, so I have my hopes for it turning out good.


> How are you able to juggle this with work and other responsibilities?

Very carefully. I feel like I'm barely able to pivot between them all, but yesterday and today I've just spent my time at night poking at it, since I don't think there's really a better time to do so at the moment.

Author
Owner

I feel some slight uneasiness.

I feel like there's something I'm missing with the implementation.

  • I probably shouldn't be using the sinusoidal position embedding before the inputs get passed through the retnet, since the retentions have their own xpos positional embedding, so it seems a little superfluous to do so.
    • disabling the sinusoidal positioning had the loss spike a bit before it normalized, so I imagine just from that it had an effect. The evaluation / validation also seems to be generating longer before hitting a stop token, so I hope it did fix things.
  • While the attentions did make use of the mask when preparing the initial input (masking away padding), the retentions do their own masking and don't seem to make use of it. I'll see if I can just plop it back in, but the mask gets used anyway after running it through the classifier at the end of the chain, so it might not matter.
    • although the output from the retention gets multiplied by the mask anyway. I think the only side effect of leaving it in for the retention is that it'll retain anything padded, but that gets culled anyway.
  • AdaLN vs normal LN for the NAR doesn't seem to impact anything.
  • I think the AR is correct? I haven't listened to raw AR output in a while, but I think my only qualm is that it seems to be generating them too short, as in, a stop token is reached a little too quickly. I don't know if I have an issue somewhere with the dataset loading, or the retentions (or implementation) are bunk.
  • the output sounds pretty shit. Despite the accuracy being """decent""" (decent, as in comparable to other model tests at similar accuracies/losses), it sounds like shit, as if it were actually at ~AR 40%/NAR 30% accuracies. I'm pretty sure it's just a matter of the model not being far enough through the epoch, but still, what's the point when the reported losses/accuracies say they're good?
    • additionally, some of the evaluation "loss" (aural loss) reported """decent""" values some of the time, but again, the outputs are significantly too short.
  • despite training on a quarter sized model, the progression of the loss seems roughly the same as the full size (although I don't know the extent to which the larger batch size and higher LR matter).
  • I can't really re-use a transformer-based (non-retnet) model I've had before. I'd have to make adjustments to allow per-model symmaps if I want to validate the AR and NAR separately with a known working model. But I think I have to do this to make sure things are working, as the outputs currently sound sort of correct, but I'm still not confident.

The upside is that training through the entire epoch on the quarter sized model should take 22 hours, so iterating on tweaks to the retnet implementation should be much, much faster. Although that's the issue: I don't know what should be considered "wrong" versus "just give it more time".


I guess I'll give it more time. The evaluation / validation output is sounding a little clearer over time, so I suppose everything is in order, just needing more time.


I am very positive my issue is actually that I need to use the specific recurrent (causal) forward pass rather than just naively reusing the existing AR forward pass, which would explain the discrepancy between a well-reported AR and the output being too short and shit.

My only qualm is that I really need to try and wrap my head around how to cram it in, since it requires keeping track of a separate list of values, and that isn't necessarily easy to do unless I explicitly pivot back to using the RetNet class itself rather than the wrapped `PreNormResidual` class that handles its own *tention + feed-forward shenanigans. Which sucks, because I just torched the previous model that was in the middle of being trained using the "full RetNet".

At least I can keep training the model, since the normal forward pass is fine.
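
For context on what the recurrent pathway buys: instead of re-running the full parallel forward over the whole growing sequence each step, each layer carries a small state forward one token at a time. A conceptual sketch of that loop (the `model.step(token, states)` interface and `model.num_layers` are assumptions for illustration, not either RetNet repo's actual API):

```python
# Conceptual sketch of recurrent AR decoding: each layer keeps a recurrent state that
# gets updated per token, instead of recomputing over the full prefix every step.
# model.step(token, states) -> (logits_for_next_token, updated_states) is assumed.
import torch

@torch.no_grad()
def decode_ar(model, prompt_tokens, stop_token, max_steps=1000):
    assert len(prompt_tokens) > 0, "need a non-empty prompt to prime the states"
    states = [None] * model.num_layers      # one recurrent state per layer
    out = list(prompt_tokens)

    # feed the prompt through once to build up the recurrent states
    for tok in prompt_tokens:
        logits, states = model.step(tok, states)

    # then generate one token at a time, reusing (and updating) the states
    for _ in range(max_steps):
        next_tok = int(torch.argmax(logits, dim=-1))
        if next_tok == stop_token:
            break
        out.append(next_tok)
        logits, states = model.step(next_tok, states)
    return out
```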

Author
Owner

Ugh. I suppose my current training needs to be scrapped, since it turns out my "just replace the Attention with Retention, it'll be fine, it did fix inferencing" approach to integrating the RetNet is inherently bunk.

I wrangled with trying to """properly""" do the forward pass for the AR using the provided `forward_recurrent` routines, but no matter what, the output with the partial RetNet (non-`full_retnet`) always produced repetitive garbage.

However, I pivoted back to using `full_retnet` (it replaces the layers of PreNormResiduals wrapping around the Attention/Retention + feed forwards, which, in theory, is effectively what's done anyways in the RetNet class), and with the tiny trainer, it now sounds right AND inferences without issues. ~~It seems to work fine both with the provided `forward_recurrent` and naively without, so I really don't know the issue.~~ However, outside of the tiny trainer, it'll consistently return zero-length output.

Peeved, since I have to scrap the current model again. Oh well.


Turns out masking the output from the classifier at the end, mixed with the RetNet integration, is bad, and using the provided `forward_recurrent` (the non-naive pathway) is also bad, as both produce wrong output. ~~Scrapping the test model again.~~ I might be able to at least re-use the training I had going over the night, since it's just the last step that's "wrong".

The evaluation / validation output sounds fine given how little it's trained: it seems that with RetNet, it can copy the acoustics pretty fast, hence the accuracy getting pretty high pretty fast, but it still needs to learn how to actually speak.

Fingers crossed.


Alright, I think I got things figured out.

  • I'm not going back to try and do the "replace the attention with retentions" approach with the rest of the transformer, since that's inherently flawed.
    • the generated AR output is inherently flawed, as it just repeats, regardless of the "naive" approach or the provided recurrent forward.
  • replacing the transformer with the retnet requires a little bit of elbow grease and not multiplying by the mask when merging the tokens=>embeddings lists.
    • It works, but when inferencing, any batch size over 1 is prone to issues.
  • I bit the bullet and implemented microsoft/torchscale, and it resolves the above issue of issues when using bigger-than-one batch sizes. However:
    • documentation, in typical corpo-FOSS fashion, is extremely lacking.
      • there's "examples" but it's just effectively boilerplate wrappers for LLaMA model sizes.
      • there's quite a lot of unused arguments, for both the constructor and the forward pass.
      • the primary argument being named `prev_output_tokens` is... oddly named, to put it very nicely.
        • if no `token_embeddings` is provided (your token=>embedding), it'll create it for you (nice, I guess, for non-merged sequences).
        • you need to pass a tensor of at least `(b, t)`, because it derives the output sequence length from it. Heaven forbid it be a named argument like `token_embeddings` to avoid having to craft another tensor, but I think I can cheat by having it sized but empty, or on the CPU, since all it does is check its size (a toy sketch of this calling convention is after this list).
    • the loss hasn't gone down as fast as with the Jamie-Stirling/RetNet implementation (the accuracies started slowing down at AR 67%/NAR 45%), but I think it's for the best, as it means it's not training too fast.
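
Going purely off the behavior described above (so treat the exact signature as an assumption, not torchscale's documented API), the calling convention looks roughly like this; the toy decoder below just mimics "only the shape of the first argument matters when token_embeddings is given":

```python
# Self-contained sketch of the calling convention described above: the first positional
# argument is only inspected for its (b, t) shape when token_embeddings is provided.
# This mimics the described behavior; the real torchscale signature may differ.
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, prev_output_tokens, token_embeddings=None):
        b, t = prev_output_tokens.shape[:2]    # only the shape is actually needed here
        x = token_embeddings if token_embeddings is not None else self.embed(prev_output_tokens)
        assert x.shape[:2] == (b, t)
        return self.body(x)

b, t, d = 4, 128, 256
merged = torch.randn(b, t, d)                  # already-merged text/prompt/resp embeddings
dummy = torch.zeros(b, t, dtype=torch.long)    # cheap (b, t) tensor passed just for its size
out = ToyDecoder(dim=d)(dummy, token_embeddings=merged)
```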

I think, right now, RetNets aren't meme snakeoil, but golly was it a pain with a lot of red herrings. At the very least, this endeavor has taught me a lot more about how the VALL-E implementation works, and I think I'm now confident in writing my own implementation from scratch* (naturally, reusing the transformer/retnet bits). It's rather quite robust.

I'm also not sure if RetNets are just inherently better at "learning" a style while needing some time to learn language, but I think it's just me forgetting that the previous transformer-based models also started by learning a style first, with the language emerging after quite a while. In addition, I'm training a quarter sized model just to validate how it fares after an epoch, so I'm sure things will get much better with the full sized model.

I think finally I can just let the damn thing train and not freak out or panic. I suppose this gives me a couple of days to train the quarter sized model through a full epoch, then pivot to the full sized one and see what comes of that after an epoch. There don't seem to be any more inherent flaws with the RetNet, and whenever I get a chance I can see about fiddling with chunkwise recurrent forwards, as I think that would be a very neat way to speed up inferencing with the AR.

Author
Owner

Pain.

I kept going "maybe I should use the Jamie-Stirling/RetNet implementation, it just seems to run a little faster and nicer, and the loss seemed to be really low, the only issue is inferencing with the AR, and I can work around that", and trained a model using that as the backbone.

I was wrong.

I might have to poke at it more when I'm not borderline fried, but I think what's happened is that the AR wasn't actually wanting to be an AR. This would explain why it would generate very short sequences, and why the loss was extremely low, I suppose. I feel like every time I poked at it to get something right, my understanding was wrong, and it didn't actually end up being fixed somehow, despite extensive debug prints and tiny test trainings.

The microsoft/torchscale implementation, ironically, just works without any more headaches. It's still much lighter and faster than the previous transformer stack, but...

  • I think the only concern would be for the NAR, because it's using normal LayerNorm and not AdaLN. A NAR that governs more than 1 RVQ bin should benefit from AdaLN, but I don't know how to have it utilize that without hacking it into the implementation (a minimal sketch of the AdaLN idea is after this list).
  • it's still rather untested. It seems to do acoustics/style copying fine, but I haven't gotten far enough to see it "learn" speech yet. I don't know if I should test that with a quarter sized model or suffer with a full sized model.
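
For reference, the AdaLN trick being talked about here is just a LayerNorm whose scale and shift come from an embedding of the RVQ quantizer level, which is what lets a single NAR condition its normalization on which codebook it's currently predicting; a minimal sketch of the idea, not the fork's exact code:

```python
# Minimal sketch of AdaLN keyed on the RVQ quantizer level: a plain LayerNorm whose
# affine scale/shift are predicted from an embedding of the level. Illustrative only.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, d_model, num_levels):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Embedding(num_levels, 2 * d_model)
        nn.init.zeros_(self.to_scale_shift.weight)   # start out as a plain LayerNorm

    def forward(self, x, level):
        # x: (b, t, d); level: (b,) long tensor of quantizer indices
        scale, shift = self.to_scale_shift(level).chunk(2, dim=-1)   # each (b, d)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

x = torch.randn(2, 10, 512)
levels = torch.tensor([3, 5])
y = AdaLN(512, num_levels=8)(x, levels)
```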

Now, I'm not knocking the previous RetNet implementation. I think it's very lightweight and very neat. I want to use it, but without an example of how to actually utilize it, I'm not confident enough to keep trying and wasting even more days I could be using to train. It could very well just not be mature enough, and it might not even be a me issue.

But oh well. I burnt another week, but I did learn more about how the VALL-E implementation I forked works. I feel stupid, since there's a lot more control in how I can go about my own implementation when I write it from "scratch", like using the text/mel loss split that's in DLAS/TorToiSe, since (and it feels silly in hindsight) the model actually can learn the input text sequence too, so it'd be kinda imperative if I want to try and finetune another language on top of it.


As an aside, I bit the bullet to give P*p*rsp*ce a try, and it's slightly faster than my 4070Ti, so I suppose I'll have it train something alongside my local training.

Now, I am getting an itch to try my 6800XTs with the RetNet implementation, since the paper boasted that it was using Instinct cards; although CDNA is quite different from RDNA, it doesn't hurt to see if RetNet favors ROCm over CUDA.

Oh well. I'll just shut up for real this time and let the damn model train properly and not try and take shortcuts.

Author
Owner

Midweek progress report, since things are going somewhat swimmingly, somewhat.

Retnet:

  • Immediately after going "I won't try and wrangle the Jamie-Stirling/RetNet implementation, it's just not going to work", I tried again, and it won't work for the AR. Any further investment into it will just be a waste of time. Again, I think it's very neat, and I want it to work out, but I can't make it work out in my favor.
  • On the flipside, the microsoft/torchscale RetNet implementation seems to be holding up fine... partially. There's zero documentation on how to actually make use of the "kv-cache" (previous recurrent state) without digging through an example extended class for FairSeq, and documentation for FairSeq is buried under some esoteric G**gle results pointing to some Markdown file within the FairSeq repo. I'm so sick and tired of corpocucks being so deathly allergic to proper documentation and proper example implementations. Everything's so fucked.
    • I have yet to actually see the inference improvements from making use of the `incremental_state`. I wouldn't be surprised if it just didn't actually work. Using the tiny trainer for the AR on the CPU showed no significant uplifts, I can't be assed to restart the quarter sized model training locally, and I keep forgetting to check on the full sized model training on P*p*rsp*c*e.

Speaking of P*p*rsp*c*, I almost fell for the allure of actually paying per hour for a card with zero hassle, since the only problem with trying to train the fullsize model there is that only about 4% of the epoch gets trained per 6-hour session. But I don't think I should be throwing $300 per epoch at a service that has cucked me at least three times now, so when I feel the full sized model needs some love over the quarter sized model, I'll pivot to that locally and suffer the ~5-or-6 day ETA.

Model wise:

  • the quarter sized RetNet-based model seems to be exceeding my expectations compared to the previous tries: it took about an epoch-and-a-half's worth of training (I think two and a half days) for it to start to try and resemble speech. It was pretty decent at cloning the styles it already knew (the dry text recitals), but the hurdle is getting the model to "learn" how to actually speak words. I'm impressed, given how absolutely small the model is.
    • however, as it stands now, it can only maintain speech for the first few words / a second before it falters. I'm hoping more training will correct this, and that it's not RetNet related. I honestly don't recall the transformer-based models having this type of "partialness" to them. I'd be very surprised if it's a RetNet limitation, since that would mean it cannot maintain long sequences. Again, I'll just need to wait and see.
  • the fullsize model, ironically, for the same amount of time trained, is getting there. It sounds fine from what I managed to get eval'd (a bug has it throw an exception after exporting the first audio clip; it should be fixed now, but the fix came after the latest session), but it's only at, I think, 30% of the epoch after probably eight 6-hour sessions? It sounds about how the quarter-sized model did a bit before it started to learn speech, so I'm a bit hopeful for when that does emerge.

Evaluation:

  • I'm not too sure where the improvement lies. On one hand, RetNet needs significantly fewer parameters than an attention-based transformer, so there are faster forward passes and fewer gradients to account for, meaning faster throughput + bigger batch sizes. On the other hand, it could very well just be the inherent uplift RetNets have. I would have to do some fancy metric comparisons (probably one for tokens processed) between the two to get a good idea.
    • And on the third hand, literature says that RetNets only really have an improvement over attention-based transformers at parameter counts over, like, 2B. The fullsized AR and NAR each have a parameter count of like 200M with a RetNet (I haven't gotten a chance to check the transformer-based parameter count, but I imagine at most it's 400M each). So, in reality (at least according to the literature), the RetNet actually should be performing worse apples-to-apples, but again, the uplift would come from having it train faster overall.
  • as things stand, RetNets work. I have yet to actually test the other fancy things like longer sequences and what-not, but the quarter sized model actually being able to show some speech at present gives me hope. It just sucks that there's nothing out there that's using a RetNet anyways. Oh well.

I'll just have to let the models bake and see where it takes me.

In the meantime, since I'm now very confident in knowing how exactly the implementation works (after doing a deep, nasty, painful dive into the internals of the model arch), I might finally go about "rewriting" my fork from "scratch" and make it my own (in reality, I'm probably just going to copy-paste and restructure things, since most of the code has already been combed through and modified heavily anyhow).

AIVC is also going to be rewritten... eventually. Every time I look at it, I'm reminded of how much of a mess I've left it in, and let it grow into. I'll have to throw it into a new repo, unfortunately, since such a huge change is just going to cause more problems.

I'll post samples whenever either model starts to output consistent speech. I foresee that happening by... Sunday? I hope.


> https://github.com/facebookresearch/audiocraft

Neato. I'm surprised they released weights for i-

> The provided AudioGen reimplementation follows the LM model architecture introduced in MusicGen and is a single stage auto-regressive Transformer model trained over a 16kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. This model variant reaches similar audio quality than the original implementation introduced in the AudioGen publication while providing faster generation speed given the smaller frame rate.
>
> Important note: The provided models are NOT the original models used to report numbers in the AudioGen publication. Refer to the model card to learn more about architectural changes.

I suppose for general audio it's fine, but again: general audio. 16KHz speech is going to sound awful, but at least it's 4 RVQ bins.

On the other side, that's quite interesting. A single AR for the 4 RVQ bins. I'm assuming it's that interleaved shit that I vaguely remember, which I suppose I could replicate, but I'm actually quite fine with an AR + NAR.
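
Assuming it is the MusicGen-style "delay" interleaving, the gist is shifting each RVQ level right by one extra step so a single AR predicts one token per codebook at every position; a toy sketch of that interleave (my reading of the approach, not audiocraft's actual code):

```python
# Toy sketch of the MusicGen-style "delay" interleaving that lets a single AR handle
# several RVQ codebooks: level k is shifted right by k steps and the gaps padded, so
# each timestep carries one token per codebook. This is my reading of the approach,
# not audiocraft's actual implementation.
import torch

def delay_interleave(codes, pad_id):
    # codes: (num_codebooks, T) -> (num_codebooks, T + num_codebooks - 1)
    k, t = codes.shape
    out = torch.full((k, t + k - 1), pad_id, dtype=codes.dtype)
    for level in range(k):
        out[level, level:level + t] = codes[level]   # shift level k right by k steps
    return out

codes = torch.arange(12).reshape(4, 3)    # 4 codebooks, 3 frames
print(delay_interleave(codes, pad_id=-1))
```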


Trying to get proper transcriptions right now for this repo.

I just made use of the openai-whisper package with the "tiny" model. Do you think that's sufficient? I see you're using whisperX for... timestamps? Are those needed before/after running the phonemizer? Thanks!



It depends... Whisper works best on English, and you can get away with a smaller model than you could for another language like Russian or Chinese... That said, I always use the largest model for the most robust transcription, especially if there are many audio files being transcribed. It's a PITA to go back through and check everything if you've got like 100+ files in the dataset.



Yeah, the problem with whisper is that it's hard to scale. Batch inference doesn't seem to affect inference times much, and I can't get multi-GPU to work...

I'm using the smallest model, and I have a feeling it's too low quality of a transcription.

That said, all I'm doing is getting the text transcription and nothing else. I'm a bit concerned that maybe I need timestamps or something, going by the history in this thread.

Author
Owner

> I just made use of the openai-whisper package with the "tiny" model. Do you think that's sufficient?

Depends. The bigger the model used, the better the transcription quality, but desu I've been using `small` for the past dataset batches, since a sufficiently large dataset will be resilient to any transcription problems (which are already sort of smudged out when phonemized).

But again, that's only with a sufficiently large dataset, and I still don't really know how big that is. As a reminder, my current dataset is clocked at 4065 hours of audio, and that's still orders of magnitude smaller than the original paper's 60K hours with LibriLight.

> I see you're using whisperX for... timestamps? Are those needed before/after running the phonemizer? Thanks!

This repo's web UI handles it fine with the Train > Prepare Dataset tab (or whatever I ended up calling it again). It'll handle the entire stack: transcribing with Whisper (or preferably, WhisperX), using the outputted timestamps to trim down utterances if requested, and exporting the quantized audio and phonemized text in a format that my VALL-E ~~fork~~ implementation's trainer can use, as long as you start the web UI with `--tts-backend="vall-e"`.

Shilling aside, there's a way to do it without the web UI as documented in the [README](https://git.ecker.tech/mrq/vall-e/src/branch/master/README.md#leverage-your-own), but I think it's a bit of a chore if you're leveraging Whisper, since you'll need to yank out the text itself.
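For reference, the manual route amounts to something like this minimal sketch with openai-whisper and phonemizer + espeak-ng (the path and the printing at the end are placeholders, not the exact layout the trainer expects):

```python
# Minimal sketch: transcribe one utterance with openai-whisper, then phonemize
# the text to IPA with phonemizer/espeak-ng. Paths and output handling are
# placeholders, not the trainer's actual expected format.
import whisper
from phonemizer import phonemize

model = whisper.load_model("small")                      # "tiny" trades accuracy for speed
result = model.transcribe("voices/speaker/utterance_001.wav")

text = result["text"].strip()
# Segment timestamps are available if you want to trim long utterances yourself.
segments = [(seg["start"], seg["end"], seg["text"]) for seg in result["segments"]]

ipa = phonemize(text, language="en-us", backend="espeak", strip=True)
print(text)
print(ipa)
```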

> Yeah, the problem with whisper is that it's hard to scale. Batch inference doesn't seem to affect inference times much, and I can't get multi-GPU to work...

From what I remember, Whisper doesn't have batch sizing. WhisperX originally only had batching with the VAD filter (which required an HF token), but it currently supports batching with its faster-whisper backend. I don't recall either having multi-GPU support.
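For what it's worth, the batched WhisperX path looks roughly like this, going off its recent faster-whisper-backed releases (signatures may differ between versions):

```python
# Rough sketch of WhisperX batched transcription with the faster-whisper backend.
# API per recent whisperx releases; version differences are possible.
import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("voices/speaker/utterance_001.wav")
result = model.transcribe(audio, batch_size=16)  # batched inference

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```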

> I'm a bit concerned that maybe I need timestamps or something, going by the history in this thread.

Nah, it's only if you're on an extremely tight VRAM budget and need your audio trimmed as tightly as you can get it. The original paper seems to be feeding in 20-30 second utterances, while I think the maximum Whisper will spit out is like, 16 seconds, before it segments the audio itself as best as it can.
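If you do end up trimming, it's just slicing the waveform by the reported timestamps; a tiny sketch, with placeholder timestamps and paths:

```python
# Tiny sketch: trim an utterance out of a longer clip using start/end timestamps
# (in seconds), e.g. the ones Whisper/WhisperX report per segment.
import torchaudio

waveform, sr = torchaudio.load("voices/speaker/long_clip.wav")
start, end = 3.2, 11.8                                   # placeholder timestamps
piece = waveform[:, int(start * sr):int(end * sr)]
torchaudio.save("voices/speaker/long_clip_0000.wav", piece, sr)
```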

I will give a bit of a caution: I still don't feel *quite* comfortable with my ~~fork~~ implementation being used to spin out models, only because I still need to get around to figuring out why distributed training won't work, and desu I think it'll be a little while before I can get around to it, given odd circumstances at the moment (I suppose I can bite the bullet and r*nt out a multi-GPU system for a few hours to test with, rather than going through the agony of throwing my 2x6800XTs back into my training system).


Speaking of qualms about recommending the VALL-E implementation for use: I can finally stop calling it a fork and comfortably call it my own implementation. I've done my rewrite, overhaul, and restructuring through almost every facet of the code, save for (in short, anything credited at the top of their respective files): the original transformer-based implementation, the dataset sampler code, the training loop that handles stdin and saving/evaling every X iterations, and some helper functions and boilerplate creature comforts.

I think the implementation (like my understanding of VALL-E now) has matured enough that I won't be breaking anything anytime soon, since the config YAML is cleaned up in a way I like it, the RetNet-based model seems stable enough, and anything pertaining to the dataset isn't going to change anytime soon (although I don't think it ever actually changed).

Additionally, I've set it up in a way that, if I wanted/needed to, I can pivot away from DeepSpeed and use a different framework, like HF's solution, or Lightning, or my own + BitsAndBytes (which I suppose technically is already in, I just need to extensively test it). As much as I like DeepSpeed, given that the model's largest preset size is around 400M parameters each, I don't think I need to leverage any of the ZeRO features.

I just hope that putting my faith in a RetNet for the current training pays off. I'm still rather impressed, but I'm holding my breath all the same.



Actually, I was only making use of the readme and repo. Where is the web UI?

Author
Owner

> Actually, I was only making use of the readme and repo. Where is the web UI?

This repo; the one originally for a web UI for TorToiSe.

Author
Owner

I'm so upset that I keep having to reset the epoch when training the full-size model.

  • The time last week was because I forgot to clean up the previous checkpoints and ate up the remaining disk space, about a quarter of the way through.
  • The time after that was because I only just so happened to catch *why* the reported model loss was never quite the same as the reported `loss.nll` (it was also summing the precision and accuracy metrics into the loss), which I didn't notice until I used my training framework for a ResNet-based side project, about another quarter of the way through.
  • The time *after* that (the most recent time), Windows decided to freeze what was drawn in every window, causing me to close them all and re-open them. Training was still fine after that, but I got a little too antsy and tried to suspend + disown + re-own with reptyr, and I guess it just suddenly died, crashed, and burned after a while, about a third of the way through.

I just want one full clean epoch to train against.

The evaluation output is sounding *better* in terms of actual speech, but it's having a bit of a hard time capturing the style right; I'm going to cope and say it needs more time because of the botched epochs (I should really look into finding a way to "resume" an epoch exactly from where it was last left off). I'll probably share the outputs again in a few days, or when it starts sounding fine.

I'm very confident the RetNet-based approach is perfectly usable; it's just that, with all the headaches and hurdles, getting to a """decently""" trained model is quite the pain.

Quite the pain.


Does the RetNet approach seem better / more data-efficient, or is it better to use the original vall-e implementation?

Also, I am using the phonemizer, but it keeps coming up with None values occasionally (I guess there are some tokens or substrings it's not expecting?). I modified it to just delete / ignore the None entries, but maybe that's a bad idea?

Author
Owner

> Does the RetNet approach seem better / more data-efficient, or is it better to use the original vall-e implementation?

It's a bit complicated.

desu I tried to express my sentiments, but both:

  • my brain turned to mush after trying for half an hour
  • I can't confidently draw any conclusions right now, and can only really go off what the paper says.
    • the transformer-based model was probably trained a little too perfectly, while the RetNet based model has had misfortune after misfortune interrupting its training, so it's all mushed up.
    • ignoring the above, the datasets aren't similar enough, so it wouldn't be a fair comparison anyway.
    • despite being biased to want a RetNet to succeed, I'm a bit biased against it as the output still sounds a little bad.

I'll probably give better thoughts whenever (if ever) the model gets to a good point.

> Also, I am using the phonemizer, but it keeps coming up with None values occasionally (I guess there are some tokens or substrings it's not expecting?). I modified it to just delete / ignore the None entries, but maybe that's a bad idea?

I don't quite follow, since I haven't had any output issues with the phonemizer (technical issues before, yes, but not outright output problems).

But when I get a chance I'll see about looking into it.

Author
Owner

mmm. Maybe I'm being a bit of a baby about it. Giving another listen to the most recent evaluation / validation output and listening to the reference for direct comparison, the current, RetNet-based model sounds about the same in "cloning" quality as the transformer-based model did (from what I remember): [samples](https://files.catbox.moe/jaw4cf.7z) (17250 is from a few days ago, 32750 is right now).

  • despite the current, RetNet-based model not being quite as trained as where I left off the previous, transformer-based one.
  • despite the current, RetNet-based model having a lower average loss / higher average accuracy than where I left off the previous, transformer-based one.
    • I can write this off as the actual raw audio quality + the acoustics being preserved very well, I feel.

It definitely has a lot more room for improvement.

  • the zeroshot-ness of the model is still bad; a lot of voices aren't quite right.
  • there's some speech issues that come through, but I write that off as being able to be fixed in due time.
  • the audio quality is still meh; a NAR that outputs more RVQ bins (3 onwards) can resolve this, and I imagine even stapling on a previous one of my NARs will help. But, given this is only 2 RVQ bins with vocos, it's still impressive.

I suppose that, empirically, a RetNet-based model outperforms a transformer-based model, as technically right now the RetNet-based model has less training than the transformer-based model did.

I still don't really have any inferencing tests, like how fast it performs in comparison, how well it can extend past the "context limit" of ~12 seconds it was trained on right now, etc. (the things a RetNet boasts).


But of course, the next batch of evaluation / validation output, without listening to the reference, sounds pretty shit.

Author
Owner

ETA 30 more hours until a full epoch is fully processed (despite the trainer saying it's currently 1.7 epochs in from the previous runs), but the model still seems to be better paced than the previous transformer-based one. Some notes that are quite silly in hindsight once more (and [samples](https://files.catbox.moe/utd89p.7z)):

  • about two days ago I had a gut feeling to try and drop the LR an order of magnitude, as the losses and accuracies weren't changing all that much, and barely averaging out. Doing so did help significantly with the consistency of speech sounding fine, and a bit of the "clone-ability" (nothing major, but it did help).
    • I imagine the next epoch of training would greatly benefit from dropping the LR another order of magnitude to help smooth things out. I think my biggest problem is not knowing when to keep my LR as high as possible and when to drop it, as evident from my botched trainings where the scheduler would decay way too soon and too low, and the opposite problem where I'll stay focused on having as high of an LR as comfortably possible before the losses spike. The only issue, though, is that I would need to manually adjust the LR mid-training (which is fine, I added a command to the training loop to let me adjust it), since I don't think I can rely on a scheduler to do so. There's one that seems to drop the LR if it detects no changes in the loss, but it seems to only work per-epoch, and I need something that does it mid-epoch.
  • I think one of my main gripes right now is that the smaller, non-LibriTTS voices are kind of shite, as the LibriTTS/LibriLight voices are what's consistent. Even the donated audiobooks have consistency issues it seems, and the vidya shit is definitely weak too, at least from whatever passes through the evaluation pass. I imagine this has something to do with the sampler method trying to balance per speaker, and there being lots of LibriTTS voices in comparison to everything else, so everything else won't be visited anywhere near as often. I do remember past trainings being able to clone the smaller part of the dataset decently over the epoch, so who knows.
    • I also imagine, once again, that modifying how the dataloader works would improve things after the initial epoch (or two). I'm very sure I mentioned it once or twice, but a dataloader that samples a speaker first and then samples from that speaker's pool of utterances (with replacement or not) would be the better pick to help with zeroshot voice-clone-ability, as it would balance out across different speakers rather than "oh oops, guess these speakers had their data already exhausted, you won't be seeing that speaker for a few days in training", since it's the speakers that get sampled without replacement first and foremost with this method (see the rough sketch after this list).
  • Playing around with using the trainer for [other](https://git.ecker.tech/mrq/resnet-classifier) endeavors, I might have found out the issue with distributed training. Seems I just need to use the right [sampler](https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html#distributing-input-data) so it'll actually provide unique batches per GPU, which I think was the issue I saw? Regardless, if the stars align and this training session ends in a timely fashion, I'll go back and throw my 6800XTs into my training machine and fiddle with it.
    • I say "aligns right", because the last time I tried putting my 6800XTs back into it, it was a two hour, sweaty ordeal. I also don't know if I'm going to go back into testing finetuning again. I also don't know if I'll pivot back to the quarter sized model and see how well it responds to just dropping the LR and going from there.
  • Also, while playing with that other endeavor, it seems that I found an issue with inferencing, where I forgot to set the models into eval mode, and inferencing was severely gimped for that endeavor. Although, I imagine smaller models like in that instance are much more susceptible to issues from dropout, while larger models are much more resilient to it. Still, it's a factor that probably caused my inference tests in the web UI to sound bad.
  • And as a side note in my wandering thoughts: one thing that, again, seems obvious in hindsight, which I noticed when digging deeper into the implementation's model code (which, in hindsight again, seems very, very rudimentary, as it's just a transformer/retnet, nothing fancy), is that I *could* implement an analogue to TorToiSe's `random` voice option by either:
    • during inference, starting with just the text prompt, having the model sample out a new input prompt, and continuing to sequence through until a stop token in the response. This, in theory, works since the model *will* be able to spit out a sequence from nothing anyways. This might not work, since there's no loss calculated for the input prompts, so who knows how well it can "learn" from there, like it "learns" from the text prompt, as losses are calculated for that.
    • just feeding it an empty input prompt and crossing my fingers it also works that way.
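To illustrate that speaker-first sampling idea from above, a rough sketch (the `utterances_by_speaker` mapping and the sizes are placeholders, not my actual dataloader):

```python
# Rough sketch of "sample a speaker first, then an utterance from that speaker".
# Speakers are drawn without replacement (reshuffled once exhausted); the
# utterance itself is drawn with replacement. Names here are placeholders.
import random

class SpeakerBalancedSampler:
    def __init__(self, utterances_by_speaker, num_samples):
        self.utterances_by_speaker = utterances_by_speaker  # dict: speaker -> list of dataset indices
        self.speakers = list(utterances_by_speaker.keys())
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __iter__(self):
        pool = []
        for _ in range(self.num_samples):
            if not pool:  # refill and reshuffle the speaker pool once it's exhausted
                pool = random.sample(self.speakers, len(self.speakers))
            speaker = pool.pop()
            yield random.choice(self.utterances_by_speaker[speaker])

# e.g. DataLoader(dataset, batch_size=8, sampler=SpeakerBalancedSampler(mapping, len(dataset)))
```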

There's just one more thing I don't quite understand. From the homebrewed [lifeiteng/vall-e](https://github.com/lifeiteng/vall-e/tree/main/valle) implementation, I remember the training was something like "4xA100s for four days and 100 epochs". I *feel* my previous transformer-based model and current RetNet-based model have performed much better in much less compute time. I don't know if this is just a testament to my "optimizations" (tweaks) contributing to a much better throughput, or a testament to (what I imagine is) the crux of every "researcher": throwing oodles of compute at any task and just bruteforcing through it.

Oh well. I just hope I do get enough time to do whatever I do expect to do (finetune test, quarter sized retrain, muck around with distributed training, etc).

Author
Owner

Alrighty, the full epoch finished, and I was able to finally get off my ass and put my 2x6800XTs back into my training rig to muck around with getting distributed training working. The whole time, it really was just that I needed to use `DistributedSampler` and make sure the corresponding batch received was on the right device. Everything *seems* to be in order, as the total iteration count aligned with being half of what it was before with one device. Woo.
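For reference, the fix boils down to something like this minimal sketch (assuming torch.distributed is already initialized and batches are dicts of tensors):

```python
# Minimal sketch: DistributedSampler for unique per-rank batches, plus moving
# each batch to that rank's device. Assumes torch.distributed is initialized.
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, rank, world_size, batch_size):
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler, pin_memory=True)
    return loader, sampler

# loader, sampler = make_loader(dataset, rank, world_size, batch_size=8)
# device = torch.device(f"cuda:{rank}")
# for epoch in range(epochs):
#     sampler.set_epoch(epoch)  # different shuffle each epoch
#     for batch in loader:
#         batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
#         ...
```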

On the flipside, what a nightmare. First, finally diagnosing why Xorg didn't want to launch when using an AMD card (because Xorg.conf was set to explicitly load Nvidia drivers; I suppose when I had `mhwd` install them, it also configured Xorg.conf "proper"), and then remembering that torch 2.0+ *still* doesn't work with ROCm, as I'll keep getting `nan`s in the forward pass. Going back to torch 1.13.0 fixed this, but with a nasty performance hit. It'd be nice to use newer torch against newer ROCm, as my iteration rate now is 18s/it for an effective batch size of 32. Woo.

There were also some fixes like making sure exporting actually worked cross-device, and inferencing works again, as I imagine I broke that too earlier.


As a side note, I'm still not really sure what to do after this point. I think I should go and throw my shit onto a rental machine again and eat another epoch or two from there. Using P*p*rsp*c* to train in 6 hour intervals was bad, but it might be fine if I did pivot to my proposed idea of "have the dataloader sample against speakers themselves, and then randomly pick the target from the speaker's pool", since the major issue was that an interrupted epoch is very le bad.
But since I have my 2x6800XTs in my training rig again, I did kind of want to muck around with LLaMa again but for real on a GPU with exllama and not painfully on my personal rig with tight system RAM, despite it being DDR5.

Oh well, I think for the night I'll let my 4070Ti rest a bit, although it doesn't seem to break 50C during training anyhow. It might benefit from a modest overclock, if anything one for memory, since I imagine it's memory bandwidth limited.

Author
Owner

It seems the ROCm + pytorch 2.0 issues derive from an issue with GPUs whose PCIe lanes go through the chipset rather than the CPU itself, effectively dashing any hope of homelabbing, since all chipsets on Ryzen boards do that. I'm not too sure how that was causing the `nan` issues, but I've got other ROCm + pytorch 2.0 things working on a single 6800XT. Not a big deal, since my 4070Ti works on the 2nd PCIe slot.

I resumed training on a smaller range of the dataset (I think I set it to only pick utterances with phoneme lengths under 64 tokens and durations under 8 seconds?) to try and chew through an epoch faster, as I didn't have much time left to work on pivoting away from the "have a dataset pool from paths rather than speakers" approach before conking out. Doing this lets me comfortably double the batch size to 16, and the ETA is about half of what it was before with the "full" dataset (ETA 60 hours).
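In practical terms, the cutoff amounts to something like this; the field names are assumptions about the prepared metadata, not the trainer's actual config keys:

```python
# Hypothetical filter for the throughput-focused subset of the dataset.
MAX_PHONEMES = 64
MAX_SECONDS = 8.0

entries = [
    {"path": "a.qnt.pt", "phoneme_len": 42,  "duration": 5.1},   # toy metadata
    {"path": "b.qnt.pt", "phoneme_len": 128, "duration": 12.5},
]

reduced = [e for e in entries if e["phoneme_len"] < MAX_PHONEMES and e["duration"] < MAX_SECONDS]
print(len(reduced))  # -> 1
```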

I did some inferencing tests before this and... zeroshot is pretty inconsistent with the non-LibriTTS voices, and using LibriTTS voices has it consistently fall apart. I'll just cope and say it's from the model not being as trained as the transformer-based model was when I did do inference tests. Although I'm now realizing I might have not been loading the right copy of the model, but I doubt that's the issue.

Some voices that shouldn't have had enough training time compared to the audiobooks performed a little too well, for what it's worth, which I guess is a good sign, but it's still an issue of getting consistent speech that doesn't sound like some alien trying to mimic human speech.

Testing the RetNet's capabilities of extending well past the context window it was trained against seems promising. It's just hard to judge how well it works without the baseline performing consistently during inference. It doesn't sound like it completely falls apart, but it seems that after an arbitrary point, and for an arbitrary length, the voice will sound different, and then snap back to "normal" after (normal insofar being that it sounds like the output in the beginning).

I suppose the ultimate test is seeing how training on a much smaller context window will fare with the RetNet. Though, I'm just not too sure how to go about this, since the last time I tried pivoting to a narrower dataset, it seemed to be for the worse, and I don't think it's feasible to do another full epoch for marginal improvement. I suppose I might need to dump in more data to help with the speech, but I'm not too sure how necessary that is, as I've demonstrated the models *can* speak before.


[runpod.io](https://runpod.io/) has H100s, so I suppose I'll paypig and give a real GPU a spin.

desu it doesn't seem to be that bad of a deal compared to P*p*rsp*c*, so a few hours to dip my toes in the water shouldn't be too bad. RTX 6000 Adas and 4090s seem much cheaper than I remember, so I can also pivot to either before scaling up even more if I find them to be a bit better of a value proposition compared to H100s.


The throughput increase was... not as much as I was expecting. The ETA with the full dataset on one H100 racked up to 80 hours, and throwing four H100s had it go down to a little above 22 hours without specifying any ZeRO configurations or special flavors of optimizers.

I'm going to pivot towards renting 4090s and see if they're a better value, since I'm starting to think training these models just doesn't scale all that well, whether by throwing more compute at it from cards of the same/similar arch or by increasing the batch size.

I suppose $10 down the drain isn't so bad.


I burned another $10 on nothing overnight for the 2x4090s because the training got botched two hours in and ran out of disk space. I suppose I should really re-implement the "keep the X latest checkpoints" behavior from AIVC into the VALL-E trainer.
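Something like this small sketch would cover it (the checkpoint directory layout here is an assumption, not DeepSpeed's exact one):

```python
# Small sketch: delete all but the N most recently modified checkpoints.
# The directory layout is an assumption; adjust for however checkpoints land on disk.
from pathlib import Path
import shutil

def prune_checkpoints(ckpt_dir: str, keep: int = 4):
    entries = sorted(Path(ckpt_dir).iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
    for stale in entries[keep:]:
        if stale.is_dir():
            shutil.rmtree(stale)
        else:
            stale.unlink()

# e.g. call right after each save:
# prune_checkpoints("./training/valle/ckpt", keep=4)
```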

While it's not quite an apples-to-apples comparison, because:

  • the dataset is using the reduced dataset to speed up throughput (capping the duration to be no bigger than 8 seconds of audio)
  • I'm using ZeRO 3 / ZeRO++

the ETA when using 2x4090s rounds out to about 20 hours. With the 2x reduction in ETA from locally pivoting from the full dataset to this throughput-focused one, this should be about comparable with the ETA of 22 hours the 4xH100s clocked in at for the full dataset (I think; I'm pretty sure I had pivoted to it for that test).

While I think I should give the 4xH100s another shot with ZeRO, I don't think it's going to change much, as the ~4x in price isn't worth it for a marginal change in throughput. I suppose the purpose of large enterprise GPUs like the H100s is extremely large models rather than tiny ones, so they're not a good fit for me. If anything, I suppose I can get away with renting a cluster of 4070Tis, or a giant cluster of V100s.

I suppose I'll have it eat out another full epoch as I have the quarter-sized model baking on my local machine. It's starting to sound better after putting more time into it, so I guess it's just a matter of giving it some more time to develop.

Author
Owner

I decided to move this out of the """blog""" update comment above, since it should be its own section for me to continue updating my thoughts on:


It seems M$ has an answer to Zucc's voicebox or whatever it's called: https://www.microsoft.com/en-us/research/project/speechx/

Giving it a look, it seems like a rather simpler method than the "flow-matching"-based one Zucc's voicebox uses: with some light adapting, they turned an existing VALL-E model into one that can do more than zero-shot voice cloning, by making use of special tokens and formatting in the input prompt and procedurally generating output that fits the task at hand. It's something I could even try and pivot to, since the paper explicitly mentions using an existing VALL-E model to start with, details how it goes about preparing the target outputs to train against, and even mentions how well it boosted the zero-shot capabilities.

It's something I should probably look into once the model is good enough, since the methodology seems very straightforward. I think all the necessary changes can be done in the dataloader's `__getitem__`, modifying all the input data accordingly as procedural post-processing. I think the only issue would be any of the noise-related tasks, as I would have to decode the EnCodec tokens, apply the noise to the waveform, and then re-encode them. Sure, I could have that as an ahead-of-time dataset, but then the noise wouldn't be random per utterance sampled.
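As a very rough sketch of what that `__getitem__` post-processing could look like (the task names and reserved token ids here are made up for illustration, not SpeechX's actual scheme):

```python
# Rough sketch of per-item task augmentation in a Dataset's __getitem__.
# Task names and reserved token ids are made up for illustration only.
import random
import torch
from torch.utils.data import Dataset

class TaskAugmentedDataset(Dataset):
    def __init__(self, samples, task_tokens=None):
        self.samples = samples  # list of dicts: {"phonemes": LongTensor, "codes": LongTensor}
        self.task_tokens = task_tokens or {"tts": 0, "denoise": 1, "edit": 2}

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        task = random.choice(list(self.task_tokens))
        phonemes, codes = sample["phonemes"], sample["codes"]
        if task == "denoise":
            # The noise tasks would really need decode -> mix noise -> re-encode
            # of the EnCodec codes; that expensive part is omitted here.
            pass
        task_tok = torch.tensor([self.task_tokens[task]], dtype=phonemes.dtype)
        return {"task": task, "phonemes": torch.cat([task_tok, phonemes]), "codes": codes}
```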

It also reminds me I should give the VALL-E X paper another look, since, knowing so much more now than I do before, I should be able to have insight on how they went about it.


Thanks for everything you're doing to replicate this project; Microcuck will never give us access to these tools.


The work and commentary is awesome!

The new SpeechX (and Voicebox) model showcases content editing capabilities where some parts of a sentence are replaced but other parts remain intact (not referring to background editing). Can the VALL-E model do this too?
I am keen to donate A100s or H100s if you would like to use them. Would it be helpful?

Author
Owner

> Thanks for everything you're doing to replicate this project; Microcuck will never give us access to these tools.

The one thing that puzzles me is that no code has been released for SpeechX, NaturalSpeech2, or VALL-E. I understand them not releasing weights, but no code is a bit silly, since it still requires an """ethicist""" with the compute to bake up some weights effectively.

I suppose that's just the nature of voice synthesis; there's no need to be competitive, so everyone can toss up paper tigers with their research papers.

> The new SpeechX (and Voicebox) model showcases content editing capabilities where some parts of a sentence are replaced but other parts remain intact (not referring to background editing). Can the VALL-E model do this too?

Yes-ish. Yes, because the core of SpeechX *is* still VALL-E. Ish, because it still requires training for that task, but it's definitely doable for me to implement.

The only challenges I have imagined up are:

  • extending the `proms_emb`'s input count from 1024 (+1 for the AR) by an extra couple of tokens to mark a task in the input prompt. This is allegedly easy, but it'd be kind of a pickle to automagically extend weights from a pre-extension model (I have ideas, but I need to figure out what is feasible for the end user who happens to have weights from before this change).
  • removal/extraction/editing tasks might have a nasty penalty from preparing a batch in the dataloader from having to decode audio, merge the waveforms, then reencode the merge. I imagine the dataloader workers are sane enough to make this hit very negligible.
  • editing tasks *will* require a bit of a paradigm shift in preparing/storing the dataset. I'll probably need to look over the paper again, but I imagine this requires word-level timestamps in the transcription to be able to split up words and insert different words in the input prompt. Not a huge deal to anyone who happened to use [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning)'s transcriber with WhisperX (faster-whisper backend), as I have it store the word-level timestamps, but most of my dataset is still from before that, so I would need to re-transcribe my dataset.

Since SpeechX's paper shows that VALL-E can easily be extended to more tasks than just text-to-speech, I feel like it'd be cool to also add in my own set of tasks, but I can't really think of any other tasks to have it do outside of something similar to VITS/RVC. The only catch is that I would need to rely on RVC to generate my own dataset for training such a task.

> I am keen to donate A100s or H100s if you would like to use them. Would it be helpful?

mmm.

While I'm an arbitrary amount more comfortable with the notion, now that I'm much more confident in the implementation being able to spin out something that semi-works (and with distributed training, more or less), I still feel I can't do that just yet. The 2x4090s felt very comfortable to use, despite there being some weird quirks to contend with. The batch size was able to be set to a comfortable amount that wasn't wasting VRAM to account for the chance of OOMing during the backwards pass.

When I was testing on the rental H100s, I did not feel comfortable at all, as I felt I wasn't able to nail down a comfortable batch size without risking OOMing in the backwards pass, and the uplift in throughput felt very flaccid in comparison. Although, I suppose I should revisit it with ZeRO/ZeRO++ when I get the chance.

I appreciate the offer, and I'll definitely keep it in mind. Aside from the "ugh... just not comfy enough..." cope, I do feel there are still quite a few things I need to adjust with my methodology, like expanding my dataset, figuring out the best way to train it in terms of learning rate, and maybe pivoting away from DeepSpeed's quantization as I don't think it's working (there's just no documentation on how to actually use it), and some other things that should be done sooner than later.

Author
Owner

Having said that, I do feel much more confident in the models.

Over the course of however long I left the model to bake on the 2x4090s, I also had the quarter-sized model training on my 4070Ti, and the improvements were looking (sounding) good. It's able to actually produce rather coherent speech, but still has a lot of room for improvement. The "clone-ability" is still lacking, but I trust there are enough parameters for it to grow stronger.

The full size model is improving too. It's definitely getting better at trying to clone speech, and the general "linguistics" it produces are getting more and more consistent. Testing the RetNet capabilities, it definitely *can* extend past the context size it was trained on... but it seems to produce a stop token after extending past 3x (from one sentence to three sentences). I suppose things will get better as it trains more.

With even the quarter-sized model being able to provide decent-ish speech for its size, and the full size being able to work, it makes me curious to try and see how well a model larger than the full size (probably 30 layers?) would compete, but I imagine the speed, and even training it, would be a pain.

However, inferencing still feels very inconsistent. It's pretty much a coin flip as to whether or not the output is good enough, while a lot of the evaluation/validation output sounds fine. I'm not sure where the issue might lie, as the inferencing code is pretty much the evaluation/validation code, and it can't be an issue of it being text from outside the dataset: the validation output sounds fine, and using text from the dataset also has inconsistencies. I'll have to double check my work, I suppose.

  • Also, there's the issue of needing to set the AR temperature to 0.95 to get consistent output, anything more or less and it risks being fucked.

The training code also got some love, from distributed training finally working, to it being able to (properly) prune old checkpoints.
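
A rough sketch of the "keep the X latest checkpoints" behavior, with the checkpoint directory layout and count being assumptions rather than the trainer's actual convention:

```python
from pathlib import Path
import shutil

def prune_checkpoints(ckpt_dir, keep=4):
    """Keep only the `keep` most recent checkpoints under ckpt_dir (layout assumed)."""
    ckpts = sorted(Path(ckpt_dir).iterdir(), key=lambda p: p.stat().st_mtime)
    for stale in ckpts[:-keep]:  # everything except the newest `keep` entries
        shutil.rmtree(stale) if stale.is_dir() else stale.unlink()
```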

However, I did catch one assumption I was wrong about. I assumed that the default provided dataloader and sampling technique would have every piece of the dataset visited once within an epoch, but after auditing the dataloader and sampler, I was grossly wrong. It's effectively just randomly picking a speaker and then randomly picking an utterance, which isn't quite the same as what I'm doing of having an "epoch" cover every speaker, with every speaker picking a random utterance. My ***entire*** ick about my training getting interrupted mid-epoch was for naught, as a full epoch in fact did not guarantee the entire dataset was visited. I suppose this would explain how past training experiments had some speakers thrive despite said speakers barely being visited. I suppose it would be better to, instead, ignore the sampler and just have the dataloader shuffle and pick from the paths. There's the `interleave_reorder` call to guarantee the list of paths to pick from is balanced, but I think it incurs quite the performance cost. I'll just have to gut out most of that code then.
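
To make the difference concrete, a small sketch contrasting the two sampling behaviors, with illustrative names rather than the actual dataloader code:

```python
import random

def sample_default_style(speakers):
    """What the stock sampler effectively does each step: random speaker, then random utterance."""
    spk = random.choice(list(speakers))   # speakers: dict[str, list[path]]
    return random.choice(speakers[spk])   # no guarantee every path gets visited in an "epoch"

def shuffled_epoch(speakers):
    """Visit every utterance exactly once per epoch by shuffling the flat path list."""
    paths = [p for utts in speakers.values() for p in utts]
    random.shuffle(paths)
    return paths
```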

Aside from the shocking revelation, I think I'm quite comfortable with just leaving the model to train on a smaller learning rate now, as it seems to be improving over the past few days with the more-explicit "sample by speakers" approach. A RetNet works, and so does the model. I just need to give it time to improve and not be half-assed. Although, I think I need to expand the dataset with more parts of LibriLight. I don't think ~4.5k hours / ~3.5k speakers is going to cut it.

Lastly, the changes to the dataset by including punctuation in the phonemes definitely improved the speech output. I incidentally compared against the P3 Mitsuru finetune samples I put out and my god was that awful in terms of pacing. The current outputs I got out of it sounded much more natural.

I don't really have any metrics to show, since the actual numbers don't indicate that there's an improvement, but I'll try and provide a batch of samples when I get the chance.


Apologies if this sounds quite stilted and all over the place. I've had quite the notes saved up in my head, but as soon as I needed to recall them all, my brain turned to mush.

Author
Owner

Implemented the SpeechX tasks. New models will just need to ensure that the model configuration has `task` set to at least 8 to guarantee enough extra tokens. Existing models that want to pivot to using SpeechX tasks will need to be exported as fp32 weights and then re-used with `trainer.load_state_dict = True`. I'm able to modify the state dict to be expanded to a newly specified prompt embedding size (say, if you change your prompt levels or add in extra tokens for tasks).
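
As a sketch of what that state-dict surgery could look like (the `proms_emb.weight` key, row layout, init, and file names are assumptions for illustration, not the repo's exact code):

```python
import torch

def expand_embedding(state_dict, key="proms_emb.weight", extra_tokens=8):
    """Append rows for new task tokens to an embedding table in an fp32 state dict."""
    old = state_dict[key]                                  # assumed shape [n_tokens, d_model]
    extra = torch.empty(extra_tokens, old.shape[-1], dtype=old.dtype)
    torch.nn.init.normal_(extra, std=0.02)                 # fresh rows for the task tokens
    state_dict[key] = torch.cat([old, extra], dim=0)
    return state_dict

# e.g. expand an exported fp32 checkpoint before handing it back to the trainer
sd = torch.load("fp32.pth", map_location="cpu")
sd = expand_embedding(sd)
torch.save(sd, "fp32.expanded.pth")
```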

~~I still need to figure out an elegant way to go about implementing the clean/noisy speech editing, as I'm very certain I need to grab word-level timestamps *unless* I go an extremely dirty way of stitching three utterances together as the input prompt, and the target is the middle one changed. I ***guess*** with a large enough dataset and enough training, the model should be robust enough from any errors.~~ As soon as I typed that, I realized I can just do that. Each utterance *is* guaranteed to be no more solid of a piece of an utterance than it already is for TTS. The only problem would be not getting matching tone between pre/mid/edit/post, but I'm not able to guarantee that regardless of what methodology I use.
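
A hypothetical sketch of that "just stitch utterances together" approach, assuming four utterances from the same speaker and illustrative field names (the exact task semantics here are my guess, not the paper's recipe):

```python
import torch

def make_edit_sample(pre, mid, edit, post):
    """Dirty speech-editing sample; each arg is a dict with "codes" ([T, levels] tensor) and "text"."""
    prom = torch.cat([pre["codes"], mid["codes"], post["codes"]])    # original audio, middle intact
    resp = torch.cat([pre["codes"], edit["codes"], post["codes"]])   # target: middle replaced
    text = pre["text"] + edit["text"] + post["text"]                 # text of the desired output
    return {"task": "nse", "prom": prom, "text": text, "resp": resp}
```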

I've tested everything except the speech editing. I just wanted to whip something up when I had the revelation before going to bed. I did a brief test making sure everything got outputted right for the other tasks, at least.

I'm not really sure *when* to pivot to allowing the SpeechX tasks to be used during training. The paper mentions it really does help to have a solid VALL-E base anyways, and even if I did have one, training with SpeechX tasks is quite the pain, as it effectively eats up a lot more VRAM, both from having the EnCodec model loaded (despite specifying to do it on the CPU side) and from the extra prompt/target sizes that will make tight setups OOM during the backwards pass. I don't notice too big of a performance penalty from having these tasks enabled; they're rather quick, and I imagine the dataloader can process them all before a forward pass completes. The only issue is that the EnCodec model will have to be duplicated across all worker processes.

The other issue I can think of is that there's just not enough bandwidth to resolve anything with noise in the decoded waveform.

The other issue is that the paper doesn't seem very clear on saying if the task-tailored input prompts are only for the AR or for both. Realistically, I don't think the NAR needs to be trained for these tasks, as the first level should be more than enough to guide the rest of the output. But who knows.


I suppose I'll give training with the SpeechX tasks a shot for a day or so and see where it takes me. My qualms:

  • it's quite the VRAM hog to include additional tasks into the mix. `nse`, `cse`, and `tse` require, at worst, 3x the data (prom => target vs pre+mid+post prom => pre+edit+post target), 2.5x the data (prom => target vs pre+post => pre+edit+post target), and 1.5x the data (prom => target vs prom+overlayed target => target), respectively. To comfortably train, I had to set my batch size to 4 (from 16), especially while under the reduced dataset.
  • I'm not sure if it's too early to be doing so as the base TTS model isn't quite mature enough. On the other hand, maybe incorporating the tasks will help bolster the zero-shot's clone-ability that seems to still be lacking.
  • I technically still need to modify the inference script to allow for utilizing the other tasks, but then again, I haven't had proper inferencing for ages while I had the model training to begin with.
  • anything involving noise, I'm just not that confident in.
    • to reiterate, there's just not enough bandwidth to resolve a noisy 2-RVQ-bin EnCodec sequence, even with Vocos. SHODAN barely sounds fine as-is, and in testing, overlaying noise onto a lot of utterances just destroys the clip entirely. I hope 4 RVQ bins will do if I grow a wild hair and expand to it.
    • the loudness of the noise is very inconsistent. Setting EnCodec to normalize just utterly destroys the audio during prompt setup, but I think if, at dataset preparation time, I normalize the noise there, I can have a one-size-fits-all scale for the noise so it's not obnoxious (see the sketch after this list). After all, I just need to tell the model to try and differentiate between (un)wanted noise and speech.
    • sampling from a noise dataset was tacked on very lazily. It requires the end user training the model to provide their own noise dataset, although it's quite easy. I could probably just provide the quantized noise dataset to jumpstart homebrew trainings.
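
A minimal sketch of that dataset-preparation-time normalization, assuming torchaudio for I/O and an arbitrary RMS target:

```python
import torchaudio

def normalize_noise(path, target_rms=0.05):
    """Scale a noise clip to a fixed RMS before quantizing it; the target value is arbitrary."""
    wave, sr = torchaudio.load(path)                    # [channels, samples]
    rms = wave.pow(2).mean().sqrt().clamp(min=1e-8)     # avoid dividing by silence
    return wave * (target_rms / rms), sr
```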

Either way, it was a nice fun exercise to test my knowledge to try and incorporate it in under a night or two, and if I cared, I can boast that my implementation has more features.

Author
Owner

Did some more sanity checks with the trainer.

I realized that it would *probably* be saner to, instead, process the input prompts / target responses at the maximum RVQ bin/quant level (8), and then trim off the remainder when finalizing the batch. This *should* help make anything using merged audio (target speaker extraction, anything with noise) work better, from working with as much bandwidth as possible rather than the yucky 2 RVQ bins. I actually haven't gotten a chance to validate the noise audio all that much, as I think right after getting everything running again, I crashed out.
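
A small sketch of that "quantize at 8, trim when collating" idea, assuming codes are stored as `[T, 8]` tensors (the actual storage layout may differ):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def finalize_batch(codes_list, levels=2):
    """Trim full 8-level EnCodec codes down to the configured level count at collate time."""
    trimmed = [codes[:, :levels] for codes in codes_list]    # drop the unused bins
    lengths = torch.tensor([c.shape[0] for c in trimmed])
    return pad_sequence(trimmed, batch_first=True), lengths  # [B, T_max, levels]
```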

Additionally, I reused the `proms_emb` "when loading the state dict, handle resizing it when adding in the special task tokens / adjusting the `prom_levels`" code to also adjust the `resps_emb` when increasing the output quant levels (for the NAR, mostly). Doing so, I grew a wild hair and went down a rabbit hole of targeting 4 RVQ bins entirely (the NAR handles 3 levels instead of 1 now), biting the bullet to try and aim for better sounding outputs. I should have probably stuck to the full 8 while I'm doing this already, but I think the difference between 4 and 8 with Vocos is very much marginal.

Having said that, I did also bite the bullet and toss another $50 into rental training, but this time on 4x4090s instead; at a batch size of 64, it can eat through an "epoch" of 3.5k speakers in a minute, a huge increase over my one 4070Ti eking it out at about 10 minutes. I ***think*** the way to fully utilize cards with more oomph to them is through increasing the RVQ bins being processed, rather than increasing the batch size, but this is just conjecture at the moment. I'm mostly using this to try and bring the NAR back up to par now that it has to contend with two more RVQ bin levels to output, and to *kind of* repair the AR (doing the tasks training this early had its problems).

While I'm doing this, I'm putting the quarter-sized model under a similar treatment on my 4070Ti, where I'm pivoting from targeting 2 RVQ bins to 4 and letting things re-bake to help it out. I trust there'll be enough progress from both of them over the course of the estimated 22 hours until I burn up the credit I spent on runpod, just in time for the weekend to be over.

Aside from those detours, I'm hoping (coping) that this is the last of the detours and I can go back to shutting the fuck up and letting the model train without interruptions. Worst case, the final interruption would be to add in more data to the dataset.

Just wish I had some audio samples or metric curves to provide, but my logs are completely tainted. Although, I think at this point the training metric curves are useless as any progression in the losses/accuracies are very noisy.

Author
Owner

Eh, I guess I can provide samples: [early this morning](https://files.catbox.moe/mc6e9o.zip) and [a few minutes ago](https://files.catbox.moe/n4r6v1.zip).

The actual speech sounds fine and mostly consistent, I'd say an arbitrary 70%, but it's still not as consistent as I want it to be. The errors I can pinpoint:

  • it's a bit tricky to explain, but the output quality seems to be predicated on the NAR still not being mature enough. As I'm adding two more quant-levels to the mix now, those levels being too immature could easily muck things up.
  • the clips that are either mostly silence, a random utterance, or extremely gravelly, I can pinpoint to the AR being fed a bad input prompt. I noticed one input prompt having two different speakers in it (I suppose this *might* be a problem with the donated audiobooks not having one narrator), and another input prompt being very crusty.
  • voice clone-ability still kinda sucks. In the first ZIP:
    • FF7R Tifa sounds debatably close, which is good, but a lot of the LibriTTS utterances don't sound that close to the reference clip (although, it does sound close to the input prompt, which I suppose is fine then).
    • Then there's P3 Ryuji in the second ZIP that doesn't sound anywhere close to Yuri Lowenthal (well, I'm sure there could be one character he voices that could be close to it, but it's definitely not what I have in the dataset). However, the fact that the output is still actual speech means it's not that bad. During my inference tests of the past, I felt a lot of the voices just weren't consistent in producing even usable output.

Training on the 4x4090s is being a bit of a pill. I woke up this morning to it having already run out of disk space, yet it somehow kept running? I had to stitch together the good checkpoint of the weights with the optimizer states of the last known checkpoint, and restart from the FP32 weights. After going back to sleep and waking up again, I feel like the training kept resetting itself a few times as I'm not saving often enough, and I suppose if one device OOMs, then it'll hang and not properly save. I thought the old implementation handled that fine, but I suppose I botched something over time.

Either way, I think I need to wait for the NAR's outputted RVQ bins 3 and 4 to mature again. It definitely picked up the pace over the few hours overnight I had it train on just that, so I expect the rest of the day to be rather fruitful. I still haven't gotten a chance to test whether finetuning will magically save my hide. I hope so, since I can at least release the weights with good confidence if they can easily be finetuned into something competent. I just really do not want to release the weights when it outputs nothing but unsalvageable doodoopoopoo.

And if not, I suppose I'll have to accept the offer for using the A100s/H100s. I think the training protocol is solid enough *now* that I can comfortably either:

  • be able to quickly spin up a training environment on any other training rig and not doom over feeling like precious training time will get wasted on silly errors. I should have every issue that crops up with the training script patched up now.
  • be able to easily hand off the weights and dataset and a foolproof procedure for someone else to spin up the training on their own hardware. My only qualm with that is I might need to drop the audiobooks from the dataset and replace them with more of LibriLight, as I feel the audiobooks are kind of still in a gray area with being distributed in a dataset.

Regardless, I'm HOPING that I will finally get somewhere by the end of this week for sure. I think I have exhausted all possible detours now, and I just need to shut the fuck up and let it train without being wishy-washy.


alsothere'sthequartersizedmodelstillbeingtrainedbutIfeelit'sagenuinefool'serrandtoexpectittoproduceanythingfruitfulatthtispoint


One more sample batch: [here](https://files.catbox.moe/075x7y.zip)

I suppose in hindsight it's obvious that I should have been paying much more attention to the input prompt being fed rather than just comparing it to the target reference clip. A lot of the generated output *does* match the input prompt being fed, although there are still a few times where the output is a bit busted and breaks. Notably, I did hypothesize about being able to generate a "random" voice with an empty input prompt, and it seemed that one of the evaluation outputs *did* do just that with a piece of dead air in the input prompt.

It sucks that I don't have any consistent metrics now that I've been pushing the models back and forth between my 4070Ti and the 4090s I've been renting; I wish I knew how many samples/epochs (in terms of the dataset) have passed now, but if I remember right, I think a little under an epoch of the full dataset (two epochs of the reduced dataset that makes training more stable) has passed, so I ***suppose*** this puts my weights at five epochs' worth of data compared to the whole dataset?

Regardless, I'll probably dump another $50 to keep the model training for another day as it seems to be cleaning up things rather quickly with 4x4090s to handle it. I'm having my 4070Ti handle transcribing and processing LibriLight-6K proper to supplement the dataset for when I do release it alongside the weights.

I should probably use this time as well to play around with inferencing again and finetuning the weights with my 6800XT idly sitting around, but it's just a bit of a chore to export and jettison the weights from the rental rig.


Hey @mrq , I sent you an email to mrq@ecker.tech reaching out about some things. Let me know if you’ve seen it and are able to respond there, thanks!

Author
Owner

> Hey @mrq , I sent you an email to mrq@ecker.tech reaching out about some things. Let me know if you’ve seen it and are able to respond there, thanks!

mmm.

I suppose I'll generalize-address it to the other people with similar propositions (including one from a month ago that I feel a bit bad for just now catching). I'll preface that I do not intend to be abrasive, blunt, mistrusting, or a schizo, but it'd be better to bite my tongue for most of my thoughts and be curt about it than spend another hour (out of the probably three I've spent trying to make things "elegant"):

While I do appreciate the offers to converse and collaborate, out of my dozens of copes I had to redact from giving:

  • I'm not some prestigious researcher in the AI/ML space. I'm just a mal-adjusted, self-taught programmer. I have no qualifications.
  • I just cannot resolve why I'm needed when I trust that people with qualifications and/or hardware would be able to spin up their own datasets and models much better than I, biased by the most cost-effective DIY approaches, could do.
  • I'm always one to not keep my privates private. My (shoddy) work, observations, and thoughts are out here in the open for others to (allegedly) appreciate. I don't think such exchanges should be made in private.
    • (This point I feel is most susceptible to degenerating into schizobabble, so I'm definitely cutting myself off here).

I'm going to have to decline.

Gomen.


I forgot to also provide some more samples: [pre-pivoting to the full dataset](https://files.catbox.moe/rd8sgl.zip) and [post-pivoting to the full dataset](https://files.catbox.moe/c3ahge.zip) ("full" dataset being without reducing the dataset to utterances shorter than 8 seconds to increase throughput).

I'm using the last day's worth of credit on runpod on 4x3090s to try and wrap the model up with being fed longer utterances again to see how it shapes up, and I think it's for the worse right now. While it's probably only a difference of maybe `1000*64*4*4` samples, I feel the latest evaluation/validation outputs sound worse. At least I've made backups in case I do need to revert, but yeesh.

On another note, it appears that the 4x3090s have a rather similar throughput to the 4x4090s. Kind of sucks, since I could have just used those instead of the 4090s that are almost twice the price. Especially sucks, since I could have just bought a 3090 to begin with instead of a 4070Ti since there's effectively not much of a difference between Ampere and Ada for this workload.

Oh well. I shouldn't try and sweat over it so much and get some rest while the model continues training and my local system is properly preparing and transcribing 6K hours of LibriLight.


@mrq Appreciate the response, and I totally get it. Thanks for letting me know, and good luck with all the work you’re doing here.

Author
Owner

Although, if you do have any questions, concerns, suggestions, whatever about using [mrq/vall-e](https://git.ecker.tech/mrq/vall-e) itself, I'll be happy to help out. I feel that the documentation is still pretty lacking, it's not straightforward to use, and digging through here for a detail is a fool's errand, so any outside input helps.


More [samples](https://files.catbox.moe/j6o5u6.zip), same kind of remarks: it sounds better after giving it more time from re-introducing longer utterances; I'll need to do inference tests to see if it did correlate to stabilizing longer utterances, etc. etc. I'm giving it another day to train while my 4070Ti continues to transcribe before pivoting to finetune tests.


Oh joy, another new Zucc toy: https://ai.meta.com/blog/seamless-m4t/. It seems to aim to "unify" a bunch of translation tasks between text and speech, and not just with a demo, but with code and weights too.

Author
Owner

mmm... I'm not sure if it's the recent *introspection*, or just constantly tending to the training and repo for days on end, but I'm feeling quite uneasy. In the event I do fuck off for the next few days, I'll (finally) go ahead and jettison my weights and dataset here: https://huggingface.co/ecker/vall-e.

I'll preface that the model output is by no means perfect; I feel it's *serviceable* at best. Sometimes it beats TorToiSe output, but there are still too many inconsistencies at the moment (I could probably apply a similar cope bandaid to TorToiSe's CLVP/CVVP and generate in a batch and pick the best of the bunch). Aside from that, it'll be a good starting point for anyone looking to train from existing weights or finetune them with their own dataset.

I'm also going to provide a "libre" copy of my dataset in the repo too. "Libre", as it'll contain the LibriTTS and LibriLight-6K portions, with all the other gray-area-acquired data left out; the donated audiobooks that I'm still grateful for, the rips from muh vidya, etc. are culled. While I've been occasionally watching my 4070Ti transcribe LibriLight-6K proper, I'm reminded that the biggest hurdle when training a model *is* the dataset, and it would be very beneficial for anyone to have one as a starting point.


For sure, having an already prepared dataset is very helpful. I had tried the script for your provided dataset that you had in the readme, but there were errors unpickling the audios that I couldn’t resolve. Maybe that is just due to dependency differences.

What kind of latency are you seeing with the model compared to tortoise? Tortoise was too slow, I’m expecting vall-e will also be slow without quantization and/or model distillation.

Author
Owner

> For sure, having an already prepared dataset is very helpful. I had tried the script for your provided dataset that you had in the readme, but there were errors unpickling the audios that I couldn’t resolve. Maybe that is just due to dependency differences.

Yeah, the `prepare_*.sh` scripts are relics from several months ago, when they were for quickly preparing a dataset to train with on rentals. I never got around to replacing them since I had my own draconian method of preparing datasets.

I might go back and provide a script to create one from a pile of audio files instead, but it would have to be predicated on replacing/rewriting AIVC.

> What kind of latency are you seeing with the model compared to tortoise? Tortoise was too slow, I’m expecting vall-e will also be slow without quantization and/or model distillation.

I need to do proper benchmarks, but inferencing with VALL-E is very snappy even with the weights at float32 after giving the inference script some love.

  • the AR is the primary bottleneck due to being recurrent, but my 6800XT tops out at 20it/s while my 4070Ti caps at 55it/s (where 75 iterations equals one second of audio), without much of a speed drop across an average-length sentence (about 6 seconds).
  • the NAR is near-instant, which is a given since it always requires at most `resps_length - 1` passes.
  • this is also ignoring being able to batch multiple generations at once to further increase throughput; TorToiSe *can* do batched inferencing I imagine, but it seems to be quite the pain to implement.
  • this is also factoring in that the RetNet's chunked recurrent forward pass doesn't seem to be working at the moment, as the documentation is still rather hazy on how to utilize it.

I'm definitely pleased by the speeds I'm getting now with VALL-E, and I feel there's much more room for improvement. Compared to TorToiSe, the only limiting factor is the AR's throughput speed (the NAR and EnCodec/Vocos decoding are practically instant for all intents and purposes) instead of TorToiSe's batching in the AR + CLIP/CLVP candidate sampling + diffusion sampling + vocoder.
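To put those it/s figures in wall-clock terms, here's a minimal back-of-the-envelope sketch, assuming (as above) that one AR step emits one EnCodec frame and 75 frames make up a second of audio; the throughput numbers are just the ones quoted in this post:

```python
# Rough real-time-factor math for the AR stage only, assuming 75 AR steps per
# second of generated audio (EnCodec's frame rate at 24kHz) and treating the
# NAR and EnCodec/Vocos decoding as effectively free, per the remarks above.
FRAMES_PER_SECOND = 75

def ar_realtime_factor(it_per_s: float) -> float:
    """Values above 1.0 mean faster than real time."""
    return it_per_s / FRAMES_PER_SECOND

def ar_wall_clock(seconds_of_audio: float, it_per_s: float) -> float:
    """Seconds spent in the AR to generate `seconds_of_audio` seconds of output."""
    return seconds_of_audio * FRAMES_PER_SECOND / it_per_s

for name, rate in [("6800XT", 20.0), ("4070Ti", 55.0)]:
    print(f"{name}: RTF ~{ar_realtime_factor(rate):.2f}, "
          f"~{ar_wall_clock(6.0, rate):.1f}s of AR time for a 6s sentence")
```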

Author
Owner

Having mentioned getting ~75% real-time speed, it sort-of opens up the idea of having streamed output in real time (or at the very least, buffered), but:

  • desu I'm not too keen on how robust an EnCodec code sequence is to being sliced. I've been noticing a good amount of eval/val output having harsh sounds for a brief moment at the beginning, and it's dawning on me that my "randomly slice an EnCodec sequence when creating the input prompt" approach isn't a wise move. I'll leave why this emerges as an issue for a neural encoder/decoder that attends to tokens within a window as an exercise to the reader.
  • there don't seem to be any examples of streaming output through EnCodec, so implementing it is an exercise left for me to figure out. I imagine there has to be some funny business with the overlap window, plus some brushing up on my literature. Coincidentally, there's an issue opened the other day on the EnCodec repo asking about it, but only because the paper references it.
  • even with a 75it/s throughput from the AR (and assuming everything else is instantaneous), the NAR seemingly requires the AR sequence in full to inference the remaining residuals. I'm not too keen on how robust the NAR is when pseudo-causally sampling (in the sense of sampling one frame at a time); I'll have to reread the code and the paper to make sure I *can* actually do this.
    • the other remedy is to follow in MusicGen's footsteps and drop the NAR for an AR that interleaves the remaining residuals (a rough sketch of such an interleave follows this post). Ignoring the theoretical performance penalty of 2x/4x/8x (depending on the target quant level) (this *could* be remedied with chunkwise recurrent forwards, but again, lack of documentation), this would require either a retrain from scratch or nasty bandaids to reuse the existing weights (it *can* be done; it shouldn't be any different than extending the NAR to handle more quant levels) and training from those weights.

It's just a thought that crossed my mind yesterday. I don't expect getting around to toying with it anytime soon, but it's something that can be done that TorToiSe (and I imagine a lot of other neural TTS systems) can't.
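To make the interleave idea concrete, here's a minimal sketch of the simplest flattening scheme, where a single AR emits every residual level itself at the cost of `n_q` steps per frame (the 2x/4x/8x penalty mentioned above); this is my own illustration, not this repo's code, and MusicGen itself ultimately favors a delay pattern rather than a full flatten:

```python
import numpy as np

def flatten_interleave(codes: np.ndarray) -> np.ndarray:
    """codes: [n_q, T] RVQ codes -> [n_q * T] single AR token stream.

    Frame t becomes the run [codes[0, t], codes[1, t], ..., codes[n_q-1, t]],
    so one AR predicts all residual levels, at n_q tokens per audio frame.
    """
    return codes.T.reshape(-1)

def unflatten(stream: np.ndarray, n_q: int) -> np.ndarray:
    """Invert the interleave back to [n_q, T]."""
    return stream.reshape(-1, n_q).T

codes = np.random.randint(0, 1024, size=(4, 6))  # stand-in for 4 quant levels
assert (unflatten(flatten_interleave(codes), 4) == codes).all()
```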


Streaming is very valuable but yeah it is surprisingly tough for most things.

Looks like you’re moving forward with RetNet, right? Why is that when the “vanilla” (no recurrent steps) transformer architectures are much more tried and tested at scale?

Author
Owner

> Looks like you’re moving forward with RetNet, right?

I might as well. I've put the most training time into this current model (ignoring the several weeks I spent on a dead-end model with ~500 hours of data and the worst LR scheduling imaginable).

I'd have to retrain from scratch, as the previous attention-based weights are rather flawed from a lack of punctuation in the phonemes. I *could* salvage them by gumming up the phoneme symmap, but why bother.

> (no recurrent steps)

Ackshually, the RetNet implementations work without needing to use the special `recurrent_forward` / `chunkwise_forward` passes; to my understanding those routines re-leverage some "internal states" from the initial pass to offer a throughput increase at little to no extra cost.

The analogue for attention-based transformers (or at least, GPT) would be a KV cache (which TorToiSe uses, but it incurs a memory cost, and I believe it didn't work under DirectML).

> Why is that when the “vanilla” transformer architectures are much more tried and tested at scale?

Training.

I've noted that the progression of training seemed noticeably faster in comparison to the attention-based "experiments": the model reached a given loss/accuracy much earlier in the epoch, and if I recall right, specific traits emerged earlier too; I felt it was good at capturing the acoustics much earlier, and while speech wasn't as precocious as I'd like, it still went from concerning to passable rather quickly.

The reduction in model size, and the optimizer tending to fewer parameters, led to a sizeable enough reduction in VRAM usage that I was able to pass the savings along to a larger batch size, leading to much better throughput in training.

However, the RetNet literature mentions that attention-based transformers under 2B parameters still outperform the RetNet, and only past that point do RetNets outshine them; I can't really say whether that's true without training another attention-based model.

Sure, I suppose that by sticking to an arch that has yet to see any actual use in the wild, I'm opting out of all the other bandaids like xformers or flash-attention or whatever warts exist to cope with how intensive transformers can be. I'm fine with that, partly because I really do not like those bandaids and how much extra complexity they add, and partly because it never got *that* far when scraping for savings for TorToiSe.

  • Technically, the original [implementation](https://github.com/enhuiz/vall-e) isn't necessarily a bog-standard transformer, but rather "inject sinusoidal position embeddings before passing the inputs through blocks that do pre-LayerNorms (or AdaLN, if a NAR), and in those blocks are masked attentions + feed-forwards" (an attention-based transformer, but with nuances).

> **Note**: I'm referring to transformers as "attention-based" because, for all intents and purposes, the only difference between a RetNet and a transformer is that typical transformers are attention-based and RetNets are retention-based. The only functional difference is in the math of the mechanisms for attending to input tokens.

Author
Owner

mmm...

Training is paused for the meantime on the runpod rentals. The improvements seem very marginal now, and I think I'm starting to hit a wall with how much continued training at a low LR will get me. I should be training it on the SpeechX tasks, but desu that's low priority right now, as the ***entire*** point of this *is* zero-shot TTS, and I feel it's something I should supervise on my 4070Ti locally rather than experiment with on rentals. Besides, with LibriLight-6K properly being added, I feel it would be better to wait until then.

The LibriLight-6K transcription finished two days earlier than expected, but quantizing everything is quite the pain with a measly 25it/s and ***a lot*** of utterances; I expect two more days until it's finished. I could try to speed this up with batching for EnCodec, but sequences will be padded to the longest sequence in a batch, and I'm not so sure there's an intuitive way to unpad (sketched below), although I'm sure the answer will be obvious once I bang my head against the wall to figure it out.
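For what it's worth, here's a minimal sketch of batched EnCodec encoding with per-sample trimming afterwards, assuming the 24kHz model (75 frames per second, i.e. a hop of 320 samples) and mono inputs already at the model's sample rate; the helper name and the exact frame-count rounding are my assumptions, not this repo's code:

```python
import math
import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 8 codebooks at 24kHz

def encode_batch(wavs: list[torch.Tensor]) -> list[torch.Tensor]:
    """wavs: list of mono [1, T_i] tensors at 24kHz -> list of [n_q, frames_i] codes."""
    lengths = [w.shape[-1] for w in wavs]
    max_len = max(lengths)
    # Right-pad every waveform to the longest one in the batch.
    batch = torch.stack([
        torch.nn.functional.pad(w, (0, max_len - w.shape[-1])) for w in wavs
    ])  # [B, 1, max_len]
    with torch.no_grad():
        frames = model.encode(batch)
    codes = torch.cat([f[0] for f in frames], dim=-1)  # [B, n_q, frames(max_len)]
    # Trim each sample back to the frame count its unpadded length would produce.
    hop = 320  # 24000 samples/s divided by 75 frames/s
    return [codes[i, :, : math.ceil(length / hop)] for i, length in enumerate(lengths)]
```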

I don't know. I want to focus on improving zero-shot with more speakers in the dataset (though I won't gain any new speakers, as I already had some weird portion of LibriLight-6K in it), but I still need to focus on getting consistent utterances outputted, and more utterances per speaker is the answer to that (as proven when I added the donated audiobooks, with fewer speakers but many more utterances per speaker). The other side is that zero-shot doesn't seem all that bad, as it does copy the input prompt; it's the input prompts themselves that are flawed at times and cause problems, so I might just be chasing the wrong animal entirely and need to improve my methodology for sampling input prompts.

Oh well, I should have my answer soon on what's best.


I'm looking to make use of multiple GPUs, but for all scripts used in the repo, it looks like it's overriding my PyTorch DataParallel settings, etc. with whatever's being set by deepspeed. Struggling to find where these are set in the configs (where are the deepspeed configs?). Are they [here](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/config.py)?

Author
Owner

> looks like it's overriding my PyTorch DataParallel settings, etc with whatever's being set by deepspeed

Most likely. DeepSpeed handles whatever distributed training initialization it calls for. I don't recall if you can specify a communication backend (nccl, mpi, etc.) through command line arguments passed to DeepSpeed, or if it requires me setting it under [`./vall_e/engines/deepspeed.py`](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/engines/deepspeed.py#L29) (due to the nature of how I'm invoking DeepSpeed, it needs an explicit call somewhere to initialize the distributed shit).

> Struggling to find where these are set in the configs (where are the deepspeed configs?). Are they here?

[`./vall_e/config.py:271`](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/config.py#L271) correlates to the `config.yaml`'s `training.deepspeed` section and generates the DeepSpeed config on the fly (with values that work for me, but I'm sure it needs saner defaults, especially for ZeRO and quantization/compression training).

You can override any de-facto DeepSpeed config values by providing a JSON under `./data/ds_config.json` ([per line 361](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/config.py#L361)) with whatever it normally takes from [this mess of documentation](https://www.deepspeed.ai/docs/config-json/).

I honestly forgot I've had that override in from the very beginning, as I never ended up using it, and I should have it use `f'{cfg.cfg_path}/ds_config.json'` for overrides instead.
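As a minimal example of that override path, something like the following drops a hand-written config where the repo currently looks for it; the keys are standard DeepSpeed config-JSON keys, but the values are placeholders rather than recommendations, and whether they play nicely with the generated defaults is untested on my part:

```python
# Write a DeepSpeed override JSON to ./data/ds_config.json; keys present here
# take precedence over the config generated on the fly from config.yaml.
import json
from pathlib import Path

override = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}

Path("./data").mkdir(parents=True, exist_ok=True)
Path("./data/ds_config.json").write_text(json.dumps(override, indent=2))
```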


Thanks, I'll look into that.

And what about model size? How do you control that currently? I didn't see any params for it in config.yaml.

Author
Owner

Currently it's guided only by presets: `quarter`, `half`, and `full`, defined [here](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/config.py#L141) and selected in the YAML [here](https://git.ecker.tech/mrq/vall-e/src/branch/master/data/config.yaml#L36).

I need to add a way to specify either an explicit model size or a preset for better control (for example, size being a dict defining tokens/dim/heads/layers, or a string specifying a preset), along the lines of the sketch below.
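A rough sketch of the idea, where the preset numbers and field names are placeholders for illustration rather than the repo's actual values:

```python
# Resolve a `size` field that is either a preset name or an explicit dict.
from dataclasses import dataclass
from typing import Union

@dataclass
class ModelSize:
    tokens: int
    dim: int
    heads: int
    layers: int

# Placeholder numbers purely for illustration.
PRESETS = {
    "quarter": ModelSize(tokens=1024, dim=256,  heads=4,  layers=12),
    "half":    ModelSize(tokens=1024, dim=512,  heads=8,  layers=12),
    "full":    ModelSize(tokens=1024, dim=1024, heads=16, layers=12),
}

def resolve_size(size: Union[str, dict]) -> ModelSize:
    """Accept either a preset string or a dict of explicit dimensions."""
    if isinstance(size, str):
        return PRESETS[size]
    return ModelSize(**size)

# e.g. resolve_size("full") or resolve_size({"tokens": 1024, "dim": 1024, "heads": 16, "layers": 24})
```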

Author
Owner

Idle hands are truly the devil's workshop.

I'm getting tempted to make another poor purchase decision. My gut wants to go with a 7900XTX despite:

  • already knowing there's an inherent design flaw with ROCm only supporting cards in PCIe slots with lanes going directly to the CPU (PCIe atomics are not available with PCIe lanes through the chipset).
  • *allegedly* gfx1100 is still not officially supported, although it seems people are able to utilize it fine with few hoops to jump through.
  • my 6800XTs already seem to perform terribly under PyTorch/ROCm, although I've yet to actually test mine on pytorch2.0.1 + ROCm5.4. I'd like to cope that it's just because it's an almost 3-year-old card at this point, and RDNA3 has le AI accelerators on-die rather than just RT cores (which are pretty much ray-box/ray-triangle ASICs).
  • renting 3090s and 4090s shows that having a lot of VRAM wiggleroom is a huge creature comfort.
  • buying a 4090 instead is painful. Buying a 3090 just feels like a bad investment since it's an older card. Buying another 4070Ti for distributed training also seems like a bad investment, since it's the same price as a 3090 for half the VRAM. I can at least cope with reusing the 4070Ti in my personal system.
  • I should wait until the 7700XT/7800XT launches to maybe cause a bit of a price drop in 7900XTXs (realistically not).
  • I probably do not need a 5th GPU to add to the pile I've accumulated over the past couple of years. But I really do not want to be dependent on rentals. The money isn't (necessarily) a concern, but it's a pill to supervise any training through a rental.

If I do cave and it's a bad investment, I can always sell it or return it within 30 days (although, that *was* my plan with the 2060 when I needed something to debug a Vulkan-based engine of mine with, until it turned out my Steam Deck had the same issues as Nvidia cards).

Keeping the sinful thoughts at bay, I've been doing cleanup while I wait for LibriLight-6K to finish quantizing/phonemizing.

  • The implementation works under Windows smoothly, albeit having to fall back to the `local` backend.
  • I'm still trying to chase down what possible discrepancy exists between inferencing being subpar and the evaluation/validation outputs sounding decent. I swear I'll think I've fixed it, only for it to be a placebo. It's genuinely driving me insane. The validation output sounding fine *proves* it can inference; there's got to be an issue with the inferencing path itself.
  • Finetuning, even on Windows with a 2060 (ROCm on Windows was a paper launch for all intents and purposes), is proving to be favorable. However, I can only finetune the AR or the NAR exclusively due to VRAM limitations, at float32, but 6GiB is very doable.
  • I still need to fix the phonemizer for Japanese. Feeding it kanji will just coerce it into thinking each character is "chinese letter" in English and phonemize that. I neglected to remember that in my TorToiSe Japanese tests I had to use `pykakasi` to romanize the text, and it seems the phonemizer can process romaji with the segments backend, but the outputted phonemes don't seem consistent with the IPAs from English + espeak (a rough sketch of the romanization step follows this list).
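For reference, a minimal sketch of that romanize-then-phonemize flow; the `pykakasi` usage is its current API, but the phonemize call assumes espeak-ng ships its (limited) `ja` voice, and whether its output lines up with the English espeak IPA set is exactly the open question above:

```python
import pykakasi
from phonemizer import phonemize

kks = pykakasi.kakasi()
text = "音声合成のテストです"

# Romanization path (what the TorToiSe Japanese tests used): kanji/kana -> Hepburn romaji.
romaji = " ".join(item["hepburn"] for item in kks.convert(text))

# Direct phonemization path; assumes the espeak-ng `ja` voice is installed, and
# the resulting symbols may not be consistent with the English espeak IPA set.
ipa = phonemize(text, language="ja", backend="espeak", strip=True)

print(romaji)
print(ipa)
```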

Should have forced myself to use the downtime as a brief break, but the unease will push me to keep working anyways. Oh well.

I've been theorizing the next course of action with training, and I think I'll just resume training (from the previous weights; I don't think Frankenstein-ing datasets of the past will be an issue) with the content-editing SpeechX task enabled (alongside base TTS) and the full LibriLight-6K. These two in tandem should help bolster the model's ability to generalize and keep it from being overtrained.

  • like what emerged from slotting in the donated audiobooks, where each speaker had a ton more utterances, this should help with the baseline linguistics of the model itself.
  • I don't think I should be fretting so much over the zero-shot capabilities, as I've pretty much proven the model already does a good job at it in the evaluation/validation outputs; it's just that those input prompts can be a bit flawed.

As for my decision with just using the content editing SpeechX task:

  • All of the other tasks require waveform merging. I feel it doesn't really add much time to the throughput to decode+merge+encode, as the dataloaders process things in the background, but I also feel it might actually be taxing on VRAM, even though realistically it's content editing that's the most taxing, as it's three utterances combined.
  • All of the other tasks aren't necessarily beneficial to the end user. Denoising audio already exists, and a user isn't going to have dirty audio to work with. Sure, it's still a neat feature to include in a final model, but it's not worth the extra time for most use cases. However, I feel it might help generalize the model more and keep it from being overtrained.
  • Content editing *is* beneficial to both the end user and the model itself, as the model will still be able to produce guided generations from given pre/post acoustics, and I feel users are mostly going to use this anyways to make le vidya character say a different word YTP-style, instead of just generating an entire new utterance with different prosody.

I suppose I'll go back and try and benchmark my 6800XT to get the best ROCm performance possible out of it before I make any purchasing decisions.


I managed to get pytorch2.1.0+rocm5.5 working on my 6800XT, but not rocm5.6 (it segfaults with both the nightly and the precompiled `python-pytorch-opt-rocm` from the AUR).

With apples-to-apples settings:

  • bs=8, float32, local backend: the 4070Ti is ~22% faster than the 6800XT.
  • bs=8, float16, DeepSpeed (local is not stable): the 4070Ti is way faster, probably because it has better (utilization of) float16.
  • bs=8, float16, DeepSpeed quantization/compression training: the 4070Ti is way, way faster, probably because it has better (utilization of) int8.

It's not even really worth trying to increase the batch size on the 6800XT to close the gap, and it's not feasible to gimp the 4070Ti into training at float32. I suppose it's better to compare AMD vs Nvidia with a 7900XTX. Bleh.

Author
Owner

Additionally, while trying to make `recurrent_forward` work, I think I finally managed to fix the issue with inferencing. It seems `chunkwise_recurrent` does in fact work and was actually being used, and it was not only harming the output but also performance. Consistency seems to be boosted, but there are still a few hiccups.

My 4070Ti is able to top out at an orgasmic 105it/s, while the 6800XT barely peaked at 40it/s at float16. At float32, the 4070Ti peaked at 80it/s with throughput dropping to 60it/s, while the 6800XT maintained a constant 34it/s.

I'm going to do more inference tests to validate that this did in fact fix everything, but my test inferences are working.

On the other hand, I did take a crack at making use of `chunkwise_recurrent`, and I don't think there's an elegant way to do it, unless I'm just stupid and sampling the logits the wrong way. The output is destroyed no matter what I try.

Author
Owner

I think I've got everything I wanted done before the next training session, so I can just leave the GPUs (yes, plural) training and shut up for a while (or at least not overwork myself).

  • LibriLight-6K processing is complete. I'll have to come back with the numbers, but there's now 6,000,000 samples in my dataset, and an estimated 10K hours of audio.
    • I did a deduplication test and there's only one speaker+book pair overlapping, and it's not really worth culling that one pair from the dataset.
    • I did re-check the [LibriLight repo](https://github.com/facebookresearch/libri-light/blob/main/data_preparation/README.md#downloading-the-data) and it mentions there being 4.5K hours of possibly duplicated books. I imagine this just means multiple speakers reading the same book, rather than each book appearing in the dataset only once read by one speaker. I'll probably take a crack at that too, as it's the next step that isn't the 52K hours in the `large` dataset that I really don't want to touch right now.
  • I've cleaned up and provided some helper scripts to directly process LibriTTS (leveraging the transcriptions it provides) and to prepare LibriLight (getting the files named similarly to LibriTTS) for transcription and processing. I used to have half-assed scripts for similar tasks from several months ago, but they were for much narrower datasets.
  • I'm using unsliced LibriTTS-R, as I think my copy is too mucked up. I *thought* I had a backup from before I haphazardly dumped a bit of LibriLight into it, but it seems some speakers' folders are empty. It's probably for the best to use actual, canonical transcriptions and full slices for them, but sadly they're only 550 hours out of `[unknown dataset size]`.
  • I've cleaned up the dataloading step to be as fast as it can be. I don't have exact numbers, but my now 6,000,000-sample dataset can be loaded in ten-ish minutes? Although, I'm cheating a little bit:
  • when creating an HDF5 dataset (or running `python3 -m vall_e.data --action=metadata`), a helpful `metadata.json` will be generated in each speaker folder: a JSON where the keys are the IDs and the values are the duration/phoneme lengths. This helps speed up the validation step (culling data not within a requested size) a shit ton, as even querying the HDF5's attributes takes some time, and when not using an HDF5 dataset, the phonemes/quants have to be loaded to query this (see the sketch after this list).
  • I've let the demons win: I caved and ordered a 7900XTX, and I suppose I'll be the guinea pig and have my metrics (I should be fine in the financial department; I don't wanna have to change my mind and accept donations).
    • pytorch2.1.0+rocm5.5 (`pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.5`) is the sweet spot, as pytorch2.0.1+rocm5.4.2 lacks bfloat16 and is slower (as evidenced with the 6800XT, I suppose), and pytorch2.1.0+rocm5.6 still segfaults, no matter which nightly I install OR when using the [AUR's precompiled copy of pytorch2.0.1](https://archlinux.org/packages/extra/x86_64/python-pytorch-rocm/) (I'm currently trying to compile it again, but it ***just*** threw a `==> ERROR: A failure occurred in package_python-pytorch().` with no helpful info).
    • benchmarking training against my 4070Ti with apples-to-apples settings (bs=8, bfloat16, DeepSpeed quantization/compression), the 7900XTX is actually slightly faster than the 4070Ti. I hope that with rocm5.6 actually working there'd be even more speed gains to be had. I'm just a bit concerned, since the 7900XTX *has* to be in slot 1 (I do not want to gamble with PCIe slot shenanigans) and it's peaking at 77C, while the 4070Ti is still at a comfy 50C.
    • since they're practically at the same speed, I wonder if I should bother trying to distribute training across them (somehow), or just have the AR trained on one and the NAR trained on the other.
    • sadly, inferencing tests aren't fruitful (I don't know why; maybe CUDA-based `@torch.inference_mode()` has better optimizations). The 7900XTX tops out at 40it/s compared to the 4070Ti's peak of 110it/s (which drops off rather fast to 75it/s). I imagine getting `recurrent_forward`/`chunkwise_forward` working would close the gap, as I had accidentally left `recurrent_forward` enabled and the 7900XTX had better throughput (110it/s in the prefill, 90it/s after).
    • in reality, I think spending $200 more for double the VRAM is a bit silly for something that was not guaranteed to work at all, but in hindsight I could've just bought one in the past and then bought another one later for better distributed training. I just realized that training one model per GPU might be a pill, as I won't be able to get proper evaluation/validation outputs when only one model is loaded and not both. I suppose I can do some tricks to load the other model but keep it on the CPU until needed, and if it works out right, it can load the latest copy of the other model.
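As an aside, the metadata trick mentioned above amounts to something like the following sketch; the folder layout and field names here are illustrative assumptions, not the repo's exact schema:

```python
# Illustrative sketch: cull samples by duration using per-speaker metadata.json
# files, without having to open the HDF5 or load the phonemes/quants themselves.
import json
from pathlib import Path

def load_candidates(root: str, min_dur: float, max_dur: float) -> list[str]:
    """Return speaker/sample IDs whose duration falls inside [min_dur, max_dur] seconds."""
    keep = []
    for meta_path in Path(root).glob("*/metadata.json"):
        metadata = json.loads(meta_path.read_text())
        for sample_id, info in metadata.items():
            # Assumed fields: {"duration": seconds, "phonemes": phoneme count}
            if min_dur <= info["duration"] <= max_dur:
                keep.append(f"{meta_path.parent.name}/{sample_id}")
    return keep

# e.g. candidates = load_candidates("./training/data", 1.0, 8.0)
```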

I just hope my efforts and impulsiveness pay off. At worst, I'm 2xing my training throughput, although I wonder if I should have done this earlier. I hope I can get some peace of mind now, although the smell of a new GPU is plastered onto my face (better than the smell of not-so-clean gas mysteriously clinging to me like yesterday).

I suppose I'll follow through and just let things train and shut up for a few days now.

Author
Owner

mmm... I think it's foolish to continue running training on the existing weights.

  • even before, with the rental 4090s/3090s, the metrics never improved; they just waver within a range no matter how long I continue training at a low LR (down to 1.0e-7). Especially when training each model on its own GPU, any improvement in output quality is probably just placebo, as there are still some inconsistencies.
  • I imagine I would need to scale up the model size if I want to try and close the gap. Even the quarter-sized model seems to have stagnated, although again I feel I'm rather impatient with it.
  • It could also just be a constraint of the RetNet; I'm not sure if the attention-based models would fare better. I imagine not, since this is definitely the best I've gotten so far.
  • Sunday I ran into two problems: heat and power. Although I'm not too sure why this is a problem now, as I didn't have any issues in April/May when using the two 6800XTs, and they were quite the power hog and heat dump.
    • very long story short, I'm hitting the limitations of training locally in terms of power draw and heat. I'll need to re-evaluate things, since even though it's not as hot outside as it was Sunday (consistently 110F throughout the day), the 7900XTX is still causing problems.
  • trying to squeeze more juice out of the weights by training the SpeechX tasks is very foolish; it's such a problem child to train with how sloppy my solution is, as the speech editing tasks consistently trigger OOMs even at very low batch sizes. Even when it just hits that task, the training speed noticeably plummets.

I don't know. I feel fate is telling me there's no point in trying to continue training with these weights. I just don't understand, since the models are using the same dimensions and what-not as the paper at full size.

  • I refuse to believe it's simply a matter of the dataset not being large enough. I effectively doubled my dataset size with both the full LibriLight-6K *and* untrimmed (or rather, further-trimmed) LibriTTS + its canonical transcriptions, and neither the full-sized models *nor* the quarter-sized models have budged a hair.
  • I simply do not think I need to increase the input prompt length for training beyond 3 seconds. That shouldn't matter at all for getting consistent output, and the cloning is rather consistent for a given input prompt.
  • I'm in denial about it being a RetNet issue, since it was already outperforming the attention-based transformer models. The only big difference maker would be the base implementation having sinusoidal position embeddings (and the RetNet has xpos or an analogue to it, so there is positional embedding, which shouldn't matter at all), as the attention/retention mechanisms shouldn't be so imperative for the last few percentage points of quality.

I suppose it's what I feared: not knowing what to do when it gets to this point. I suppose I'll chip away at processing the `duplicate` ~4K hours of the LibriLight dataset, for I think 800 more speakers? And by the time that finishes in another week, I'll evaluate the logistics of committing to increasing the model size, probably only the AR, to 20 layers.


I'm just posting to inform you that vast.ai is just a nugget for GPU cloud, often 3x cheaper than runpod for 3090/4090/A40.
The trick is to activate "Unverified Machines" (some machines may have a problem at first, but it happens rarely).
However, bandwidth also has a cost, determined by the vendor, so some machines offer it for free, and it's easy to find very good deals.
Like here I see 4x 3090 at $0.496/hr, or $0.678/hr if you want free bandwidth.

Author
Owner

I'm just posting to inform you that vast.ai is just a nugget for GPU cloud, often 3x cheaper than runpod for 3090/4090/A40.
The trick is to activate "Unverified Machines"

Ah I see, I didn't notice that every time I went to do price comparisons.

However, bandwidth also has a cost

mmm, seems to be a bit of a monkey's paw. I'm already sifting through them, and a lot of them "recoup" the low price by having high upload/download costs per TiB.

I don't have to constantly sync and back up the training states, but the dataset's HDF5 alone is already at 100GiB, even with lz4 compression. However:

  • I can always recreate the dataset with the second half of the RVQ bins removed (as I'm only training up to 4 RVQ bins); a rough sketch of this follows the list.
  • I recall gzipping an early copy of the dataset at ~59GiB brought it down to 19GiB, somehow. It's just the issue of having enough disk space allocated to download and unzip, as that was the trickier part on runpod.
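
As a rough, hypothetical sketch of that first bullet, this is roughly what trimming the stored codes down to 4 RVQ levels could look like. The "codes" dataset name, the (frames, levels) layout, and the gzip filter are all assumptions here, not the repo's actual schema:

```python
# hypothetical sketch: copy an HDF5 dataset while keeping only the first 4 RVQ levels.
# The "codes" dataset name and (frames, levels) layout are assumptions about the schema;
# gzip is used as a stand-in compression filter.
import h5py

with h5py.File("dataset.h5", "r") as src, h5py.File("dataset-trimmed.h5", "w") as dst:
    def copy(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return
        data = obj[()]
        if name.split("/")[-1] == "codes" and data.ndim == 2:
            data = data[:, :4]  # keep only the first 4 RVQ levels
        dst.create_dataset(name, data=data, compression="gzip")

    src.visititems(copy)
```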

I'll keep that in mind if push comes to shove and I need to go back to taking the rentpill.

>I'm just posting to inform you that vast.ai is just a nugget for GPU cloud, often 3x cheaper than runpod for 3090/4090/A40. > The trick is to activate "Unverified Machines" Ah I see, I didn't notice that every time I went to do price comparisons. > However, the price of bandwidth also has a cost mmm, seems to be the bit of a monkey paw. I'm already sifting through them and a lot of them "recoup" the low price of having high upload/download per TiB. I don't *have* to constantly sync and backup the training states, but the dataset's HDF5 alone is already at 100GiB, even with lz4 compression, but: * I can always recreate the dataset with the second half of the RVQ bins removed (as I'm only training up to 4 RVQ bins). * I recall gzipping an early copy of the dataset at ~59GiB brought it down to 19GiB, somehow. It's just the issue of having enough disk space allocated to download and unzip, as that was the trickier part on runpod. I'll keep that in mind if push comes to shove and I need to go back to taking the rentpill.
Author
Owner

I think I've got a good handle on the electrical-related issues after painful trial and error and isolation over the past few days. Turns out there's quite the rabbit hole that I just so happened to be ignoring. As a bit of a redpill:

  • in Windows, a 6800XT with dual monitors at mismatching refresh rates will idle at 60C (ΔT over ambient, ~43C), and this is apparently within spec. Locking both to 60Hz will have it idle at 45C, still terrible. A 2060 will idle at 40C, still not bad, but not favorable.
  • a Ryzen 7600X has an iGPU, and it idles at 35C.
  • I'm a bit hopeful the Vega 56 in my server (that's literally only used because a Threadripper 1920X needs a GPU to boot) idles fine, given the 7900XTX is actually idling headless at 23C (ΔT over ambient 3C).

Aside from that, I'm going to have the 4070Ti take a crack at transcribing LibriLight's duplicate dataset before doing tests with extending the model to more layers. It'd be a good time to try and take it easy again, but not so completely devoid of work that it'll eat at me for wasting a week.

The more important reasons I'm writing an update:

  • pytorch2.1.0+rocm5.6 seems to finally work now, mostly. Despite power limiting the 7900XTX, it's able to get an inference throughput of ~80it/s with dtype=bfloat16 and the local backend. It's not quite at parity with the 4070Ti, but it's still impressive. I'll need to do proper benchmarks.
    • mostly, since I'm getting an error when initializing DeepSpeed because of NCCL:
```
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, internal error, NCCL version 2.16.5
ncclInternalError: Internal check failed.
Last error:
MSCCL Internal Scheduler: open algorithm directory /home/mrq/Programs/ai-voice-cloning/venv-3.11-rocm5.6/lib/python3.11/site-packages/torch/lib/msccl-algorithms failed
```
  • I was going to say having your dtype=bfloat16 (or float16) really impacts the cloning accuracy (the speech is fine, but it's not going to do a good job at cloning), but I can't seem to replicate it now, and it seemed only incidental when I swapped between the two. The input prompts are kept the same, as I'm using the native CLI rather than the web UI to automagically create a random input prompt. I'll need to look into this more and verify, since it seems like it's placebo that I'm getting better quality with float32 and weird output with bfloat16 (despite training being practically quantization-aware, so there shouldn't be that much of an accuracy hit).
    • I was going to provide samples of this, but again, I can't replicate it, so it wouldn't be accurate to provide samples.

Back into my cave I go for the next few days, I hope I can get better results with increasing the layer size without it being that much of a detriment to iteration rates OR VRAM consumption.


It looks like the original vall-e model used ~140B parameters. That can't fit into a 4070, can it? So are you using a smaller model size? Does size: "full" correspond to the original paper model size?

Author
Owner

It looks like the original vall-e model used ~140B parameters.

Where'd you get that number from? The papers (VALL-E, VALL-E X, SpeechX) don't mention a parameter count anywhere.

NaturalSpeech2's paper (https://arxiv.org/pdf/2304.09116.pdf) on page 19 mentions its parameter counts between every component, and twice mentions "scaling [...] to 400M parameters".

Voicebox's paper (https://arxiv.org/pdf/2306.15687.pdf) on page 9 mentions 330M parameters for the audio model (the difference being 24 layers instead of 16 layers, and some funny connections between layers).

Does size: "full" correspond to the original paper model size?

From the paper, full corresponds to the original dimensions of the model:

Both the AR model and the NAR model have the same transformer architecture with 12
layers, 16 attention heads, an embedding dimension of 1024, a feed-forward layer dimension of
4096, and a dropout of 0.1.

which yields the AR and NAR having ~200M parameters each with the RetNet (I need to check the parameter count for the attention-based transformers). The feed-forward dimension is marked as 4 * embedding dim.
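
As a rough sanity check on that ~200M figure, here's a back-of-the-envelope sketch that only counts the per-layer projection weights; embeddings, norms, and RetNet-specific gating are ignored, so it undershoots a bit and is an approximation rather than the repo's exact count:

```python
# rough parameter estimate for the "full" dimensions quoted above; an approximation,
# not the repo's exact count.
def approx_params(n_layers: int = 12, d_model: int = 1024, d_ffn: int = 4096) -> int:
    attn = 4 * d_model * d_model   # q/k/v/out projections per layer
    ffn = 2 * d_model * d_ffn      # feed-forward up + down projections per layer
    return n_layers * (attn + ffn)

print(f"~{approx_params() / 1e6:.0f}M")  # ~151M; embeddings and extra projections push it toward ~200M
```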

Now, the general consensus (or rather, what G**gle results mentioned) says that widening a model (increasing the dimensions) is better than increasing the layer count, although seeing Voicebox increase the layer count seems to run contrary to that nugget of wisdom. It just sucks, because this means I might have to conduct two experiments to pick a wider model over a deeper model.


Now, the other day, it crossed my mind whether it would be better to either provide a transcription of the input acoustic prompt, and/or have it try and "continue" off from a given input acoustic prompt, to try and help steer it into more natural output. Coincidentally enough, while checking the paper for whether it mentioned a parameter count, there's this little detail:

In the AR model, we do not explicitly extract an audio clip as the prompt in training. The training process is pure casual language model training. In this way, any prefix sequence c<t,1 is treated as a prompt for the latter part of the sequence c≥t,1. During inference, given an enrolled recording, we should concatenate the phoneme sequence of the enrolled recording and the phoneme sequence for synthesis together. Meanwhile, the acoustic token sequence of the enrolled recording is used as the prefix in AR decoding, as formulated in equation 1. We will study the superiority of this setting in the experiment.

In other words, the paper mentions two "modes" to sequence with, and it seems the original implementation I forked from enhuiz/vall-e didn't really take this into account.

VALL-E: Our main interest is to generate given content for unseen speakers. The model is given a text sentence, a segment of enrolled speech, and its corresponding transcription. We prepend the transcription phoneme of the enrolled speech to the phoneme sequence of the given sentence as the phoneme prompt, and use the first layer acoustic token of the enrolled speech c˜:,1 as an acoustic prefix. With the phoneme prompt and the acoustic prefix, VALL-E generates the acoustic tokens for the given text cloning the voice of this speaker.

Base VALL-E will have its sequences laid out like so:

```
<target text phonemes><sep><reference audio prompt><sep><response audio>
|-------------------- crafted input -------------------||--- output ---|
```

VALL-E-continual: In this setting, we use the whole transcription and the first 3 seconds of the utterance as the phoneme and acoustic prompts respectively, and ask the model to generate the continuations. The inference process is the same as setting VALL-E, except that the enrolled speech and the generated speech are semantically continuous.

On the other hand, this mode sequences like so:

```
<transcript of response audio><sep><three seconds of the response audio><the remainder of the response audio>
|---------------------------- crafted input ---------------------------||---------- target output ----------|
```

where inferencing can be done as:

```
<transcription of reference><phonemes of desired output><sep><reference audio><response audio>
|------------------------------- crafted input ------------------------------||--- output ---|
```

The original enhuiz/vall-e implementation never took this into account, so I never thought it was that necessary. The lifeiteng/vall-e (and naturally, the Plachtaa/VALL-E-X implementation, as it's a fork) has a prefix mode that I admittedly never looked much into since the documentation is rather rough, so it might take this into account, but I can't say for sure.

Now, whether implementing a continual mode is all that imperative, who knows. The paper has comparison test scores between the modes and, while the word error rate is much lower for VALL-E continual, the speaker similarity score was lower, so it seems I shouldn't really bother with this, as I care more about speaker similarity than reducing the WER.

It shouldn't be too much effort to add it in, and even inferencing in this "mode" requires no code change, just putting the transcription of your input acoustic prompt before your desired output phonemes. I just feel like training with this mode in mind isn't going to amount to much of anything.
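
To make the two layouts concrete, here's a toy sketch with symbolic tokens; the `<sep>` marker and these helper names are placeholders of my own, not the repo's actual tokens or code, and the ~75 frames-per-second figure is just EnCodec's frame rate at 24kHz:

```python
# toy illustration of the sequence orderings described above, using plain Python lists of
# symbolic tokens; "<sep>" and the helper names are placeholders, not the repo's real ones.
SEP = "<sep>"

def base_sequence(target_phonemes, prompt_codes, response_codes):
    # <target text phonemes><sep><reference audio prompt><sep><response audio>
    return target_phonemes + [SEP] + prompt_codes + [SEP] + response_codes

def continual_training_sequence(transcript_phonemes, response_codes, prefix_frames=3 * 75):
    # <transcript of response audio><sep><first ~3s of response audio><remainder of response audio>
    # (assuming ~75 EnCodec frames per second)
    return transcript_phonemes + [SEP] + response_codes[:prefix_frames] + response_codes[prefix_frames:]

def continual_style_inference_input(reference_transcript_phonemes, target_phonemes, reference_codes):
    # <transcription of reference><phonemes of desired output><sep><reference audio>
    # ...and the model continues the sequence with <response audio>
    return reference_transcript_phonemes + target_phonemes + [SEP] + reference_codes
```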


It looks like the original vall-e model used ~140B parameters.

Where'd you get that number from? The papers (VALL-E, VALL-E X, SpeechX) don't mention a parameter count anywhere.

NaturalSpeech2's paper on page 19 mentions its parameter counts between every component, and twice mentions "scaling [...] to 400M parameters".

Voicebox's paper on page 9 mentions 330M parameters for the audio model (the difference being 24 layers instead of 16 layers, and some funny connections between layers).

Does size: "full" correspond to the original paper model size?

From the paper, full corresponds to the original dimensions of the model:

Both the AR model and the NAR model have the same transformer architecture with 12
layers, 16 attention heads, an embedding dimension of 1024, a feed-forward layer dimension of
4096, and a dropout of 0.1.

which yields the AR and NAR having ~200M parameters each with the RetNet (I need to check the parameter count for the attention-based transformers). The feed-forward dimension is marked as 4 * embedding dim.

Now, the general consensus (or rather, what G**gle results mentioned) says that widening a model (increasing the dimensions) is better than increasing the layer count, although seeing voicebox increase the layer count seems to be the contrary to that nugget of wisdom. It just sucks, because this means I might have to conduct two experiments to pick a wider model over a deeper model.

I had gotten that from here (https://www.rapidops.com/ai-tracker/vall-e/), but I think, yeah, it is just plain incorrect and probably closer to the number you gave.

On wider versus deeper, it's kind of random, but if you have a deep "enough" model, you'll want to increase width instead.

Author
Owner

I had gotten that from here, but I think yeah it is just plain incorrect and probably closer to the number you gave.

Seems like someone ran someone else's article (and not the paper itself) through a weak LLM to summarize, given how littered it is with hallucinations (the parameter count, the second limitation being an outright lie, the given use cases being very milkytoast [sic], the acronym for VALL-E it hallucinated isn't even a proper acronym, etc).

On wider versus deeper, it's kind of random, but if you have a deep "enough" model, you'll want to increase width instead.

I suppose I'll just wing it with increasing the layer count and hope for the best, since:

  • the embedding / feed-forward dimensions are the same for NaturalSpeech2 and VoiceBox (don't remember if this is also true with TorToiSe), while they all (TorToiSe too) have higher layer counts than VALL-E's twelve.
  • all the given presets (quarter/half/full) stick to twelve layers regardless, while heads and dims are scaled instead.

Playing around with encodec encoding + vocos decoding. As good as vocos is, it still gives some minor audio artifacts for higher-pitched voices. This puts an upper bound on the quality of the model, no? Maybe that can be fixed by some minor postprocessing?

Also, reading through the paper more, the split between AR and NAR seems inelegant. They want it for inference speed. Why not just use an AR with all the codebooks, keep the inference speed slow, but then distill the model later to increase inference speed?


Another question: how are you plotting your loss curves etc? Was going to write some code for it, but looks like you were producing them somehow. Maybe I missed them in the repo.

Author
Owner

Playing around with encodec encoding + vocos decoding. As good as vocos is, it still gives some minor audio artifacts for higher pitch voices. This puts an upperbound on the quality of the model, no? Maybe that can be fixed by some minor postprocessing?

I imagine that's more of a bandwidth limitation that can only really be """solved""" with increasing how many RVQ bins the NAR outputs. Although, I only did a brief dive into how much additional quant-levels matter for more complex voices, rather than for voices outside a normal range.

Also, reading through the paper more, the split between AR and NAR seems inelegant. They want it for inference speed. Why not just use an AR with all the codebooks, keep the inference speed slow, but then distill the model later to increase inference speed?

A couple of things (apologies if it seems a little disheveled; I'll have my thoughts primed, and the moment I try to express them they just fade and need to be milked out).

  • MusicGen's paper (https://arxiv.org/pdf/2306.05284.pdf) mentions a couple of different patterns for interleaved sampling, where each one has its pros and cons. If I implement it right in this hypothetical new AR, I shouldn't have to worry about retraining from scratch, just brief "retunes", but this is pure speculation given how the embeddings are implemented and how they factor in which level each code belongs to (assuming it works really well).
  • A lot of added complexity. I remember diving into MusicGen's code to see how "easy" it would be to pivot to a different interleave pattern, and my brain melted from how much care is needed to interleave/de-interleave. desu, it's a pain to get right; I remember from my rasterization engine work in OpenGL/Vulkan that being able to seamlessly pivot between interleaved/de-interleaved vertex attributes is a nightmare to handle elegantly, so it's expected in this scope.
  • I think, at least given the current "dimensionality" of the models, I don't think model distillation is fruitful.
    • Sure, I imagine this is the driving principle behind SpeechX being able to bolster the core zero-shot TTS task, as it generalizes the model, but even with that approach it's just not worth the huge overhead in training.
    • The NAR is sort of already evidence of the current model dimensionality not being good enough. Before I upped the RVQ bins it tends to from one to three, it was having an accuracy of ~75%, but it dropped to ~63% and is showing no room for improvement (alongside the AR being effectively capped at ~82%). I don't expect it to fare well if I were to make the AR tend to more than 1 RVQ bin.
    • As a counter to the above, though, MusicGen also mentions it uses 24 layers (same dimensions/heads), so I suppose the extra 12 layers should house enough parameters to help the model generalize to more than one RVQ layer, but at the same time, this can just be applied to the normal model.
    • As another counter-point, I haven't actually tried distilling the NAR back to only tend to one RVQ bin and seeing if it picks back up, so who knows.
  • I'm probably extremely biased, but I've grown to like how it's pretty much a "mixture-of-experts" in terms of having multiple models rather than one conglomerate of weights.
    • being able to train both models together or one at a time offers flexibility when training "at home", especially now for me where I can have one model training at its own pace on my 4070Ti, and have the other trained at its own pace on my 7900XTX. One model might benefit from a certain dimensionality, while the other can be kept smaller to keep things lighter. It just leaves room for dialing the knobs better with the modularity rather than a monolithic approach.

I'll still keep it in mind as an avenue for improvement. MusicGen's paper shows interleaving works, regardless if music is more "complex" to try and model versus it being able to leave a little more room for error. I just feel it should be something to explore later down the road, like when I can get RetNet's dedicated recurrent/chunkwise sampling sounding right (more on this later) to help reap the benefits of a pure AR.

Another question: how are you plotting your loss curves etc? Was going to write some code for it, but looks like you were producing them somehow. Maybe I missed them in the repo.

This repo (mrq/ai-voice-cloning) repurposes the training tab for TorToiSe/DLAS's metric graphs, but I have a myriad of gripes (kludge, mostly) with that approach, so I'll shy you away from using it.

Now, I was doing some housekeeping with the mrq/vall-e repo itself and stumbled upon this under ./scripts/plot.py, but it needs a bit of rework.

It's a bit rustic, and I'll see about cramming it into the main package and having it derive most of everything from a provided YAML, but to invoke it now, it's:

```
python3 ./scripts/plot.py --log-dir ./training/valle/logs/1693675364 --out-dir ./data/ --xs=ar.engine_step --ys=ar.loss
```
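
If you'd rather roll your own in the meantime, here's a rough, hypothetical sketch (not the repo's plot.py) that assumes the trainer logs on disk contain the `Training Metrics: {...}` JSON lines the trainer prints, and plots one field against another:

```python
# hypothetical alternative to scripts/plot.py: scrape "Training Metrics: {...}" lines out of
# trainer logs and plot one metric against another; the log file layout is an assumption.
import json
import re
from pathlib import Path

import matplotlib.pyplot as plt

def load_metrics(log_dir: str) -> list[dict]:
    metrics = []
    for path in sorted(Path(log_dir).rglob("*.log")):  # log file naming is an assumption
        for line in path.read_text(errors="ignore").splitlines():
            match = re.search(r"Training Metrics:\s*(\{.*\})", line)
            if match:
                metrics.append(json.loads(match.group(1)))
    return metrics

def plot(metrics: list[dict], x: str = "ar.engine_step", y: str = "ar.loss", out: str = "loss.png"):
    points = [(m[x], m[y]) for m in metrics if x in m and y in m]
    xs, ys = zip(*points)
    plt.plot(xs, ys)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.savefig(out)

plot(load_metrics("./training/valle/logs/1693675364"))  # mirrors the invocation above
```
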
Author
Owner

As an aside update:

  • I've resumed training on my 7900XTX with the AR expanded to 24 layers instead of 12. No hitches so far.
    • I should have gone with 18 layers instead to babystep, but might as well go big.
    • Simply exporting the weights and loading the state dict un-strictly worked; the remaining layers are randomly initialized (a toy illustration follows this list).
    • The loss didn't seem to change all that much, which is a bit of a concern for a number of reasons.
    • I think this might be a bad idea, as updating the weights for the remaining layers is now going to take a while, in theory. The optimizer should be wise enough to quickly tune the second half of the parameters, I hope.
  • I've crammed in "VALL-E continuous" as a task (tts-c) just to help try and mix up the data. It doesn't seem to offer much of a difference in terms of loss/accuracy and output, though. I do need to extend the inferencing to allow for this guided mode as well, but it's low priority.
  • LibriLight's duplicate 4.5K hours of data is still transcribing off my 4070Ti. Extenuating circumstances had me wary of having any GPU workload for the past few days, but things seem fine now. I've also done some housekeeping in my current dataset of culling a lot of useless voices I've had in the beginning, and removing speakers from LibriTTS-R that are in my LibriLight-6K, as I felt it would be better to try and make an "epoch" of visiting every speaker smaller.
  • I've taken another crack at trying to get RetNet's recurrent forward working again, and it no longer outputs pure garbage. However, it has the acoustics right, but it does not sound right at all. I suppose it's a similar issue to how I can't utilize chunkwise recurrent sampling, as the accuracy is harmed drastically. Oh well.
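
And for the un-strict state-dict load mentioned above, here's a minimal, self-contained illustration using a toy stack of stock PyTorch layers as a stand-in for the actual AR; it's only meant to show the mechanism, not the repo's model class:

```python
from torch import nn

# toy illustration of "grow from 12 to 24 layers": load the shallower model's state dict
# non-strictly, so matching layers are copied and the appended layers keep their random init.
def make_stack(n_layers: int) -> nn.Module:
    return nn.Sequential(*[
        nn.TransformerEncoderLayer(d_model=1024, nhead=16, dim_feedforward=4096, batch_first=True)
        for _ in range(n_layers)
    ])

shallow = make_stack(12)
deep = make_stack(24)

# strict=False: keys "0." through "11." load from the old weights, while "12." through "23."
# are reported as missing and stay randomly initialized.
missing, unexpected = deep.load_state_dict(shallow.state_dict(), strict=False)
print(len(missing), len(unexpected))
```
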
Author
Owner

I suppose I have to train the AR from scratch if I want to increase the layer count. There's been no perceptible change in the model after training it for about a day.

  • I suppose my approach of "just glue layers to the end" was a naive and bad idea.
  • I imagine, instead, it would be better to interleave the new layers in between the old layers to fill the gap. It should be easy, but I don't think it's the right approach.

On the other hand, training the new AR from scratch seems to be very fruitful, at least metrics wise:

```
Training Metrics: {"ar.loss": 3.171875, "ar.lr": 0.0001, "ar.grad_norm": 0.46822504212200383, "ar.elapsed_time": 5.655715465545654, "ar.engine_step": 338, "ar.samples_processed": 43264, "ar.tokens_processed": 18642961, "ar.loss.nll": 3.171875, "ar.stats.acc": 0.6666419506072998, "ar.stats.precision": 0.0666641965508461, "elapsed_time": 5.655715465545654, "it": 338, "epoch": 0.007669706255781442}.
```
  • At a glance, it seems to already be at a good point after a couple of hours that, if I remember right, took maybe a day to reach with the previous on-spec model size.
    • On the other hand, those metrics were kind of flawed, as the total loss was incorrectly summing up the accuracy and precision metrics. If I were to go back and compare, I would need to exclusively look at the (misnamed) ar.loss.nll metric, and even then training might have been a bit flawed as I don't remember how long it took during that training run for me to correct that.
  • I'm also not sure if this could be a testament to me cleaning up the dataloader to be rather explicit (and also training with the tts-c task ahead of time), since the previous sampling method wasn't very explicit in what it was doing.
  • However, the output isn't very usable, even if it's at a loss/accuracy that previous ARs were usable at. I'm a bit curious as to how this is so, but I suppose this is more of a testament to it being able to better "clone" the acoustics for a given prompt/response, rather than it sounding right.

I'll let the AR train from scratch at this point and see how it fares before making more claims, such as:

  • whether or not I should bother training the NAR from scratch instead. I don't think it's necessary at the moment, since despite the NAR capping out at about ~62% accuracy, a strong AR will help guide it.
  • if I should bother straying from VALL-E's spec more and adopt MusicGen's AR model spec of dim=1536, heads=24, layers=48 for what the paper mentions is a 1.5B model, and an interleaved AR. I suppose the answer to model distillation is simply to make the model bigger rather than trying to cram a bunch of tasks into the model.
    • I suppose if I have nothing to do and have no drive to play vidya for this weekend, I'll see about incorporating MusicGen's interleave/de-interleave routines.
    • I'm just worried about whether I would even be able to train this, as not only would a 1.5B model (in attention-based transformer land) take up a lot of VRAM for the model itself, even with bfloat16 and quantization/compression, but the gradients needed for the model would also be astronomical, even for 24GiB of VRAM. It's not something I can train locally, I'm sure, unless I pivot to SGD instead of AdamW and cross my fingers it works out.
Author
Owner

I figured I wasn't going to do anything leisurely today besides sleep, so I elected to work on an interleaved AR. Initially, I was going to say it was very much not fruitful, but after doing the very simplest approach (just use codes.flatten() and codes.unflatten(...) to interleave/deinterleave), I managed to get the test trainer to output this: https://files.catbox.moe/5ynjcv.wav
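
For reference, here's a minimal sketch of that naive flatten-based interleave, assuming the codes tensor is laid out as (frames, levels); if it's stored the other way around, a transpose would be needed first, so treat the orientation as an assumption:

```python
import torch

# minimal sketch of the naive flatten/unflatten interleave mentioned above,
# assuming codes are laid out as (frames, levels).
def interleave(codes: torch.Tensor) -> torch.Tensor:
    # (t, q) -> (t * q,): all q levels of frame 0, then all q levels of frame 1, and so on
    return codes.flatten()

def deinterleave(seq: torch.Tensor, n_levels: int) -> torch.Tensor:
    # inverse: (t * q,) -> (t, q)
    return seq.unflatten(0, (-1, n_levels))

codes = torch.randint(0, 1024, (75, 4))  # ~1 second of EnCodec codes at 75 frames/sec, 4 RVQ levels
flat = interleave(codes)                 # (300,) tokens; the AR's sequence gets ~4x longer
assert torch.equal(deinterleave(flat, 4), codes)
```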

I am not happy with the crackle in the beginning. I am not happy with the inference speed effectively being reduced 4-fold. I do not expect RetNet's chunkwise sampling to save me as I'm pretty sure the next RVQ-bins depend on the previous one like the NAR does.

I did do another test trainer run and got: https://files.catbox.moe/t3uyvi.wav. There's no nasty crackle, so I suppose the model architecture is fine enough to further test on training real data against. But, I don't know. I'm kind of happy with the new-deepened AR already being at 72% accuracy despite only being 3% through an epoch's worth of data.

Author
Owner

mmm... I guess I'm due for a bit of an update.

  • 15% through my dataset's "epoch" (out of ~5640894 samples; I think I should have a better metric that calculates how many iterations it will take to ensure every sample per speaker is visited once, which I guess is rather easy to do).
  • peaked at 79% accuracy, with a local minimum of 75% accuracy.
  • loss averaging around 2.55.

And speech is emerging once more.

  • Slowly relative to real time. Compared to epoch time, this is rather fast.
    • If I were wiser I would have thrown some 3090s to try and speed things along, but I shouldn't really remark about borrowing hardware or paying for rentals at the moment.
  • I'm very hopeful about 24 layers despite the inference throughput being rather yucky.
    • I imagine the hopium-fueled future will be able to do things like pruning or sparse-whatevers to shrink the model size down, but I doubt it; all the neat LLaMA innovations / future innovations won't ever make it to this sphere.
  • I'm worried, though, that this is going to cap out eventually before the "epoch" is complete. I forget when the previous runs hit diminishing returns, but maybe it'll happen much later for 24 layers. However, I am skeptical of the accuracy metric, but at least I should have cleaner metrics all around.

As for experimenting with an interleaved AR: I'm not too sure if I should bother with it.

  • I worry it's not going to be worth investing in.
    • My existing pattern is rather naive, and even in the test trainer it's rather unstable.
    • Using padded patterns that MusicGen calls for requires a shit ton more VRAM for each sample, and grows as there's more RVQ bins.
    • I might get better results if I actually make use of the additional levels in the resp_emb, though. As it currently is implemented, n_resp_levels is fixed at being 1. Some clever positional embedding could work too alongside resp_emb.
  • I don't think a grand interleaved AR is going to provide strong results in comparison to the current AR + NAR orchestra.
    • I just don't think it's the right way to go about it. I feel the compute time put into a grand interleaved AR will be astronomical in comparison to just training an AR and NAR to similar results.
    • The NAR is already evidence that a model that attends to multiple RVQ bin levels isn't going to be strong in terms of accuracy.
    • I suppose I can test this by having an AR+NAR, where I do loss calculation for RVQ bin level 1, and the remaining levels do loss calculations for the NAR like normal. I can try this with my 4070Ti while it's idle.
  • I don't think model distillation with a grand interleaved AR would be fruitful.
    • It's pretty much predicated on being able to re-tune an AR into being a NAR. I imagine it would work for the interleaved AR, as it's aware of more than one RVQ-bin, but turning any of my previous ARs into one might not work out well. I can always experiment with this in my free time, as the 4070Ti is idle right now.
    • Again, I don't think it's worth the compute.

I don't know, once more. My 4070Ti is sitting idle, as I'm not too sure it has the VRAM for me to train a NAR at double-depth, but I suppose I can pivot to using SGD and a high LR and hope for the best, especially when a NAR has faster throughput rates compared to the AR.


However, I think, instead of an interleaved AR, I could experiment with training a model that handles both the tasks of the AR in an autoregressive manner, and the NAR in a non-autoregressive manner, as:

  • they both are trained in parallel anyhow.
  • the only fundamental difference is structuring the tensors for the response, and modifying the target sequence to train things right.

When I get a moment I suppose I can run the test mini-trainer to see how it fares, and if it works, then I suppose I can throw it on the 4070Ti to train at 24 layers.

Author
Owner

I feel rather silly.

I imagine the lifeiteng/vall-e implementation had the right idea with having an (almost) single model that handles both AR and NAR tasks. It's doable, and I like it as an option better than an interleaved AR approach. Some things to keep in mind:

  • A monolithic model requires the resp_emb to be split, where a dedicated one exists for AR tasks, and a dedicated one exists for NAR tasks. Without it, the model just won't perform properly. I'm not too sure why, as the provided MultiEmbedding should be able to handle this. I do wonder if the NAR tasks would perform better if there were a dedicated resp_emb per RVQ-bin level (a rough sketch of one way to split it follows this list).
  • Training the NAR already might require more iterations, as each training step will randomly pick a quant_level to train against for each sample in a batch. Training a dual model might require double the training time anyway, as I have to randomly decide between training for the AR or training for the NAR. I don't think I can have the forward pass procedurally decide which resp_emb to select (or at the very least, which weight for the embedding) and have the target sequence that the loss is computed against procedurally formatted for a given quant_level. Besides, it's probably for the better to have the first RVQ-bin level considered more than any single remaining RVQ-bin level, as the first level is rather important.
    • in reality, I think the existing transformer (attention-based or retention based) can double dip on what it "knows" from the other half of the model to further bolster it, so there isn't really a way to quantify the "tax" for this.
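As for what splitting the resp_emb would even look like, here's a minimal sketch; the class and argument names are made up for illustration, and the per-RVQ-level NAR embeddings are the "I do wonder" variant rather than what's actually in the code:

```python
import torch
import torch.nn as nn

class SplitRespEmbedding(nn.Module):
    """Hypothetical sketch: one response embedding dedicated to the AR task
    (RVQ level 0) and separate ones for the NAR levels."""
    def __init__(self, n_tokens: int = 1024, d_model: int = 1024, n_nar_levels: int = 7):
        super().__init__()
        self.ar_emb = nn.Embedding(n_tokens + 1, d_model)  # +1 for the stop token
        self.nar_embs = nn.ModuleList(
            [nn.Embedding(n_tokens, d_model) for _ in range(n_nar_levels)]
        )

    def forward(self, resp: torch.Tensor, quant_level: int) -> torch.Tensor:
        # quant_level == 0 selects the AR embedding; anything above selects
        # the embedding dedicated to that NAR level
        if quant_level == 0:
            return self.ar_emb(resp)
        return self.nar_embs[quant_level - 1](resp)
```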

I do like the idea, as:

  • one model means I won't have to have double the weights in memory, and won't have double the parameters to attend to with an optimizer.
  • this method can open up the way to model distillation, as a grand model can be distilled back down into an AR and a NAR separately.
    • I do wonder if I can do the reverse, but I doubt it, as I would need to pull the other half of the resp_emb out of thin air.
  • I do not have the performance penalties incurred from an interleaved AR.

However, I'm not too sure how it would perform, as I'm basically foregoing a "mixture-of-experts" approach with a monolithic approach. I'll need to evaluate which card would get which model to train, as I think I should pivot the double-deepened AR to the 4070Ti, and train the monolithic AR+NAR on the 7900XTX at an increased model dimensionality to make the parameter count higher (1536 dim / 24 heads / 48 layers) or something similar.

  • I would go back to rentoiding, but I cannot conceive of a way to easily transfer my dataset again.

Also, it seems that the provided attention-based transformer requires fewer parameters than a RetNet. I'm not really sure why it freed up VRAM when I pivoted to a RetNet.

Author
Owner

I think I'm pretty pilled on using a monolithic AR+NAR.

I was training a half-sized monolithic model on the side (also making use of [prodigy](https://github.com/konstmish/prodigy), major props), and even at 25% of the epoch processed, the AR-side of the model was already reaching ~73% accuracy, while the NAR side was looking a bit rough at ~44% accuracy, but that's expected. I don't have any samples since I forgot to meddle with the evaluation / validation routines to be aware of a monolithic AR+NAR (it seemed it was falling back to treating it like an AR), so I'll need to go back and yank out samples.

Now, I don't know if this is a testament to prodigy performing really well with getting it up to snuff, or a monolithic approach to the transformer (/ RetNet to pedantists) is what helps bolster the model.

I suppose the next few days I'll see about converting existing ARs into a monolithic approach.

  • I'll toy around with re-baking fresh prom_emb/resp_embs to the ~new~ way with my good full-sized AR weights.
    • It already seems rather fruitful: the AR side is at 80% accuracy, but the NAR side is at 20% accuracy.
    • It seems the best way is to freeze every parameter except the new prom_emb/resp_embs, train for a bit until the AR seems to be back up to par, then train again with the main transformer (/ RetNet) weights unfrozen, since it still needs to be re-tuned for NAR tasks (see the sketch after this list).
  • if this proves fruitful, then I can easily apply this with the double-deepened AR I'm still training on my 7900XTX.
    • I expect it to be fruitful, at the very least in a way that saves me time from having to re-train it from scratch.
    • I might even be able to take the embeddings from the full-sized AR after re-baking them and stitch them to the double-deepened converted AR to shortcut things.
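The freeze-then-unfreeze recipe mentioned in the list above, sketched out; the parameter-name matching is an assumption about the module tree rather than the actual attribute names:

```python
def freeze_all_but_embeddings(model):
    # stage 1: only the re-baked embeddings get gradients, so the transformer
    # weights stay put while the new prom_emb/resp_embs catch up
    for name, param in model.named_parameters():
        param.requires_grad = ("proms_emb" in name) or ("resps_emb" in name)

def unfreeze_all(model):
    # stage 2: once the AR side is back up to par, let everything train again
    # so the main weights can re-tune for NAR tasks
    for param in model.parameters():
        param.requires_grad = True
```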

I'm rather happy that I'm at a point where I can start stepping out of the previously established comfort zone and start toying with things more and more. I just hope that I can get around to figuring out how to implement the fancier sampling techniques like repetition penalties and what-not, since I don't have the luxuries of using huggingface wrappers like TorToiSe does for these things.


Also the double-deepened AR is rather fruitful too at 28% through the "epoch": [samples](https://files.catbox.moe/3ltuj5.7z). I only picked at some of the validation and I'm rather pleased. The only issue is that I wonder how many issues are from re-using my previous NAR as a supplement, since some samples over time felt a little too compressed in terms of range (where it sounds kind of muffled I suppose, no additional detail to help resolve any nuances, blah blah). I'm very pleased that it won't hit a flat wall in terms of loss/accuracy and will approach a loss of ~1.0 / what I believe is 90% accuracy.

Author
Owner

Before I go fuck off and let the models train for however long, a status update (in no particular order of importance):

  • I finally set up a HuggingFace Space: https://huggingface.co/spaces/ecker/vall-e
    • CPU inferencing is slow, but it's a nice way to demo it.
    • This also has an extremely streamlined gradio web UI to go with it.
  • I've implemented some fancy sampler options:
    • top-k / top-p sampling requires a lot of playing with, as it's very easy for it to go from producing usable output to ruining the output. It might be beneficial to have split settings between the AR and NAR.
    • repetition penalty (with length factoring) definitely is not so good for sequence models that have low token counts. Length decay factoring seems to be a good compromise to have it not generate repeat tokens so close to each other, but I need to play around with a good value for this (refer to the implementation's code for the formula; a rough illustrative sketch follows this list).
    • length penalty is rather silly for the AR since it just modifies the probability of the stop token from being sampled, and the AR seems rather decent at generating properly lengthed sequences.
  • converting an existing AR into a monolithic AR+NAR seems fruitful: the full-sized AR is going along fine ([samples](https://files.catbox.moe/ciq0xk.7z)), enough so that I opted to pivot the double-deepened AR to a monolithic approach, and so far it seems okay right now ([samples](https://files.catbox.moe/phcx6l.7z)).
    • A shortcut is to simply take the previous AR's state dict and reshape its resps_emb to shape[0] = 4, randomly initializing weights[1:] so the NAR's resps_emb can train better (gluing a NAR's resps_emb on is not helpful). It's probably better to not freeze any parameters so the main weights can be better trained for NAR tasks.
    • It might be beneficial to, instead, repurpose a NAR into a monolithic AR+NAR, as it's a bit of a pain to train NAR tasks in the first place.
    • metrics wise, it seems both models are ~81% accuracy with the AR, and ~54% with the NAR.
  • Updated models are being uploaded to my HuggingFace models repo, before I forget: https://huggingface.co/ecker/vall-e
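For the repetition penalty with length decay, here's a rough sketch of the general idea only; the actual formula lives in the implementation's sampler code, and the decay form below is an assumption for illustration:

```python
import torch

def repetition_penalty_with_decay(logits: torch.Tensor, previous: list[int],
                                  penalty: float = 1.3, decay: float = 0.3) -> torch.Tensor:
    """Sketch: penalize tokens that were generated recently, with the penalty
    weakening the further back the token last appeared."""
    logits = logits.clone()
    most_recent = {}
    for distance, token in enumerate(reversed(previous), start=1):
        most_recent.setdefault(token, distance)  # keep only the closest occurrence
    for token, distance in most_recent.items():
        factor = 1.0 + (penalty - 1.0) * (distance ** -decay)  # decays with distance
        logits[token] = logits[token] / factor if logits[token] > 0 else logits[token] * factor
    return logits
```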

I think I've crammed out everything I can think of. In my brief inference tests, whatever model I did end up testing seemed rather fruitful with short GLaDOS tests. Nothing fantastic, but it's definitely better than what I remembered.

I'll probably leave things be for another week as I think I overdid it again, so the 4070Ti is currently convert-training the monolithic full AR, while the 7900XTX is back to converting the monolithic double-deepened AR.


@mrq have you tried https://github.com/Plachtaa/vallex-webui ?

pretty decent

the author says they use https://github.com/lifeiteng/vall-e for the training code

with small modifications

Author
Owner

@mrq have you tried https://github.com/Plachtaa/vallex-webui ?
the author say they use https://github.com/lifeiteng/vall-e for training code

I gave it a cursory glance and I find it rather impressive, considering what I remember from the previous unofficial/homebrewed weights. I'll need to muck around with it more to test its capabilities, as I know my own models have quite a few issues that I've noticed so far.

I'll reserve my judgment from my biases towards the base implementation being a pain, and the web UI and documentation taking too much inspiration from Bark in how it goes about things. If it works, it works.

I am curious, though, what the dataset looks like. The "model card" doesn't give much information outside of it being a bog-standard full-sized AR and NAR (separate, proving my assumption wrong as I looked at the implementation again) that targets 8 RVQ-bins. I'd be surprised if it was the full 60K hours of Librilight + whatever else LibriVox has for Japanese and Chinese.

Although, regardless of whether that model and implementation take off, and/or mine finally gets to a decent output quality, my bigger fear is that the "sphere of voice synthesis" will remain rather stagnant, just waiting for someone else to improve upon things, due to the lack of eyes on it since there's no model leak from a big conglomerate (like Stable Diffusion or LLaMA originally were).

I suppose I'll go back to shutting up, not overworking myself, and not stressing over the model and let things train for another week and see how it fares. I just worry that I'd be better off training from scratch again, so perhaps I should set things up to be able to train off a rental again.

Author
Owner

Don't expect any updates for a while.

Both the full-sized and double-deepened models are being retrained from scratch, rather than stitched and glued from existing ARs, to the monolithic approach and now to the full eight RVQ bins. From the outputs so far it seems that it's much better in the RVQ bins 2-8 department (what the NAR targets), but actual speech is still waiting to be realized.

I did add a naïve implementation of beam searching a few days ago, but I don't know how well it fares. The more I play with the instance running on the HuggingFace Space, the worse I feel the model really is.

These graphs aren't looking so great either, but that's probably just the nature of bruteforcing the model to randomly pick each level for each sample in a batch. I just hate that the computed loss/accuracy is rather useless now, and the aura-loss computed is still very forgiving when it's not factoring in the actual speech (or lack of it).

Oh well.

Author
Owner

for a while.

I lied. I suppose there's quite a bit of updates I need to vomit out before I forget about them.

Turns out, the NAR has been trained a little wrong for months.

  • Ignoring computing the loss for the text portion of the prompt is bad. I do not know why the original implementation did this, but I suppose it's hard to figure out unless the model was specifically being trained in a monolithic environment (a rough sketch of the fix follows this list).
  • This definitely should help explain the gap in the loss/accuracy between the AR and NAR, as evident in the immediate drop/spike in the loss/accuracy.
  • I feel like this did offer a nice bump in NAR quality.
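A rough sketch of the fix described in the list above, under the assumption that the sequence is laid out as [text | acoustic prompt | response] and that the acoustic prompt stays masked out of the loss; the real target construction may differ:

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # cross_entropy's default ignore_index

def nar_loss_with_text(logits: torch.Tensor, text: torch.Tensor,
                       resp_target: torch.Tensor, prom_len: int) -> torch.Tensor:
    # keep the text tokens in the target so the NAR step also gets a loss
    # signal on the text portion, instead of masking it out entirely
    prom_mask = torch.full((prom_len,), IGNORE, dtype=torch.long)
    target = torch.cat([text, prom_mask, resp_target])
    return F.cross_entropy(logits, target, ignore_index=IGNORE)
```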

A monolithic approach definitely does work when trained from the onset as one.

  • do NOT try and glue anything extra onto a model by expanding it to tend to more RVQ bins, be it converting an AR into a monolithic AR+NAR, or a NAR to more RVQ bin levels. The underlying transformer (attention- or retention-based) will just never truly learn the additional bins after the model has been trained far enough. You simply cannot bruteforce it into conformity, or at least it's not worth it.
    • I don't have a better place to put this, but I think SpeechX's approach of introducing special tokens into the prom_embs is flawed; it's better to instead just have a dedicated embedding for these task tokens. This approach can also be used to add a language identifier, rather than subjugating the text tokens after the fact.

As for the models being trained (again):

  • The normal full-sized model is coming along swimmingly.
    • Actual speech is starting to emerge once more at about 40% of my "epoch" of 12K hours (again, despite the term "epoch" being hazy due to how the sampler behaves).
      • I do not have a good comparison in throughput between training an AR and NAR together versus training a monolithic AR+NAR in one model. There are too many unknowns, and I can't really compare apples to apples as my training methodologies already differ from the previous training run.
    • I opted to have the input acoustic prompt randomly trimmed to between 3 and 9 seconds (a rough sketch follows this list).
      • I noticed in testing that quality output from inferencing very much depends on having the length of your input acoustic prompt match what's in training: too long and you're corrupting the output.
      • This does kind of hinder throughput speed, but whatever, I don't think it's beneficial to train at a "low" context then try and go up.
    • I'm rather pleased with the actual audio quality of the model, ignoring that speech still needs some time. The "fine" portions (the NAR parts) of the audio sound pretty crisp in the evaluation output. The validation output needs some love, as it's not all that refined, but it can be.
  • The double deepened NAR is on pause right now, for a LLaMA2 related endeavor:
    • I originally was considering subjugating the 4070Ti, but with GPTQ + ExLLaMA, it can only go up to 4K context, so the 7900XTX has to be used, also at the benefit of ExLLaMA v2 8bit models at 8K (or 4.Xbit quantized models at 16K context).
    • As much as I feel training the double deepened model would be more favorable in terms of quality output, it makes more sense to train the normal sized model as it's proven it can work at that size, and I think getting something usable out of it will take much less time than it would for the double deepened model in sheer terms of throughput rates.
  • I've added a bit of a naive beam search to the sampler.
    • It'll make use of being able to batch inference and keep the top-K (K=beam width) probabilities. I don't have any objective metrics, but it's better than nothing.
    • I say naive, because it just takes the top-K logits amongst the batch.
      • There's logic to keep track of the best scoring "most probable" branch, but it always returns the 0-th indexed branch, so either I'm doing something wrong or the branches always end up being ordered by total score.
  • The HuggingFace Space is using the original split AR / NAR models due to really bad quality degradation with the chimera AR+NAR monolithic model. With the right settings, the Space can push out decent output, but it takes more wrangling than I'd like.
    • I've listed the settings in the Space, but it's something like rep pen: 1.3, rep pen length decay: 0.3, top p: 0.95, top k: 768, beam width: 16, player preferences for the temps. I think repetition penalty with a bit of length decay really helps shape out the outputs.
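The random acoustic-prompt trimming mentioned in the list above, sketched out; this assumes EnCodec's 75 code frames per second and a (frames, levels) prompt tensor:

```python
import random
import torch

FRAMES_PER_SECOND = 75  # EnCodec at 24kHz yields 75 code frames per second

def trim_prompt(prom: torch.Tensor, min_sec: float = 3.0, max_sec: float = 9.0) -> torch.Tensor:
    """Randomly crop the acoustic prompt to somewhere between 3 and 9 seconds."""
    target = int(random.uniform(min_sec, max_sec) * FRAMES_PER_SECOND)
    if prom.shape[0] <= target:
        return prom  # already shorter than the requested window, keep as-is
    start = random.randint(0, prom.shape[0] - target)
    return prom[start:start + target]
```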

I think this should be all the news I've kept in my head for those that are interested still.

I'm hoping I can stop stressing too much over the models the more I realize I'm under no pressure to push out a model, as I'm still able to keep improving the project without needing a decent model from the get-go.

  • I'm surprised how simple and easy it was to implement the sampling features, especially still being rather green to the whole ML space.
  • I am kind of itching to implement mirostat sampling, as the code for it under LLaMA loaders seems rather simple to carry over. I'm just not so sure if it'd be helpful.
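For reference, the mirostat (v2) loop as it's commonly written in LLaMA loaders looks roughly like the following; this is a sketch of that general formulation, not anything wired into this project:

```python
import torch

def mirostat_v2(logits: torch.Tensor, state: dict, tau: float = 3.0, eta: float = 0.1) -> int:
    """Sample one token while steering the observed surprise toward `tau`.
    `state` carries the running threshold `mu` between steps."""
    mu = state.setdefault("mu", 2.0 * tau)
    probs = torch.softmax(logits, dim=-1)
    surprise = -torch.log2(probs)
    keep = surprise <= mu  # truncate the tail of overly surprising tokens
    if not keep.any():
        keep = surprise == surprise.min()
    filtered = torch.where(keep, probs, torch.zeros_like(probs))
    token = int(torch.multinomial(filtered / filtered.sum(), 1))
    state["mu"] = mu - eta * (float(surprise[token]) - tau)  # nudge the threshold toward tau
    return token
```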

So, I'm trying to overfit on just 3 speakers to ensure I have things set up correctly. I'd like to query exactly the same data from the training set to ensure everything is going fine.

Right now, I've been training for about 40 epochs (~4M tokens) and getting close to 60% accuracy with loss sub-linearly dropping. But inference comes out a garbled mess. At what point do you start hearing human-like sounds?

Author
Owner

So, I'm trying to overfit on just 3 speakers just to ensure I have things set up correctly.

Right, I never went back to try and test training on much narrower datasets, as I was doing things entirely wrong with my initial narrowed tests. I know you can definitely overfit for one single sample, as the mini/test trainers do, but I don't think I ever got anything fruitful with just one speaker. I'm sure it's doable, as the lifeiteng/vall-e implementation has a training script on LJSpeech alone.

Right now, I've been training for about 40 epochs (~4M tokens) and getting close to 60% accuracy with loss sub-linearly dropping. But inference comes out a garbled mess. At what point do you start hearing human-like sounds?

Token wise, I'll need to check my metrics. A problem I noted a week or so ago with DeepSpeed is that the tokens processed metric isn't stored, so I'll need to muck around in ./vall_e/plot.py to correct for this. When I do, I should be able to pick out where along training it was in relation to tokens processed.

But, I can say now, judging from all my evaluation / validation outputs from the current model (the AR+NAR monolithic RetNet; I'll have to check the numbers for the previous split AR and NAR models):

  • [human sounding, but still garbled](https://files.catbox.moe/juwnll.7z): speech started to emerge at 2,304,000 samples processed.
  • [human sounding, but not actual English](https://files.catbox.moe/h54iwk.7z): started to emerge at 3,840,000 samples processed (reported epoch: 0.5).
  • [a semblance of actual English](https://files.catbox.moe/60089h.7z): started to emerge at 13,178,112 samples processed.

Although it's kind of hard to say exactly when these milestones precisely occurred. I'll have to assume an average sample would be 64 text tokens + 75 * 6 audio tokens = 514 tokens per sample, so for now my estimated tokens for those milestones would be:

  • 1,184,256,000 tokens
  • 1,973,760,000 tokens
  • 6,773,549,568 tokens
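For anyone who wants to check the arithmetic, those token counts are just the milestone sample counts multiplied by the 514-tokens-per-sample estimate above:

```python
tokens_per_sample = 64 + 75 * 6  # = 514, per the estimate above
milestones_in_samples = [2_304_000, 3_840_000, 13_178_112]
print([n * tokens_per_sample for n in milestones_in_samples])
# [1184256000, 1973760000, 6773549568]
```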

Again, I'm not sure how different a model's progression would be with a much smaller dataset, but if my first naive test runs are anything to go by, it'll take what feels like a loooong time.

Author
Owner

Also I just realized the issue is working again. I'm not sure why it broke, or how it resolved itself.

There wasn't really anything noteworthy outside of:

  • I added mirostat sampling, but it's incompatible with beam search sampling.
  • the AR+NAR monolithic is quite fruitful, but needs more time training to iron out inconsistencies. The HuggingFace Space pivoted over to it as it has real 8 RVQ bin outputs now, but it seems it can't inference at float16.
  • I caved and have a [GitHub](https://github.com/e-c-k-e-r/vall-e) to serve as a mirror.

I suppose the things I still have left to do are:

  • finetune tests, but this is more predicated on a good baseline first.
  • experiment with context extending. I have an idea to cheat around what seems to be a maximum context (despite RetNet allegedly fixing this).
  • figure out if it's viable to continue training a double deepened model.
  • actually do SpeechX tasks (which should be fine on my 7900XTX), and other languages (but this is still predicated on a proper Japanese phonemizer, and acquiring a dataset on other languages).

Although it's kind of hard to say exactly when these milestones precisely occurred. I'll have to assume an average sample would be 64 text tokens + 75 * 6 audio tokens = 514 tokens per sample, so for now my estimated tokens for those milestones would be:

1,184,256,000 tokens
1,973,760,000 tokens
6,773,549,568 tokens
Again, I'm not sure how different a model's progression would be with a much smaller dataset, but if my first naive test runs are anything to go by, it'll take what feels like a loooong time.

I'm asking about the accuracies and losses you see once it turns into human sounding (just trying to debug inference for my custom dataset). E.g. is it 50% acc, 60%, 70%, 80%? Since losses and accs vs tokens vary with hyperparameter settings.

Author
Owner

I'm asking about the accuracies and losses you see once it turns into human sounding (just trying to debug inference for my custom dataset). E.g. is it 50% acc, 60%, 70%, 80%? Since losses and accs vs tokens vary with hyperparameter settings.

I had a huge block outlining why using a loss / accuracy metric is a baseless metric to go by, but I've omitted it for coming off far too blunt.

Your magic number with the current monolithic AR+NAR is loss = 3.1, acc = 0.7. Enjoy.


Cool, that's useful for the purposes of debugging anyway. I do see in some of your earlier posts how sometimes quality versus loss/acc can be inconsistent.

Another question: I'm using the monolithic ar+nar. I see you have a model class that is ar+nar, but in inference.py you separately instantiate and call the ar and nar. Is that correct? I know there's an ar_nar class [here](https://git.ecker.tech/mrq/vall-e/src/branch/master/vall_e/models/ar_nar.py#L13).

Again, just trying to debug my inferencing (it could also be that there's nothing wrong and I just need to wait for it to train longer).


Another thing that would be fairly useful for the ar+nar class:
Right now, you can only see the combined loss and accuracy. One thing that may be useful to adjust over time is the p_ar_level. If I notice the ar loss is high but the nar loss is low, I can set the p_ar_level to be high.

So, is there a simple way to additionally emit the losses for ar and nar separately? I'll take a look at that portion of the code somewhat soon.


You were right: at around loss 3.0 I am getting human-like sounds (this is just on 30 hours of audio...). I was able to add some lines to emit the metrics separately. It looks like the ar loss is a good deal lower than the nar loss, which is in line with some of your prior posts. What are your intuitive thoughts on what the ar versus nar losses should correspond to?

AR corresponds to the first quantized level, whereas NAR covers the other ones. So, canonically, the paper suggests NAR should correspond to the acoustics and speaker voice specifics, whereas AR should correspond more to the actual text synthesis accuracy?

If I'm getting good acoustics but bad text adherence (i.e. it's speaking gibberish, maybe sounds like another language, but human acoustics sounds are good), wouldn't that correspond to low NAR loss but high AR loss? I'm kind of seeing the opposite right now: human acoustic sounds are fairly good but basically no adherence to the text (just gibberish). So, I would expect that to be higher NAR loss and lower AR loss, but instead I see the opposite (~2.2 AR loss versus ~3.1 NAR loss).

Curious to hear what your thoughts and interpretation of these values are.

Author
Owner

but in inference.py you separately instantiate and call the ar and nar. Is that correct?

The AR/NAR/AR_NAR classes just have overloaded properties and a forward to do the sampling proper. I can very much assure you it's correct, as both the HuggingFace Space and the web UI are both fine with the monolithic model.

So, canonically, the paper mentions NAR should correspond the acoustics and speaker voice specifics, whereas AR should correspond to more the actual text synthesis accuracy?

The AR does heavily guide the "accuracy" of the utterance, but only for the fact that it's the dependency for the remainder of the sequences, as every level after will depend on the prior level. The NAR governs the "finer" details of the waveform, but only in the "each additional quantization level is effectively another Whittaker-Shannon sinc interpolation wave, but its effect on the final waveform is smaller and smaller, thus resolving finer details that prior levels cannot resolve" sense.

However, saying that the first quantization level is solely responsible for "adherence to the text" was a naive interpretation of mine. There's properties/details of speech that the first level cannot ever resolve, but the NAR can even with it targeting one level, and vice versa. This is evident in the past when I would include pure AR / impure NAR outputs, where details in an utterance are kind of there but were never quite enough to resolve consistently.

I'm kind of seeing the opposite right now: human acoustic sounds are fairly good but basically no adherence to the text (just gibberish)

That's just the model knowing how to generate something that sounds human-ish yet chaotic, but it cannot apply order (language) to it, or the nuances of it. In fact, the AR is usually the first to have speech emerge (or at least, be the one that sounds fine when it does), while the NAR will still sound like crusty shit at that point in time and have a bunch of artifacts (at least, in the non-monolithic models).

So, I would expect that to be higher NAR loss and lower AR loss, but instead I see the opposite (~2.2 AR loss versus ~3.1 NAR loss).

You should instead split your loss per RVQ-bin level rather than per AR (1 level) and per NAR (7 levels). You should see that, as the quantization level increases, the average loss should increase. Should, since I only know that when making the jump from a NAR that targets 1 level to 3 then to 7, the loss climbed up more and more. I could be wrong, and they're all higher together/in aggregate.
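Since each sample in a batch is already trained against one randomly chosen quant_level, splitting the reported loss per level is just a grouped average; a rough sketch, assuming you can get a per-sample (unreduced) loss out of the loss function:

```python
import torch

def split_loss_by_level(per_sample_loss: torch.Tensor, quant_levels: torch.Tensor) -> dict[int, float]:
    """Average per-sample losses grouped by the RVQ level each sample was trained against."""
    out = {}
    for level in quant_levels.unique().tolist():
        out[level] = per_sample_loss[quant_levels == level].mean().item()
    return out
```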

There's also always the chance that preparing the target sequence for the NAR to compute the loss against to be flawed again, as evident when the loss for it dropped a significant amount when having it apply loss calculations against the text too. But loss isn't a good metric.

What's your intuitive thoughts on what the ar versus nar losses should correspond to?
Curious to hear what your thoughts and interpretation of these values are.

I don't have any.

Treating the loss/accuracy as a strong metric for a model's performance after speech emerges is quite naive, as evident with the reported losses with auraloss during the evaluation / validation routines meaning nothing. Any training after that point is essentially bruteforcing due to the lack of a meaningful way to quantify the accuracy for speech and praying that training in much much smaller steps to try and align with the targets will iron things out over time (at the risk of overfitting).

It was in my omitted blunt blurb, but there's simply no way for a naive loss computation to account for the accuracy of the speech itself from an already neural sequence of EnCodec codes while retaining the logits.

Now, I say retain the logits, as only by retaining the logits can the model be improved through the backwards pass. Doing something like "compute a word-error rate score to gauge accuracy of the speech" can't be done while retaining logits and thus updating the model through the backwards pass. However, I imagine something like reinforcement learning can help improve the model with word-error rate as a metric itself, but implementing it is beyond my knowledge, and RLHF is synonymous with ethics alignment, so it inherently has a sour taste.

Author
Owner

Besides that, and the site issues, microsoft/torchscale did some commits that break compatibility with existing models using its RetNet. It messes with the normalization method (LayerNorm => RMSNorm), removes biases from the weights, and uses a gated linear unit (just an additional weight and removal of subln) in place of a feed-forward network. Playing around with re-enabling each new feature has the model suffer tremendously, and from the test trainer there seem to be no apparent gains from using RMSNorm / no biasing / a GLU instead, so I will not try and glue things around again and end up crippling the model like I kept doing in the past.
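For reference, RMSNorm is just LayerNorm without the mean subtraction and without a bias term, rescaling by the root-mean-square instead; a minimal sketch of it (not torchscale's actual code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # scale by 1 / sqrt(mean(x^2)); no mean-centering, no bias
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```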


I suppose I'll go back to shutting up and trying not to stress too much over the model, as I have for who knows how long before. I feel I should do my re-evaluations when it hits maybe 4 or 5 epochs trained (it's currently on epoch 2.4) before making any further decisions with the model.

  • I'm tempted to pivot to tossing in the SpeechX tasks / VALL-E X multi-lingualness to try and jostle some things in hopes of improving the model, now that I've grown a brain and realized the right way to go about implementing such markers.
    • Although SpeechX tasks will severely harm throughput, and my implementation for training any noise related tasks are still rather dubious.
    • I still need to figure out how to go about phonemizing Japanese, and acquiring more data, but that's a pain. I think Plachtaa/VALL-E-X had a commit removing phonemizer as a dependency, so that might be somewhere I can poke at for ideas.
  • I'm also getting tempted to try and delve into more exotic loss calculation functions / model training (like reinforcement learning) in hopes that it can help with the dilemma of "there's no real way to gauge the correctness of the generated speech / the underlying EnCodec codes are '''neurally causal'''" so I don't have to bruteforce training, but I imagine the greats in the EnCodec-based audio LM sphere (or I guess even the mel-spectrogram based LMs) would have already found a solution for that anyway.
    • I imagine reinforcement learning over supervised learning would take just as much time, given the amount of time needed to perform transcription. My dataset would also need to be re-parsed to have the original text and not the phonemes (or I would have to train an EnCodec-based audio transcription model that outputs phonemes, so an inverse of VALL-E, and that's going to have its own problems).
  • The gravity of how much training might actually be needed is starting to dawn on me. Realistically, the odds of some chucklefuck like me with one measly GPU being able to "beat out the competition" of other solutions like TorToiSe, Bark, and VALL-E X seem rather infinitesimally small, despite how "quickly" the model seemed to emerge speech.

Thoughts on StyleTTS2?
https://github.com/yl4579/StyleTTS2
