Implement Training #30

Closed
opened 2023-02-13 03:42:40 +00:00 by Armored1065 · 5 comments
Contributor

Can you please include all the existing voices so the voices folder is not empty and we can test everything works during the first run?

Also, it might be a good idea to collaborate with https://github.com/152334H/tortoise-tts-fast, as the end goal seems to be similar.

Contributor

I don't know if the intention is to leave the voices folder empty, but for now you can grab them from an older commit: https://git.ecker.tech/mrq/tortoise-tts/src/commit/811539b20adfe6d85d2bc3e6728d55fd2427aae0/tortoise/voices
Edit: it probably was intentional, as the include in MANIFEST.in (https://git.ecker.tech/mrq/tortoise-tts/src/commit/5e843fe29d886afc8371fa1c343fe70f7bf02cc3/MANIFEST.in#L2) is still there

Owner

I restored the `random` voice option, as it got lost when the web UI was added in commit 37d25573accf2dce213cc5ec72c05c4afa02f2b5. It would've been nice to have when I was testing the colab notebook, instead of importing a latent every setup, but oh well. On the note of the colab:

Can you please include all the existing voices so the voices folder is not empty and we can test everything works during the first run?

Right. I did it for a number of (in retrospect, stupid) reasons; I don't think anyone wants to be bothered hearing the laundry list of "despite it not really mattering in the end" or some other cope, but in the end it helps keep userdata consistently outside of the `./tortoise/` folder (honestly, voices have no reason to be in that folder, but I wouldn't know better).

To restore them, run:

  • Linux: `git checkout 811539b20adfe6d85d2bc3e6728d55fd2427aae0 ./tortoise/voices/`
  • Windows: `git checkout 811539b20adfe6d85d2bc3e6728d55fd2427aae0 .\tortoise\voices\`

Do not forget to move them out of the `./tortoise/` folder and into `./voices/`, because of the nightmare of trying to maintain compatibility between people using either the old or the new spot (which I might as well just make it grab voices from either location; a rough sketch of that fallback follows).
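Something like this is all the "grab from either location" idea would take; a minimal sketch with a hypothetical helper name, not what's actually in the repo:

```python
import os

def get_voice_path(name, search_dirs=("./voices", "./tortoise/voices")):
    """Return the first directory that contains the requested voice,
    checking the new ./voices/ location before the legacy ./tortoise/voices/ one."""
    for base in search_dirs:
        candidate = os.path.join(base, name)
        if os.path.isdir(candidate):
            return candidate
    raise FileNotFoundError(f"voice '{name}' not found in any of: {', '.join(search_dirs)}")
```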

might be a good idea to collaborate

!TODO!: fill in later, as I'm already spending too much time on this section, so I want the rest of the comment submitted first. Actually, I'll keep the unabridged version to myself, and won't share even the abridged version, as they're rather blunt and aren't necessarily within the scope of this repo.

it probably was intentional, as the include in MANIFEST.in is still there

Damn, I feel like a foolish fool, as I removed them (initially) solely because they kept getting copied during `setup.py install`.

mrq closed this issue 2023-02-14 21:25:22 +00:00
Author
Contributor

Actually, I'll keep the unabridged version to myself, and won't share even the abridged version, as they're rather blunt and aren't necessarily within the scope of this repo.

Understood. Btw, they seem to have figured out the training part (https://github.com/152334H/DL-Art-School); please take a look and see if it can be added to this repo, with everything accessible via gradio.

Owner

Christ, that was quick. Props to that mad lad for getting it working. Conveniently just in time too, as I felt I was starting to run out of ways to boost performance/quality to make up for the lack of being able to fine-tune/retrain.

I suppose I'll list out my thoughts around it (bulleted, because my autism likes bulleting things):

> the changes in this repo are also licensed as AGPL

  • seeing him also re-license it to AGPL reminded me I was going to follow suit when I was first shown his tortoise fork, but I guess it's not that much of a big deal. I guess he has quite the ill-will towards the original dev, while I'm rather apathetic.
  • and by extension, I suppose there's not much to use code-wise from that repo itself, as it's moreso documentation at this point.
  • although I also don't think I used any substantial amount of code from tortoise-tts-fast, just the lines to fix `kv_cache`-ing

> INSTALLATION

  • > ONLY TESTED ON python=3.9; use your existing tortoise env if possible

    • how most convenient
  • I guess a considerable challenge though is figuring out how I want to re-distribute it, like:
    • if I want it to be another branch but merged with my fork
    • have it be a separate repo that somehow integrates with this repo
    • just fuck it and copy paste the files and ignore the git history with them

> RUNNING

  • seems pretty straightforward
  • I'd imagine with a web UI implementation, dealing with the yml editing would be very idiot-proof
  • I think the only challenge would be:
    • integrating `train.py` nicely to train it
      • which shouldn't be a problem, as I already pretty much erased `tts.tts_with_preset` from the original tortoise
      • integrating the gradio progress thing should be easy as I already got an override for it
  • with the web UI integration too, it would be very straightforward to pick and choose checkpoint models, similar to Voldy's Web UI picking models (a rough sketch follows this list)
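A rough sketch of that checkpoint-picking idea; the models directory, filenames, and labels here are assumptions for illustration, not anything the repo actually uses yet:

```python
import glob
import os

import gradio as gr

def list_autoregressive_models(model_dir="./models/finetunes"):
    """Collect any fine-tuned .pth checkpoints plus a 'default' entry (paths are assumptions)."""
    found = sorted(glob.glob(os.path.join(model_dir, "*.pth")))
    return ["auto (default autoregressive model)"] + found

with gr.Blocks() as demo:
    model_picker = gr.Dropdown(
        choices=list_autoregressive_models(),
        value="auto (default autoregressive model)",
        label="Autoregressive Model",
    )
    refresh = gr.Button("Refresh Model List")
    # repopulate the dropdown whenever new checkpoints land in the folder
    refresh.click(lambda: gr.update(choices=list_autoregressive_models()), outputs=model_picker)
```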

> RESULTS

  • > 500 steps [...] batch size 128 [...] ~4.5k wav files [...] 11 epochs
    • mmm
    • I guess it's less about the large source sample and more about the epochs
    • but depending on the it/s I might be CBT'd if I have to try and train off of my 2060 (if I can't slap DirectML onto it)
  • impressive it copied the accent
    • although it's a little bit of a shock how crunchy default tortoise output is; I'm already too spoiled by voicefixed-and-resampled-to-44K audio, even with quick settings

> This project is in its infancy, and is in desperate need of contributors. If you have any suggestions / ideas, please visit the discussions page. If you have programming skills, try making a pull request!

  • that's where I come in I suppose

> offload all of the work to other contributors

  • hah, yeah. Most definitely where I come in I suppose, not that I mind
    • funnily-ish enough, that's part of the reason I was wary of collaborating, as I wanted to see how things turned out independently from me

As for what's on my plate (my own sort of to-do, I suppose to put it in writing):

  • get something trained just for me to play around with and get a good feel for it
  • document which settings work best, or at least tips and tricks (similar to my findings for Textual Inversion when it was in its infancy)
  • slap it into my web UI as a Training tab
    • parameters for handling the yml configuration parts (a rough sketch follows this list)
    • add in stuff for selecting autoregressive models (between default and trained and named models)
    • cram in gradio Progress stuff for any tqdm parts (I don't think I even checked if track_tqdm works all that well)
  • try and slap my crude DirectML wrapper onto it
    • I say try, because I tried doing it with the kohya/sd-script LoRA trainer out of curiosity and it didn't work
    • although I can't imagine why a GPT2 trainer wouldn't work with it when a GPT2 inferencer does
  • something else, but I'll figure it out by the time I get to this point

However, I make no promises on my to-do.
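For the yml-configuring part of that Training tab, a minimal sketch of the idea; the config keys, template path, and function names are placeholders, not DLAS's actual schema:

```python
import os

import gradio as gr
import yaml  # PyYAML

def write_training_yaml(voice, batch_size, learning_rate, epochs,
                        template_path="./models/.template.yaml",
                        out_dir="./training"):
    """Fill a training YAML from web UI parameters.
    The keys and template path here are placeholders, not DLAS's real option names."""
    with open(template_path) as f:
        config = yaml.safe_load(f)
    config["name"] = voice
    config["batch_size"] = batch_size
    config["learning_rate"] = learning_rate
    config["epochs"] = epochs
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"{voice}.yaml")
    with open(out_path, "w") as f:
        yaml.safe_dump(config, f)
    return out_path

def prepare_training(voice, batch_size, lr, epochs, progress=gr.Progress(track_tqdm=True)):
    # track_tqdm=True makes gradio mirror any tqdm bars the trainer emits into the UI
    return write_training_yaml(voice, batch_size, lr, epochs)
```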

Christ, that was quick. Props to that mad lad for getting it working. Conveniently just in time too, as I felt I was starting to run out of ways to boost performance/quality to make up for the lack of being able to fine-tune/retrain. I suppose I'll list out my thoughts around it (bulleted, because my autism likes bulleting things): >\> the changes in this repo are also licensed as AGPL * seeing him also re-license it to AGPL reminded me I was going to follow suit when I was first shown his tortoise fork, but I guess it's not that much of a big deal. I guess he has quite the ill-will towards the original dev, while I'm rather apathetic. * and by extension, I suppose there's not much to use code-wise from that repo itself, as it's moreso documentation at this point. * although I also don't think I used any substantial amount of code from the tortoise-tts-fast, just like, the lines to fix `kv_cache`ing >\> INSTALLATION * >\> ONLY TESTED ON python=3.9; use your existing tortoise env if possible - how most convenient * I guess a considerable challenge though is figuring out how I want to re-distribute it, like: - if I want it to be another branch but merged with my fork - have it a separate repo that incidentally somehow integrates with this repo - just fuck it and copy paste the files and ignore the git history with them >\> RUNNING * seems pretty straightforward * I'd imagine with a web UI implementation, dealing with the yml editing would be very idiot-proof * I think the only challenge would be: - integrating the `train.py` nicely to train it + which shouldn't be a problem, as I already pretty much erased `tts.tts_with_preset` from the original tortoise + integrating the gradio progress thing should be easy as I already got an override for it * with the web UI integration too, it would be very straightforward to pick and choose checkpoint models, similar to Voldy's Web UI picking models >\> RESULTS * \> 500 steps \[...\] batch size 128 \[...\] ~4.5k wav files \[...\] 11 epochs - mmm - I guess it's less about the large source sample and more about the epochs - but depending on the it/s I might be CBT'd if I have to have try and train off of my 2060 (if I can't slap DirectML onto it) * impressive it copied the accent - although it's a little bit of a shock how crunchy default tortoise output is, I'm already too spoiled by voicefixed-and-resampled-to-44K audio even with quick settings >\> This project is in its infantcy, and is in desperate need of contributors. If you have any suggestions / ideas, please visit the discussions page. If you have programming skills, try making a pull request! * that's where I come in I suppose >\> offload all of the work to other contributors * hah, yeah. 
Most definitely where I come in I suppose, not that I mind - funnily-ish enough, that's part of the reason I was wary of collaborating, as I wanted to see how things turned out independently from me As for what's on my plate (my own sort of to-do, I suppose to put it in writing): * get something trained just for me to play around with and get a good feel for it * document which settings work best, or at least tips and tricks (similar to my findings for Textual Inversion when it was in its infancy) * slap it into my web UI as a `Training` tab - parameters for handling the yml configurating parts - add in stuff for selecting autoregressive models (between default and trained and named models) - cram in gradio Progress stuff for any tqdm parts (I don't think I even checked if track_tqdm works all that well) * try and slap my crude DirectML wrapper onto it - I say try, because I tried doing it with kohya/sd-script LoRA trainer since I was curious and it couldn't - although, I can't imagine why a GPT2 trainer won't work for it, but a GPT2 inferencer will * something else, but I'll figure it out by the time I get to this point However, I make no promises on my to-do.
mrq changed title from Existing voices and Random voice to Implement Training 2023-02-16 04:17:57 +00:00
mrq closed this issue 2023-02-17 19:25:12 +00:00
Owner

I should have everything working for training under the new, cleaned-up repo (https://git.ecker.tech/mrq/ai-voice-cloning). It took a lot of headaches, given how many oddball fixes were needed (thinking about them right now is making my right eye twitch), but it easily handles everything from:

  • generating the LJSpeech-formatted dataset (rough sketch after this list)
    • uses openai/whisper to parse through a folder under voices
    • a transcription is made, as well as automatically slicing and trimming according to the transcription
    • although it's a little inconsistent, some things get sliced a little too liberally
  • generating the training YAMLs and putting them in the right place
    • although it still needs some more love put into it; I feel there are some redundancies and missing validation
  • spawning a subprocess to train it (sketched below):
    • tortoise-tts gets destructed to (try to) free up VRAM (although I would just check the `Defer TTS Load` setting and restart the web UI)
    • all stdout/stderr output gets forwarded back to the web UI (wish it autoscrolled)
    • button to kill the spawned process
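The dataset-generation step boils down to something like this rough sketch: the actual slicing/trimming logic in the repo differs, and the file layout and model choice here are assumptions:

```python
import os

import torchaudio
import whisper  # openai/whisper

def prepare_dataset(voice_dir, out_dir, model_name="base.en"):
    """Transcribe every wav under a voice folder and emit an LJSpeech-style
    train.txt of `clip|text` pairs, slicing clips by whisper's segment timestamps."""
    model = whisper.load_model(model_name)
    os.makedirs(out_dir, exist_ok=True)
    lines = []
    for fname in sorted(os.listdir(voice_dir)):
        if not fname.endswith(".wav"):
            continue
        path = os.path.join(voice_dir, fname)
        result = model.transcribe(path)
        waveform, sample_rate = torchaudio.load(path)
        for i, seg in enumerate(result["segments"]):
            start = int(seg["start"] * sample_rate)
            end = int(seg["end"] * sample_rate)
            clip = waveform[:, start:end]
            clip_name = f"{os.path.splitext(fname)[0]}_{i}.wav"
            torchaudio.save(os.path.join(out_dir, clip_name), clip, sample_rate)
            lines.append(f"{clip_name}|{seg['text'].strip()}")
    with open(os.path.join(out_dir, "train.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```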

And all of this will reuse the existing tortoise-tts files, and the `dvae.pth` gets easily downloaded and stored alongside the other models. Literally zero configuration outside of providing your training material and the parameters.
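The subprocess piece is roughly this; the DLAS entry-point path and the `-opt` flag are assumptions based on its upstream layout, not something this repo pins down:

```python
import subprocess
import sys

def spawn_training(config_path, dlas_dir="./dlas"):
    """Launch the DLAS trainer as a child process and yield its output line by line,
    so the web UI can stream it into a textbox (entry-point path/flag are assumptions)."""
    cmd = [sys.executable, f"{dlas_dir}/codes/train.py", "-opt", config_path]
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so nothing gets lost
        universal_newlines=True,
    )
    for line in proc.stdout:
        yield line.rstrip()
    proc.wait()

# a "kill training" button only needs to keep a handle around and call proc.terminate()
```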

It's not exactly what I was aiming for:

  • DLAS was not written with being loaded as a package in mind, so some kludgy hacks were needed
  • I can't exactly train this locally, as:
    • it seems the batch size needs to be above some arbitrary number or it'll throw vague errors
    • I can't find a sweet spot to train it off my 2060, memory is too fragmented and there's just barely not enough space left
    • adding DirectML to DLAS is entirely off the table; a cursory glance shows it'll be a huge nightmare to add in, because Microshit thought it was a great idea to not go the ROCm/HIP way and just be a drop-in replacement for anything CUDA-dependent (see the sketch after this list)
  • I have to resort to a colab to train it through the web UI, which was a separate hell in a handbasket
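For context on why the wrapper approach falls apart here: it only swaps the device handle, so anything hard-coded against CUDA-specific calls bypasses it entirely. A minimal sketch, assuming the torch-directml package:

```python
import torch

try:
    import torch_directml
    device = torch_directml.device()  # the DirectML adapter exposed as a torch device
except ImportError:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# code that routes everything through this device handle works fine,
# but anything that hard-codes .cuda() or torch.cuda.* (as much of DLAS does)
# never sees the DirectML device, which is why a drop-in wrapper can't help there
model = torch.nn.Linear(8, 8).to(device)
```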