Implementing XTTS by coqui? #386

Open
opened 2023-09-15 09:05:01 +00:00 by maki6003 · 11 comments

https://huggingface.co/coqui/XTTS-v1
https://huggingface.co/spaces/coqui/xtts

It's very fast and the quality is pretty good, though the cloning isn't perfect... but maybe you can figure out how to improve it, make it fine-tunable, or use its inference to improve the speed of other models? Not sure, to be honest, but it's pretty cool.

I was also wondering if you'd like to join the Synthetic Voices Discord, since we use your project a lot there and share work. It would be great to have you in there.

https://discord.gg/xUrTtfc9BT

Owner

It's very fast and the quality is pretty good, though the cloning isn't perfect... but maybe you can figure out how to improve it, make it fine-tunable, or use its inference to improve the speed of other models? Not sure, to be honest, but it's pretty cool.

I'll take a gander at it when I get a chance.

ⓍTTS is a Voice generation model that lets you clone voices into different languages by using just a quick 3-second audio clip. Built on Tortoise, ⓍTTS has important model changes that make cross-language voice cloning and multi-lingual speech generation super easy. There is no need for an excessive amount of training data that spans countless hours.

Built on Tortoise

Oh...


For TorToiSe, what it boasts it can do is very promising, especially if their proprietary copy supports streamed inference (which I highly doubt; TorToiSe just isn't an arch you can stream output from without gutting it).

Again, I'll take a gander when I can.

Owner

Think I've got a good understanding of how it functions. It'd be better to simply "subjugate" (for lack of a better term) the XTTS model and vocab than to implement a separate backend and delve further into the spaghetti of AIVC's code, with proper attributions and license notice of course.

  • Their UnifiedVoice analog ("masked" as TTS.tts.layers.xtts.GPT) seems to take almost the same model parameters outside of some extra bits to account for a larger vocab (denoted in the model JSON).
    • Fun fact: the model JSON even references phonemizers, and it's still locked to 22050Hz mel spectrograms.
  • The vocab is simply expanded beyond the "BPE" tokenizer (being able to fit the token indices under a byte) by including a bunch of language tokens ([en], [ja], etc.) and a bunch of per-language text (a rough sketch of how that works follows this list).
    • I'm rather skeptical of this approach, but this is just a subjective skepticism. I'm sure expanding the vocab tokens helps, but for audio I imagine it helps quite a lot to be able to reuse existing "phoneme" representations and cross your fingers your attention heads can correlate things to the language token in the beginning.
    • It's a smidge funny that their tokenizer code does "reference" (to put it as nicely as I can) my fix for Japanese. I've always hated it; it's a bad bandaid. Bad enough that I didn't even bother relying on my japanese.json tokenizer.
  • The stack neglects TorToiSe's extra features (the bandaid bloat, as I've called it before, though it's very much necessary, as I've found) like the CLVP/CVVP models or the wav2vec2 redaction at the end, or even nice creature comforts like inherently being able to load cached latents (that wasn't my doing originally, just the logic to make it work, since base TorToiSe never saves them).
    • It also neglects extras like "smarter" latent calculation, BigVGAN, DeepSpeed inferencing, and even half-precision or AMP.
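As a rough illustration of how the expanded vocab gets used (a minimal sketch: it assumes the XTTS vocab.json is a standard Hugging Face tokenizers file and that the language tag is simply prepended to the text, which matches the description above but isn't taken from Coqui's code):

```python
# Sketch: encode text with a language token prepended, assuming vocab.json is a
# standard Hugging Face `tokenizers` JSON file with [en]/[ja]-style entries.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("vocab.json")  # e.g. the file from coqui/XTTS-v1

def encode_with_lang(text: str, lang: str = "en") -> list[int]:
    # The language tag is just another vocab entry; the model has to learn to
    # condition the rest of the sequence on it through attention alone.
    return tokenizer.encode(f"[{lang}] {text.lower()}").ids

print(encode_with_lang("hello world.", "en"))
print(encode_with_lang("こんにちは。", "ja"))
```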

If anything, I don't think I even really have to bother with modifying the UnifiedVoice parameters or the tokenizer. I very much wouldn't be surprised if the model was trained off of the existing TorToiSe weights with the new vocab, but that's just speculation.
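For what it's worth, that speculation could be sanity-checked by diffing whatever tensors the two checkpoints share. A minimal sketch, assuming both files are plain torch checkpoints and that the XTTS GPT weights sit under a "model" entry with a gpt. prefix (neither of which is confirmed here):

```python
# Sketch: compare overlapping tensors between the base TorToiSe AR checkpoint
# and the XTTS checkpoint. The "model" key and "gpt." prefix are assumptions.
import torch

tortoise_sd = torch.load("autoregressive.pth", map_location="cpu")
xtts_ckpt = torch.load("model.pth", map_location="cpu")  # may need Coqui installed to unpickle
xtts_sd = {k.removeprefix("gpt."): v
           for k, v in xtts_ckpt.get("model", xtts_ckpt).items()
           if k.startswith("gpt.")}

for name, ref in tortoise_sd.items():
    if name in xtts_sd and xtts_sd[name].shape == ref.shape:
        diff = (xtts_sd[name].float() - ref.float()).abs().max().item()
        print(f"{name}: max abs diff {diff:.6f}")
```

If most of the shared tensors came out identical (or near-identical), that would be a strong hint the AR was initialized from TorToiSe's weights.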

I don't know. I shouldn't be that peeved by this and call it a perverted injustice; after all, I originally did start working on TorToiSe to try and jumpstart interest in it, and it has. It's just... kind of gross that it's only briefly mentioned it was "Built on Tortoise :)" when it so very much is TorToiSe. Then again, it's not really my place to be peeved over it, as I'm not neonbjb, and I've probably done my own fair share of perverted injustices in working on TorToiSe, but at the very least I have the decency to credit it properly.

I'll spend some time fiddling with getting the weights loaded under mrq/tortoise-tts; it shouldn't be that hard. If it works, I'll offer a flag or something one can use to make use of the weights (again, with proper attributions and licenses attached) in the web UI.

I do not expect this to be any more fantastic than the base model; TorToiSe has its own limitations, and hats off to neonbjb for however countless hours and sweat he poured into it. For all the flack I give it, it's still fantastic when it works, and getting a model trained is no easy task. But at the end of the day, a GPT-2 transformer that takes in speech latent features and an odd tokenizer to spit out tokens representing a mel spectrogram is always going to be limiting. Fantastic for its time, but there are better things, and you can't train out those issues. Hell, even after addressing those issues in VALL-E, there are still issues.

Author

So, in a sense, what benefit could you get out of it? In inference speed? Or quality?

Owner

what benefit could you get out of it? In inference speed? Or quality?

Hard to say until I actually use the model, both with Coqui and as a subjugated TorToiSe model.

Purely theoretically:

  • the lack of a CLVP/CVVP model to score candidates with is a big hint at a speedup and a moderate hint at quality (a rough sketch of the difference follows this list).
    • Not only does this remove an additional step and having another model to load into VRAM, it hints that only one sequence needs to be sampled, rather than having a large batch size to sample for candidates.
    • Quality-wise, this hints that the model is robust enough that it does not need to sample for candidates.
    • This is assuming there isn't some other method to it. In theory, beam searching in the sampler should help with finding more likely candidates at the cost of more beams meaning a larger batch, but my glance at it didn't seem to hint that it does use beam searching, despite there being logic to handle it (if I remember right).
  • Inherently, the model being trained on multiple languages should help its zero-shot capabilities a lot, semi-regardless of the rather odd way it went about it through the tokenizer. More languages and speakers should help the model in terms of variety, and zero-shot definitely needs more speakers.
    • To expand on this, the amount of speakers in the dataset used to train XTTS can also help its zero-shot capabilities.
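To make that first point concrete, here is a rough sketch of the two sampling strategies; ar and clvp and their methods are hypothetical stand-ins, not TorToiSe's or Coqui's actual API:

```python
# Purely illustrative: candidate sampling + CLVP re-ranking (TorToiSe-style)
# versus trusting a single sample (what XTTS appears to do). `ar.sample` and
# `clvp.score` are hypothetical placeholders.
import torch

def rerank_with_clvp(ar, clvp, text_tokens, latents, num_candidates=16):
    # Sample a batch of candidate token sequences, then keep the one CLVP
    # scores as most plausible for the given text.
    candidates = [ar.sample(text_tokens, latents) for _ in range(num_candidates)]
    scores = torch.stack([clvp.score(text_tokens, c) for c in candidates])
    return candidates[scores.argmax()]

def single_sample(ar, text_tokens, latents):
    # One sampled sequence, no re-ranking pass, no extra model kept in VRAM.
    return ar.sample(text_tokens, latents)
```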

(In short, quality yes, speed maybe)

Again, purely theoretical. I do expect it to be a quality boost, especially under a TorToiSe implementation with BigVGAN, and there might be a chance it performs faster when using it under Coqui. I just don't expect it to magically fix all of TorToiSe's issues, and using it with Coqui seems like it's a bit of a downgrade.

I need to figure out how to go about getting TorToiSe fired up, as I forgot I slotted out my 2060, and my 6800XTs are in my personal rig, but I could always just pause training on one of my GPUs and fiddle in TorToiSe with the XTTS model.


Although, I realized something: the weights for XTTS would need a little work to slot into TorToiSe. The pickled file clocks in at 2.9GiB, while base TorToiSe clocks in at 1.7GiB. I imagine some other models' weights are included, so either I would need additional code to handle it, or I would have to redistribute the weights, which I guess should be daijoubu as long as I provide attribution and the license.


The provided XTTS weights do include the AR, diffusion model, and vocoder, but loading the pickled file requires Coqui to already be installed for the 'config' entry, so the best option for now is just to redistribute a slightly modified copy of the pickled weights.
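If redistribution ends up being the route, the conversion itself is probably just a key filter and a re-save. A minimal sketch; the "model" entry and the gpt. prefix are assumptions about the checkpoint layout, not confirmed structure:

```python
# Sketch: pull only the AR (GPT) weights out of the full XTTS checkpoint and
# save them as a standalone state dict that loads without Coqui installed.
import torch

ckpt = torch.load("model.pth", map_location="cpu")  # unpickling the 'config' entry needs Coqui installed
state = ckpt.get("model", ckpt)                     # weights may be nested under a "model" key
ar_only = {k.removeprefix("gpt."): v for k, v in state.items() if k.startswith("gpt.")}
torch.save(ar_only, "autoregressive.pth")           # plain state dict, no Coqui dependency to load
```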

Owner

Alrighty, I've converted the weights over and did some small tweaks to load it in TorToiSe. When it's uploaded, it'll be under https://huggingface.co/ecker/coqui-xtts.

Samples (using Mitsuru from Persona 4 as the reference voice; the text prompt, I'm pretty sure, was just something lolsorandum xD left over from testing VALL-E under Windows):

  • https://vocaroo.com/1jkspDibGw5j
  • https://vocaroo.com/1gfJnZUEpCpC
  • https://vocaroo.com/1nEskJ415beK

My thoughts:

  • I feel that increasing the sample count in hopes of a better candidate doesn't really matter; after all, the CLVP/CVVP mostly helps pick the most realistic utterance, rather than the most sound-alike utterance (unless I'm wrong).
  • I feel that the overall quality of the outputs is much nicer, and more realistic, but I don't remember how much credence should be lent to BigVGAN as the vocoder.
  • As a voice cloner... it's not good. For the same latents, the voice varies quite a lot, as evident in the third utterance. I suppose the "TorToiSe to RVC" users will like this at least.
  • Utilizing the language codes with XTTS's tokenizer would require working around the wav2vec2 redaction (the [I am happy], stuff), since [en] Something will trigger the redaction code and throw an error. I suppose whoever was responsible for the implementation couldn't be bothered with some logic to make it work. I suppose I can patch this myself by instead having {en} and the like in the tokenizer vocab (a possible workaround is sketched at the end of this comment). It's necessary if a user wants to use it and leverage the cross-lingual linguistics.
  • I cannot imagine there being any performance differences from the weights themselves. The implementation under Coqui doesn't seem to have any magic either, outside of, as mentioned before, skipping the CLVP/CVVP pass.
  • I'm pretty sure the model was trained on not-so-varied voices. The voices I keep getting don't seem all that varied.
  • My god do I need to rewrite the web UI, and I suppose clean up ./tortoise/api.py's TTS(). I'm rather embarrassed by how it looks.

To reiterate, this is when using the model under TorToiSe. There could be some magic with using it under Coqui to get the voice clonability to be actually useful, but I doubt it. Finetunes can definitely be done, but the DLAS config YAML needs to be modified to use the XTTS weights.
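On the language-tag clash above, one possible workaround (a sketch, under the assumption that the tokenizer vocab is patched to carry {en}-style tokens instead of [en]) is to rewrite the tag before the prompt ever reaches the redaction code:

```python
# Sketch: keep language tags out of wav2vec2 redaction's way by rewriting
# "[en] text" to "{en} text" before redaction sees the prompt. Assumes the
# tokenizer vocab has been patched with "{en}"-style entries.
import re

LANG_TAG = re.compile(r"^\[([a-z]{2})\]\s*")

def shield_lang_tag(text: str) -> str:
    return LANG_TAG.sub(lambda m: "{" + m.group(1) + "} ", text)

print(shield_lang_tag("[en] The quick brown fox."))  # -> "{en} The quick brown fox."
```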



That would be great, to have their model or maybe their tokenizer/cleaners, because I feel their multilingual model is far better at saying words (in German, on the one example I tried) compared to the dataset I've been training on (with hours of data) plus nanonomad's advice/model on foreign utterances.

```
Stored autoregressive model to settings: ./models/tortoise/autoregressive.pth
Loading autoregressive model: ./models/tortoise/autoregressive.pth
Traceback (most recent call last):
  File "H:\ai-voice-cloning\venv\lib\site-packages\gradio\routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "H:\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "H:\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "H:\ai-voice-cloning\venv\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "H:\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "H:\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "H:\ai-voice-cloning\src\utils.py", line 3741, in update_autoregressive_model
    tts.load_autoregressive_model(autoregressive_model_path)
  File "h:\ai-voice-cloning\modules\tortoise-tts\tortoise\api.py", line 363, in load_autoregressive_model
    self.autoregressive.load_state_dict(torch.load(self.autoregressive_model_path))
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UnifiedVoice:
        size mismatch for text_embedding.weight: copying a param with shape torch.Size([5024, 1024]) from checkpoint, the shape in current model is torch.Size([256, 1024]).
        size mismatch for text_head.weight: copying a param with shape torch.Size([5024, 1024]) from checkpoint, the shape in current model is torch.Size([256, 1024]).
        size mismatch for text_head.bias: copying a param with shape torch.Size([5024]) from checkpoint, the shape in current model is torch.Size([256]).
```

So, just trying to test this out, I downloaded the model and tokenizer from here: https://huggingface.co/ecker/coqui-xtts/tree/main/models
And I made sure to select them from the dropdowns, but I'm getting the error above. Any ideas?

Owner

Make sure you've updated tortoise-tts with:

```
cd .\modules\tortoise-tts\
git pull
```
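For context on the size-mismatch error in the previous comment: the converted XTTS AR carries a 5024-row text embedding, while stock TorToiSe constructs UnifiedVoice with a 256-token text vocab, so the AR has to be built to match the new tokenizer before load_state_dict() will accept the weights. A minimal sketch, assuming UnifiedVoice still exposes upstream TorToiSe's constructor arguments; the exact values here are assumptions, not this repo's code:

```python
# Sketch: build the AR with a text vocab sized to the XTTS tokenizer, then load
# the converted weights. Argument names follow upstream TorToiSe's UnifiedVoice;
# treat the values as assumptions rather than the project's actual settings.
import torch
from tortoise.models.autoregressive import UnifiedVoice

ar = UnifiedVoice(
    layers=30, model_dim=1024, heads=16,
    max_text_tokens=402, max_mel_tokens=604, max_conditioning_inputs=2,
    number_text_tokens=5023,  # +1 internal token, as in upstream, gives the 5024 rows in the checkpoint
)
ar.load_state_dict(torch.load("./models/tortoise/autoregressive.pth", map_location="cpu"))
```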


Yep, it updated, and I screwed up by not downloading the tokenizer correctly initially. However, even after this I get the following after re-creating the latents and selecting the model/diffusion/tokenizer:

```
H:\ai-voice-cloning>call .\venv\Scripts\activate.bat
Whisper detected
Traceback (most recent call last):
  File "H:\ai-voice-cloning\src\utils.py", line 85, in <module>
    from vall_e.emb.qnt import encode as valle_quantize
ModuleNotFoundError: No module named 'vall_e'

Traceback (most recent call last):
  File "H:\ai-voice-cloning\src\utils.py", line 105, in <module>
    import bark
ModuleNotFoundError: No module named 'bark'

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Loading TorToiSe... (AR: ./models/tortoise/autoregressive.pth, diffusion: ./models/tortoise/diffusion_decoder.pth, vocoder: bigvgan_base_24khz_100band)
Hardware acceleration found: cuda
use_deepspeed api_debug False
Loading tokenizer JSON: ./models/tokenizers/xtts.json
Loaded tokenizer
Loading autoregressive model: ./models/tortoise/autoregressive.pth
H:\ai-voice-cloning\venv\lib\site-packages\transformers\configuration_utils.py:363: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
  warnings.warn(
Loaded autoregressive model
Loaded diffusion model
Loading vocoder model: bigvgan_base_24khz_100band
Loading vocoder model: bigvgan_base_24khz_100band.pth
Removing weight norm...
Loaded vocoder model
Loaded TTS, ready for generation.
H:\ai-voice-cloning\venv\lib\site-packages\torchaudio\functional\functional.py:1458: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
  warnings.warn(
Saved voice latents: ./voices/labssamples/cond_latents_e4ce21ea.pth
[1/1] Generating line: test tset test.
Loading voice: 11labssamples with model e4ce21ea
Loading voice: 11labssamples
Reading from latent: ./voices/11labssamples//cond_latents_e4ce21ea.pth
Traceback (most recent call last):
  File "H:\ai-voice-cloning\venv\lib\site-packages\gradio\routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "H:\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "H:\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "H:\ai-voice-cloning\venv\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "H:\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "H:\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "H:\ai-voice-cloning\venv\lib\site-packages\gradio\helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "H:\ai-voice-cloning\src\webui.py", line 94, in generate_proxy
    raise e
  File "H:\ai-voice-cloning\src\webui.py", line 88, in generate_proxy
    sample, outputs, stats = generate(**kwargs)
  File "H:\ai-voice-cloning\src\utils.py", line 351, in generate
    return generate_tortoise(**kwargs)
  File "H:\ai-voice-cloning\src\utils.py", line 1211, in generate_tortoise
    gen, additionals = tts.tts(cut_text, **settings )
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "h:\ai-voice-cloning\modules\tortoise-tts\tortoise\api.py", line 799, in tts
    clvp = self.clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False)
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "h:\ai-voice-cloning\modules\tortoise-tts\tortoise\models\clvp.py", line 130, in forward
    text_latents = self.to_text_latent(masked_mean(self.text_transformer(text_emb, mask=text_mask), text_mask, dim=1))
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "h:\ai-voice-cloning\modules\tortoise-tts\tortoise\models\arch_util.py", line 368, in forward
    h = self.transformer(x, **kwargs)
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "h:\ai-voice-cloning\modules\tortoise-tts\tortoise\models\xtransformers.py", line 1252, in forward
    x, intermediates = self.attn_layers(x, mask=mask, mems=mems, return_hiddens=True, **kwargs)
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "h:\ai-voice-cloning\modules\tortoise-tts\tortoise\models\xtransformers.py", line 981, in forward
    out, inter, k, v = block(x, None, mask, None, attn_mask, self.pia_pos_emb, rotary_pos_emb,
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "h:\ai-voice-cloning\modules\tortoise-tts\tortoise\models\arch_util.py", line 345, in forward
    return partial(x, *args)
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "h:\ai-voice-cloning\modules\tortoise-tts\tortoise\models\xtransformers.py", line 717, in forward
    attn = self.attn_fn(dots, dim=-1)
  File "H:\ai-voice-cloning\venv\lib\site-packages\torch\nn\functional.py", line 1843, in softmax
    ret = input.softmax(dim)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

I also used the convert script to do it locally myself, but that threw the same error when I used those models. Seems linked to the new tokenizer?

Owner

How strange. I did do some tests with XTTS's tokenizer, but I didn't notice anything different. However, I did trigger some assertions when using [{lang}] Text, as it was trying to do redaction (the [I am very sad,] text stuff with wav2vec2), but I can easily fix that by editing the tokenizer or disabling redaction.

From the stack trace it looks like it's stemming from CLVP, and I wonder if it's because I've been doing my tests with Unsqueeze batch for CLVP/CVVP (or whatever on earth I called it) enabled in the web UI's settings. It'll do the CLVP/CVVP candidate sampling one by one rather than in batches. I suppose that should be the only difference maker, but I don't know why it would affect anything.



I wonder if it's because I made the changes here (https://huggingface.co/AOLCDROM/Tortoise-TTS-de) to use nanonomad's Latin tokenizer? Because the model runs on his tokenizer, but not on the one you supplied. The strange thing is that I get different accents each time (on nanonomad's tokenizer) and it sounds nothing like the samples (I tried multiple). Sometimes it would be in a CN, ES, or EN accent but speak in German, etc., different each generation. Maybe I need to revert to the standard files?

Reference: mrq/ai-voice-cloning#386