Implementing XTTS by coqui? #386
https://huggingface.co/coqui/XTTS-v1
https://huggingface.co/spaces/coqui/xtts
It's very fast and the quality is pretty good, though the cloning isn't perfect... but I don't know if you can figure out how to improve it, make it fine-tunable, or use its inference to improve the speed of other models? Not sure, honestly, but it's pretty cool.
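For anyone wanting to kick the tires, a minimal sketch through Coqui's TTS package; the registered model name below is my assumption for XTTS v1, so treat it as such:

```python
# Minimal try-out via Coqui's TTS API. The model name is an assumption for
# what XTTS v1 was registered under; swap in the correct one if it differs.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1", gpu=True)
tts.tts_to_file(
    text="Hello there!",
    speaker_wav="reference.wav",  # short clip of the voice to clone
    language="en",
    file_path="out.wav",
)
```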
I was also wondering if you'd like to join the Synthetic Voices Discord, since we use your project a lot there and share work. It would be great to have you in there.
https://discord.gg/xUrTtfc9BT
I'll take a gander at it when I get a chance.
Oh...
For TorToiSe, what it boasts it can do is very promising, especially if their proprietary copy supports streamed inference (which I highly doubt; TorToiSe just isn't an arch you can stream output from without gutting it).
Again, I'll take a gander when I can.
Think I've got a good understanding of how it functions. It'd be better to simply "subjugate" (for lack of a better term) the XTTS model and vocab than to implement a separate backend and delve further into the spaghetti of AIVC's code, with proper attributions and license notice of course.

The AR (`TTS.tts.layers.xtts.GPT`) seems to take almost the same model parameters outside of some extra bits to account for a larger vocab (denoted in the model JSON): language tokens (`[en]`, `[ja]`, etc.) and a bunch of per-language text preprocessing (e.g. the `japanese.json` tokenizer).

If anything, I don't think I even have to really bother with modifying the UnifiedVoice parameters or the tokenizer. I very much wouldn't be surprised if the model was trained off of the existing TorToiSe weights with the new vocab, but that's just speculation.
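If that holds, a first sanity check could be as simple as the sketch below; the checkpoint key layout ("model"/"gpt.") and the vocab size here are guesses on my part, not confirmed:

```python
# Hypothetical sanity check: do the XTTS AR weights slot into TorToiSe's
# UnifiedVoice? The "model"/"gpt." key layout and the vocab size are guesses.
import torch
from tortoise.models.autoregressive import UnifiedVoice

VOCAB_SIZE = 6681  # stand-in; read the real size from the XTTS model JSON

ckpt = torch.load("xtts/model.pth", map_location="cpu")  # unpickling needs Coqui installed
state = {k.removeprefix("gpt."): v for k, v in ckpt["model"].items() if k.startswith("gpt.")}

ar = UnifiedVoice(number_text_tokens=VOCAB_SIZE)  # bumped for the larger multilingual vocab
missing, unexpected = ar.load_state_dict(state, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```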
I don't know. I shouldn't be that peeved by this and call it a perverted injustice; after all, I originally started working on TorToiSe to try and jumpstart interest in it, and it has. It's just... kind of gross that it's only briefly mentioned as "Built on Tortoise :)" when it so very much is TorToiSe. Then again, it's not my place to get peeved over it, as I'm not neonbjb, and I've probably done my own fair share of perverted injustices working on TorToiSe, but at the very least I have the morality to credit it properly.
I'll spend some time fiddling with getting the weights loaded under mrq/tortoise-tts; it shouldn't be that hard. If it works, I'll offer a flag or something one can use to make use of the weights (again, with proper attributions and licenses attached) in the web UI.
I do not expect this to be any more fantastic than the base model; TorToiSe has its own limitations, and hats off to neonbjb for the countless hours and sweat he poured into it. For all the flack I give it, it's still fantastic when it works, and getting a model trained is no easy task. But at the end of the day, a GPT-2 transformer that takes in speech latent features and an odd tokenizer to spit out tokens representing a mel spectrogram is always going to be limiting. Fantastic for its time, but there are better things, and you can't train out those issues. Hell, even after addressing those issues in VALL-E, there are still issues.
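For context, the pipeline that paragraph describes is roughly shaped like this; every callable is a stand-in for the corresponding real stage, not the actual API:

```python
# Rough shape of the TorToiSe pipeline described above. Every callable here is
# a stand-in for the corresponding real stage, not the actual API.
def tortoise_pipeline(text, reference_wavs, tokenizer, cond_encoder, ar_gpt2, diffusion, vocoder):
    latents = cond_encoder(reference_wavs)      # speech latent features from the reference voice
    text_tokens = tokenizer(text)               # the small, "odd" text vocab
    mel_tokens = ar_gpt2(latents, text_tokens)  # GPT-2 AR emits mel-codebook tokens
    mel = diffusion(mel_tokens, latents)        # decode the tokens into a mel spectrogram
    return vocoder(mel)                         # mel -> waveform (BigVGAN in mrq's fork)
```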
So in a sense... what benefit could you get out of it? Inference speed? Or quality?
Hard to say until I actually use the model, both with Coqui and as a subjugated TorToiSe model.
Purely theoretically (in short): quality yes, speed maybe.
Again, purely theoretical. I do expect it to be a quality boost, especially under a TorToiSe implementation with BigVGAN, and there's a chance it performs faster under Coqui. I just don't expect it to magically fix all of TorToiSe's issues, and using it with Coqui seems like a bit of a downgrade.
I need to figure out how to go about getting TorToiSe fired up, as I forgot I slotted out my 2060, and my 6800XTs are in my personal rig, but I could always just pause training on one of my GPUs and fiddle in TorToiSe with the XTTS model.
Although, I realized something: the weights for XTTS would need a little work to slot into TorToiSe. The pickled file clocks in at 2.9GiB, while base TorToiSe clocks in at 1.7GiB. I imagine some other models' weights are included, so either I need additional code to handle it, or I have to redistribute the weights, which, as long as I provide attribution and the license, should be daijoubu.
The provided XTTS weights do include the AR, diffusion, and vocoder, but loading the pickled file requires Coqui to already be installed for the 'config' entry, so the best option for now is just to redistribute a slightly modified copy of the pickled weights.
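The conversion itself should be trivial; a sketch, assuming the Coqui-bound entry is just the 'config' key mentioned above:

```python
# One-time conversion: unpickle once in an environment that *has* Coqui (the
# 'config' entry is a Coqui object), drop that entry, and re-save plain
# tensors that load anywhere.
import torch

ckpt = torch.load("xtts/model.pth", map_location="cpu")
ckpt.pop("config", None)  # the Coqui-specific bit forcing the import
torch.save(ckpt, "xtts/model.converted.pth")
```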
Alrighty, I've converted the weights over and did some small tweaks to load it in TorToiSe. When it's uploaded, it'll be under https://huggingface.co/ecker/coqui-xtts.
Samples (using Mitsuru from Persona 4 as the reference voice; the text prompt, I'm pretty sure, was just something lolsorandum xD left over from testing VALL-E under Windows):
My thoughts:

- Prompt redaction (the `[I am happy],` stuff) breaks, since `[en] Something` will trigger the redaction code and throw an error. I suppose the one responsible for the implementation couldn't be bothered with some logic to make it work. I can patch this myself by, instead, having `{en}` and whatnot in the tokenizer vocab; it's necessary if a user wants to use it and leverage the cross-lingual linguistics (a stopgap sketch is at the end of this comment).
- The glue lives in `./tortoise/api.py`'s `TTS()`. I'm rather embarrassed by how it looks.

To reiterate, this is when using the model under TorToiSe. There could be some magic in using it under Coqui that gets the voice clonability to be actually useful, but I doubt it. Finetunes can definitely be done, but the DLAS config YAML needs to be modified to use the XTTS weights.
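As for the redaction stopgap mentioned above, the quickest fix is just flipping TorToiSe's own switch; `enable_redaction` is a real `api.py` parameter, the rest is illustrative:

```python
# Quickest stopgap: turn off redaction so wav2vec2 never sees the "[en] ..."
# language tags. enable_redaction is an actual TorToiSe api.py knob.
from tortoise.api import TextToSpeech

tts = TextToSpeech(enable_redaction=False)
```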
It would be great to have their model, or maybe their tokenizer/cleaners, because I feel their multilingual model is far better at pronouncing words (in German, on the one example I tried) compared to the dataset I've been training on (with hours of data), plus nanonomad's advice/model on foreign utterances.
So, just trying to test this out, I downloaded the model and tokenizer from here: https://huggingface.co/ecker/coqui-xtts/tree/main/models
![image](/mrq/ai-voice-cloning/attachments/904283b4-7e39-45ec-87a5-bbb0a782e0ec)
And I made sure to select them from the drop-downs, but I'm getting the error above. Any ideas?
Make sure you've updated tortoise-tts with:
Yep, it updated, and I screwed up by not downloading the tokenizer correctly initially. However, even after this, I get the following after re-creating the latents and selecting the model/diffusion/tokenizer:
I also used the convert script to do it locally myself, but that threw the same error when I used those models. Seems linked to the new tokenizer?
How strange; I did do some tests with XTTS's tokenizer, but I didn't notice anything different. However, I did trigger some assertions when using `[{lang}] Text`, as it was trying to do redaction (the `[I am very sad,] text` stuff with wav2vec2), but I can easily fix that by editing the tokenizer or disabling redaction.

From the stack trace, it looks like it's stemming from CLVP. I wonder if it's because I've been doing my tests with the `Unsqueeze batch for CLVP/CVVP` setting (or whatever on earth I called it in the web UI's settings) enabled; it'll do the CLVP/CVVP candidate sampling one by one rather than in batches. I suppose that should be the only difference maker, but I don't know why it would affect anything.
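Illustratively, what that setting changes; the scoring callable is a stand-in name, not the actual function:

```python
# What "unsqueeze batch" boils down to: score CLVP candidates one at a time
# instead of in a single batched call. clvp_score is a stand-in callable.
import torch

def score_candidates(clvp_score, text_tokens, candidates, unsqueeze_batch=True):
    if unsqueeze_batch:
        # one at a time: lower peak VRAM, slower, and a different batch shape
        return [clvp_score(text_tokens, c.unsqueeze(0)) for c in candidates]
    # batched: a single call over all candidates at once
    return clvp_score(text_tokens, torch.stack(candidates))
```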
I wonder if it's because I made the changes here to use nanonomad's Latin tokenizer? The model runs on his tokenizer but not on the one you supplied. The strange thing is that I get different accents each time (on nanonomad's tokenizer), and it sounds nothing like the samples (I tried multiple). Sometimes it would be in a CN, ES, or EN accent but speak in German, etc., different each generation. Maybe I need to revert to the standard files?