mrq adding significant American accent to same voice samples from tortoise-fast-tts #219
Just wondering if anyone has any idea why this could happen? I just switched over from tortoise-fast-tts a couple of days ago and added all my custom voice samples as new voices via the Utilities tab. However, every .pth that is created has a significant American accent added to it.
The samples are identical, as mentioned, and fast-tts has no issue getting quite a good match on some of them. I'm assuming it must be a setting somewhere? Any guidance would be greatly appreciated!
My clips are all 22050 Hz WAV files between 5-10 seconds.
Thanks in advance.
The fundamental difference between the two forks that would cause this is the method of computing a voice's conditioning latents. I can't remember how the 152334H fork computes its latents, but with my fork you'll definitely need to wrangle it by playing around with the voice chunk size and regenerating the latents, or by adjusting your sample count.
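If you want to experiment outside the web UI, here's a rough sketch against the upstream tortoise-tts API that both forks build on. The voice folder path and output filename are placeholders, and each fork layers its own chunking on top of this step, so treat it as a starting point rather than either fork's exact code path:

```python
# Sketch: recompute a voice's conditioning latents from raw clips and save
# them, so the same latents can be A/B tested across forks. Based on the
# upstream tortoise-tts API; my fork additionally slices the samples by the
# "voice chunk size" before this step, which is the knob worth varying.
from pathlib import Path

import torch
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

VOICE_DIR = Path("voices/my_voice")  # placeholder voice folder

tts = TextToSpeech()
# TorToiSe works with 22050 Hz audio; load_audio resamples on load.
samples = [load_audio(str(p), 22050) for p in sorted(VOICE_DIR.glob("*.wav"))]

# Try different clip subsets or counts; the resulting latents (and with
# them the accent of the output) will shift.
latents = tts.get_conditioning_latents(samples)
torch.save(latents, VOICE_DIR / "cond_latents.pth")  # illustrative filename
```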
Firstly, thank you so much for actively responding. I really appreciate your time.
Okay, that does make sense, and it's along the lines of my assumption that something was happening when the latents were computed.
I'm actually not sure if that fork produces a .pt file the way yours does; to be honest, I've tried to figure it out but I can't find any corresponding files. But that doesn't make any sense, does it? Can it potentially run off just the sample files?
The generation time is faster on the fast fork, but yours produces much more stable outputs with some really nice nuances, which is why I really want to get the accent issue sorted.
I will do some experimenting today with the chunks and report back. I'll leave the thread open to post my findings.
From what I remember, the 152334H fork doesn't save the latents it generates, but it can load precomputed latents (since that's a feature inherent to TorToiSe anyway). So you can always throw latents generated with my fork into 152334H (I think the only catch is that the voice folder needs to contain only the latents).
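For reference, a small sketch of that handoff. The voice and file names are made up, and the "only the latents" caveat is my reading of how upstream's `load_voice()` behaves (it only short-circuits to precomputed latents when the voice folder resolves to a single .pth file):

```python
# Sketch: drop latents generated by one fork into another fork's voice
# folder. Upstream load_voice() returns (None, latents) when the folder
# contains exactly one file and it ends in .pth; otherwise it returns the
# raw wav samples and recomputes latents itself.
from tortoise.utils.audio import load_voice

# Assumes voices/charles/ (in a directory TorToiSe scans) contains only
# cond_latents.pth copied over from the other fork.
voice_samples, conditioning_latents = load_voice("charles")
assert voice_samples is None  # no wavs loaded; the latents came straight in
```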
I'm still not too sure why there's a discrepancy. I'd double-check what the sample batch size is set to under Settings (or manually set it). My only other guess is that it has to do with the diffuser used.
I spent about 10 hours trying to get something to sound similar on both, but it just won't.
Then I started trying to fix the Streamlit web UI and add some more settings to it. It's a bit of a crap show lol.
I'm trying to figure out which parameters equate to which in your web UI.
I first assumed Number of Diffusion Steps in the tortoise api.py = Samples in your GUI.
But then there are also autoregressive samples?
And autoregressive batch count, which I thought was the same as voice chunks?
But I'm just lost at this point. Gah!
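This is my current best-guess mapping so far; every pairing below is an assumption on my part, so corrections welcome:

```python
# Rough working map between tortoise's api.py keyword arguments and the
# mrq web UI controls. All pairings are educated guesses from reading both
# codebases, not confirmed equivalences.
PARAM_MAP = {
    # tortoise api.py parameter   -> mrq web UI control (assumed)
    "num_autoregressive_samples": "Samples",
    "diffusion_iterations": "Iterations",
    # Batch size for the autoregressive pass; seemingly NOT the same thing
    # as voice chunks, which only affect how conditioning latents are built.
    "autoregressive_batch_size": "Sample Batch Size (under Settings)",
}
```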
@mrq thank you for that, it actually helped a lot. I've completely dissected the Streamlit app now and think I have a slightly better idea of what is going on, but I have no idea how it all ties together. So I will leave my findings here.
I decided to start at the conditioning latents. So I pulled out the latents fork 152334H was generating and ran them through yours. Hello, sweet Charles Dance. At this point I'm like: okay, it's just a difference in how the latents are computed (as we suspected).
But something was bugging me, so I went back to the API and the app. This is where I realised fork 152334H has some weird way of computing the autoregressive batch count.
It has a value called `Steps`. This value is used to control the autoregressive samples (in some god-forsaken way). It seems to do some sort of calculation like `2 ** (sampler + 1)`, which is decided by which sampler is used. The weirdest part is that it gets passed back to the tortoise API through `diffusion_iterations`. (My head hurts.) Maybe I'm completely off, but that's the only thing that isn't tracking right now. This also seems to somehow affect the way latents are computed, which could explain the difference. But don't hold me to that; the how is way above my knowledge grade :/
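Paraphrasing what I think is happening (the function name and exact indexing here are my guesses, not the app's actual code):

```python
# My reading of the 152334H streamlit app, not confirmed against its
# source: the "Steps" slider never reaches the API directly. Instead,
# something like this sampler-dependent power of two gets derived and
# passed along to the tortoise API as diffusion_iterations.
def suspected_diffusion_iterations(sampler_index: int) -> int:
    # sampler_index: whichever sampler the user picked in the UI
    return 2 ** (sampler_index + 1)
```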
The good thing is that at least I can grab the latents and push them into this version for a quick fix. If anyone wants to do the same, I forked a 152334H GUI version with more control here.
I can also confirm that fine-tuning definitely gives things a much better likeness and fixes the accent issues. It's just more work and time, but as always, more effort = better reward.
Thank you again for your amazing work!
@Acephalia I've been running into a similar issue trying to clone David Attenborough's voice. With the mrq project, the output comes out with the accent noticeably diminished. I'd like to try your solution out but wanted to clear up a couple of things.
Is your repo basically the fast-TTS fork with the web UI from mrq built on top? The mrq repo has tools for preparing a dataset for training and for the subsequent training...are these also in your project?
I'm working with an AMD GPU, so building these things can be a huge PITA. As much info as possible on what I'm getting into before I start would be valuable.
Thanks!