mrq adding significant American accent to same voice samples from tortoise-fast-tts #219

Open
opened 2023-04-28 11:37:10 +00:00 by Acephalia · 7 comments

Just wondering if anyone has any idea why this could happen? I just switched over from tortoise-fast-tts a couple of days ago and added all my custom voice samples to new voices via the utilities tab. However, every .pth that is created has a significant American accent added to it.

The samples are identical, as mentioned, and fast-tts has no issue getting quite a good match on some of them. I'm assuming it must be a setting somewhere? Any guidance would be greatly appreciated!

My clips are all 22050 Hz WAV files between 5-10 secs.

Thanks in advance.

Owner

The fundamental difference between the two forks that would cause this is in the method of computing a voice's conditional latents. I can't remember how the 152334H fork handles computing its latents, but for sure with my fork you'll need to wrangle it by playing around with the voice chunk size and regenerating the latents, or by playing with your sample count.
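
For reference, a minimal sketch of computing and saving latents, assuming the upstream TorToiSe API that both forks build on (file paths are illustrative; my fork layers its chunking logic on top of this step):

```python
import torch
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

# Load the source clips at TorToiSe's expected 22050 Hz sample rate.
clips = [load_audio(p, 22050) for p in ["voices/myvoice/sample1.wav",
                                        "voices/myvoice/sample2.wav"]]

# Compute the conditioning latents. Regenerating these after changing how the
# clips are split or grouped is what alters the resulting voice (and accent).
cond_latents = tts.get_conditioning_latents(clips)

# Persist them as a .pth so they can be reloaded without recomputing.
torch.save(cond_latents, "voices/myvoice/cond_latents.pth")
```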

Author

Firstly, thank you so much for actively responding. Really appreciate your time.

Okay, that does make sense and it's along the lines of my assumptions: that something was happening when the latents were computed.

I'm actually not sure if that one produces a .pt file as yours does, to be honest. I've tried to figure it out but I can't find any corresponding files. But that doesn't make any sense, does it? Can it potentially run off just the sample files?

The generation time is faster on the fast fork, but yours produces much more stable outputs with some really nice nuances, which is why I really want to try and get the accent issue sorted.

I will do some experimenting today with the chunks and report back. Will leave the thread open to post my findings.

Owner

> I'm actually not sure if that one produces a .pt file as yours does, to be honest. I've tried to figure it out but I can't find any corresponding files.

From what I remember, the 152334H fork doesn't save the latents it generates, but it can load precomputed latents (since it's a feature inherent to TorToiSe anyway). So you can always throw latents generated with my fork into 152334H (I think the only thing is that you need the voice folder to *only* have the latents in it).
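
A minimal sketch of that behaviour, assuming the upstream `load_voice()` helper that both forks inherit (the voice name and paths are illustrative):

```python
from tortoise.utils.audio import load_voice

# e.g. after copying a cond_latents.pth generated by my fork into the other
# fork's voices/myvoice/ folder, with no wav files sitting alongside it:
voice_samples, conditioning_latents = load_voice("myvoice")

# When the folder contains only a single .pth, upstream load_voice() returns
# the precomputed latents instead of raw clips.
assert voice_samples is None
assert conditioning_latents is not None
```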

> The generation time is faster on the fast fork

I'm still not too sure why there's a discrepancy. I'd double-check what the sample batch size is set to under Settings (or manually set it). My only other guess is that it might have to do with the diffuser used.

Author

I spent about 10 hours trying to get something to sound similar on both but it just won’t.

Then I started trying to fix the streamlit webui and add some more settings into it. It’s a bit of a crap show lol.

I'm trying to figure out which parameters equate to the ones in your webui.

I first assumed `Number of Diffusion Steps` in the tortoise `api.py` = `Samples` in your GUI.

But then there are also `autoregressive samples`?

And `autoregressive batch count`, which I thought was the same as voice chunks?

But I’m just lost at this point. Gah!

Owner

> Number of Diffusion Steps

should map to `Iterations` (because it's the number of diffusion iterations taken to create the waveform).

> autoregressive samples

should map to `Samples` (because it's the samples it takes to pick the best amongst them).

> autoregressive batch count

should map to `Sample Batch Size` under `Settings`.

> same as voice chunks

Voice chunks is how many pieces your source voice is split into for computing the latents.
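
Put in terms of the upstream `tts()` call that both forks eventually reach, a rough sketch of where each knob lands (the values here are illustrative, not recommendations):

```python
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# "Sample Batch Size" maps to the constructor's autoregressive_batch_size
# (how many autoregressive samples are generated per batch).
tts = TextToSpeech(autoregressive_batch_size=16)

voice_samples, conditioning_latents = load_voice("myvoice")

gen = tts.tts(
    "Hello there.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    num_autoregressive_samples=256,  # "Samples" in my web UI
    diffusion_iterations=200,        # "Iterations" in my web UI
)
```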

Author

@mrq thank you for that, it actually helped a lot. I've completely dissected the streamlit app now and think I have a slightly better idea of what is going on, but I have no idea how it all ties together, so I will leave my findings here.

I decided to start at the `conditioning_latents`. So I pulled out the latents the 152334H fork was generating and ran them through yours. Hello, sweet Charles Dance. At this point I'm like: okay, it's just a difference in how the latents are computed (as we suspected).

But something was bugging me, so I went back to the API and the app. This is where I realised the 152334H fork has some weird way of computing the autoregressive batch count.

It has a value called `Steps`. This value is used to control the autoregressive samples (in some god-forsaken way). It seems to do some sort of calculation like `2 ** (sampler + 1)`, which is decided by which sampler is used. The weirdest part is that it is passed back to the tortoise API through `diffusion_iterations` (my head hurts). Maybe I'm completely off, but that's the only thing that isn't tracking right now.
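
If I'm reading it right, it boils down to something like the sketch below (a hypothetical reconstruction of what the app seems to do; the names and the exact arithmetic are my guesses, not lifted from the 152334H source):

```python
# Hypothetical reconstruction; sampler list and function name are illustrative.
SAMPLERS = ["P", "DDIM"]

def derive_diffusion_iterations(steps: int, sampler: str) -> int:
    # Steps appears to be rescaled by 2 ** (sampler + 1), where "sampler" is
    # the index of whichever sampler was picked, and the result is then passed
    # back into the tortoise api as diffusion_iterations.
    return steps // (2 ** (SAMPLERS.index(sampler) + 1))
```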

This also seems to somehow affect the way the latents are computed as well, which could explain the difference. But don't hold me to that; the how is way above my knowledge grade :/

The good thing is at least I can get the latents and push them into this version for a quick fix. If anyone wants to do the same, I forked a 152334H GUI version with more control [here](https://github.com/Acephalia/tortoise-tts-fast-GUI).

I can also confirm that fine-tuning definitely gives things a much better likeness and fixes the accent issues. Just more work and time. But as always, more effort = better reward.

Thank you for your amazing work again!


@Acephalia I've been running into a similar issue trying to clone David Attenborough's voice. With the mrq project, the output comes out with the accent noticeably diminished. I'd like to try your solution out but wanted to clear up a couple of things.

Is your repo basically the fast-TTS fork with the webUI from mrq built on top? The mrq repo has tools for preparing a dataset for training and for the subsequent training; are these also in your project?

I'm working with an AMD GPU, so building these things can be a huge PITA. As much info on what I'm getting into as possible before I start is valuable.

Thanks!
