Tortoise->normalize->rvc #309
I've actually thought about running it through RVC to see how things get cleaned up. The output (finetuned or not) is fine, but the audio quality is a bit lacking and there are occasional issues in the actual speech here and there, so I imagine running it through RVC would help clean things up a lot. If it works out, I suppose it'll end up getting added to the web UI anyhow, and it can be used for TorToiSe in lieu of VoiceFixer (which I would like to replace, since for however long I've had it in the stack, it has consistently left some crackle at the end).
It would be a nice way to try to bridge the gap between "fine" enough output from my VALL-E model and "good enough to use", as I worry that manually training the model some more would take an astronomical amount of additional time (or data).
Originally posted by @mrq in /mrq/ai-voice-cloning/issues/152#issuecomment-1934
What I've found more specifically is that I can skate by with faster output from here (lower samples and lower iterations) because RVC seems to "boil" down the input audio and then reapply its own latents to it. If the input audio is already in the ballpark, it comes out nicer. How do I know this? I have TorToiSe trained on one dataset and RVC trained on a different dataset from 20 years later (same speaker). Despite the difference in sound due to age, it still blends very well across datasets because the speaker is the same. I've also tried using the same dataset for both and it sounds good as well, but I prefer the voice from the two datasets blended.
I can definitely understand the challenge of trying to train two models... RVC takes a couple of hours in my experience for 200-ish epochs. That said, it's mandatory for MD now because the quality is just night-and-day better as a final polish. I do also normalize the audio in between, btw.
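For anyone who wants to reproduce this chain outside the web UI, here is a minimal sketch of the Tortoise -> normalize -> RVC flow described above. It assumes the upstream tortoise-tts Python API rather than this repo's UI, ffmpeg on PATH for loudness normalization, and a placeholder command for the RVC step, since RVC forks expose different inference scripts; the voice name, sample/iteration values, and `rvc_infer.py` / `speaker.pth` names are illustrative, not anything confirmed in this thread.

```python
# Sketch of the Tortoise -> normalize -> RVC chain discussed above.
# Assumptions: upstream tortoise-tts API, ffmpeg installed, and a
# placeholder RVC inference command (substitute your fork's script).
import subprocess

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# 1) Generate with reduced settings; the idea is that RVC recovers much of
#    the fidelity lost by using fewer samples/iterations.
tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("myvoice")  # hypothetical voice folder
gen = tts.tts(
    "Text to synthesize.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    num_autoregressive_samples=16,  # well below the default
    diffusion_iterations=30,        # fewer iterations, faster output
)
torchaudio.save("tortoise_raw.wav", gen.squeeze(0).cpu(), 24000)

# 2) Normalize in between (EBU R128 loudness via ffmpeg's loudnorm filter).
subprocess.run(
    ["ffmpeg", "-y", "-i", "tortoise_raw.wav",
     "-af", "loudnorm=I=-23:TP=-1.5", "normalized.wav"],
    check=True,
)

# 3) Convert through a trained RVC model. Placeholder invocation: swap in the
#    inference script and arguments of whichever RVC fork you have installed.
subprocess.run(
    ["python", "rvc_infer.py", "--model", "speaker.pth",
     "--input", "normalized.wav", "--output", "final.wav"],
    check=True,
)
```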