Tortoise->normalize->rvc #309
I've actually thought about running it through RVC to see how things get cleaned up. The output (finetuned or not) is fine, but the audio quality is a bit lacking and there are occasional issues in the actual speech here and there, so I imagine running it through RVC would help clean things up a lot. If it works out, I suppose it'll end up getting added to the web UI anyhow, and it can be used for TorToiSe in lieu of VoiceFixer (which I would like to replace, since for however long I've had it in the stack, it has consistently left some crackle at the end).
It would be a nice way to try to bridge the gap between "fine" enough output from my VALL-E model and "good enough to use", as I worry that manually training the model some more would take an astronomical amount of additional time (or data).
Originally posted by @mrq in /mrq/ai-voice-cloning/issues/152#issuecomment-1934
What I've found more specifically is that I can skate by with faster output from here (lower samples and lower iterations) because RVC seems to "boil" down the input audio and then reapply its own latents to it. If the input audio is already in the ballpark, it comes out nicer. How do I know this? I have TorToiSe trained on one dataset and RVC trained on a different dataset from 20 years later (same speaker). Despite the difference in sound due to age, it still blends very well across datasets because the speaker is the same. I've also tried using the same dataset for both and it sounds good as well, but I prefer the voice from the two datasets blended.
I can definitely understand the challenge of trying to train two models... RVC takes a couple of hours in my experience for 200-ish epochs. That said, it's mandatory for MD now because the quality is just night-and-day better as a final polish. I do also normalize the audio in between, btw.
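For anyone who wants to reproduce this chain outside the web UI, here is a minimal sketch of the Tortoise -> normalize -> RVC flow described above. It assumes the upstream tortoise-tts Python API rather than this repo's UI, ffmpeg on PATH for loudness normalization, and a placeholder command for the RVC step, since RVC forks expose different inference scripts; the voice name, sample/iteration values, and `rvc_infer.py` / `speaker.pth` names are illustrative, not anything confirmed in this thread.

```python
# Sketch of the Tortoise -> normalize -> RVC chain discussed above.
# Assumptions: upstream tortoise-tts API, ffmpeg installed, and a
# placeholder RVC inference command (substitute your fork's script).
import subprocess

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# 1) Generate with reduced settings; the idea is that RVC recovers much of
#    the fidelity lost by using fewer samples/iterations.
tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("myvoice")  # hypothetical voice folder
gen = tts.tts(
    "Text to synthesize.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    num_autoregressive_samples=16,  # well below the default
    diffusion_iterations=30,        # fewer iterations, faster output
)
torchaudio.save("tortoise_raw.wav", gen.squeeze(0).cpu(), 24000)

# 2) Normalize in between (EBU R128 loudness via ffmpeg's loudnorm filter).
subprocess.run(
    ["ffmpeg", "-y", "-i", "tortoise_raw.wav",
     "-af", "loudnorm=I=-23:TP=-1.5", "normalized.wav"],
    check=True,
)

# 3) Convert through a trained RVC model. Placeholder invocation: swap in the
#    inference script and arguments of whichever RVC fork you have installed.
subprocess.run(
    ["python", "rvc_infer.py", "--model", "speaker.pth",
     "--input", "normalized.wav", "--output", "final.wav"],
    check=True,
)
```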