Assorted feedback and some questions #41
First, thank you for making this. I know you didn't write the engine, but you're trying to make it "foolproof", and it's a huge help. Today was my first time using voice generation and I'm blown away.
I signed up so I could help improve the wiki from the perspective of a beginner like me, but I can't seem to edit the wiki, so I'll post my feedback here and hopefully you can do it. I also have a few questions.
Feedback
Feedback 1: not clear how to get started
You cover installation and collecting samples, but there's nothing about the next step. As a new user, I had no idea whether I was supposed to collect audio samples, then write a prompt, then click Generate (like img2img), or whether I was supposed to train a new voice first from the Training tab. It didn't help that almost anything I did resulted in a crash due to my low VRAM. I figured it out by trial and error, but basically there's a step missing on the wiki that would help beginners. Call it "Basic usage and testing". I would include this:
(The "attached clip" is a 5-second sample of JRE I attached to this issue, you could link that on the wiki)
Feedback 2: Voice directory is confusing/wrong
There's this part in the wiki under Collecting Samples: "After preparing your clips as WAV files at a sample rate of 22050 Hz, open up the tortoise-tts folder you're working in, navigate to the voices folder, create a new folder in whatever name you want, then dump your clips into that folder. While you're in the voice folder, you can take a look at the other provided voices. !NOTE!: Before 2023.02.10, voices used to be stored under .\tortoise\voices, but has been moved up one folder. Compatibily is maintained with the old voice folder, but will take priority."
This is a lot of confusing text, and part of it is wrong: it says to navigate to tortoise-tts/voices, but that folder isn't actually checked; only ./voices/ is. Replace this whole paragraph with "Place your .wav files in ai-voice-cloning/voices/my-new-voice-1".
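For illustration, that boils down to something like this on the command line (the folder and file names are just placeholders; use whatever voice name you want):
mkdir -p ai-voice-cloning/voices/my-new-voice-1
cp clip1.wav clip2.wav ai-voice-cloning/voices/my-new-voice-1/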
Also, it says "take a look at the other provided voices": there are no provided voices. I guess you deleted them and forgot to update the wiki.
Feedback 3: 22050Hz sample rate info is messy
First, I'm not sure if we actually need 22050. I created a voice folder with a 48k wav and it worked fine (i.e. the output sounds identical to the 22kHz sample to me).
If it does have a not-so-obvious effect on quality, then know that the advice on the Collecting Samples page to use the convert.bat script in /tortoise/convert/ is wrong.
That script (just a basic "ffmpeg -i input.mp3 -ac 1 output.wav") keeps the same sample rate as the input file. At least for the Joe Rogan sample from the Mega you linked, I got a .wav with a 48k sample rate. The proper command for a 22k wav is:
ffmpeg -i input.ogg -ar 22050 output.wav
An alternative command which uses the correct sample rate AND creates a 5-second clip from between seconds 2 and 7:
ffmpeg -i input.ogg -ar 22050 -ss 00:00:02.000 -to 00:00:07.000 output.wav
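And in case it helps, a rough sketch for batch-converting a whole folder of clips (assuming bash, and assuming 22050 Hz mono is what's actually wanted; adjust the paths and extension to taste):
# hypothetical layout: raw clips sit in ./raw, converted clips go straight into the voice folder
mkdir -p voices/my-new-voice-1
for f in raw/*.ogg; do
  ffmpeg -i "$f" -ar 22050 -ac 1 "voices/my-new-voice-1/$(basename "${f%.*}").wav"
done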
So if 22k doesn't matter, remove any mention of sample rate from the wiki.
Questions
Question 1: can you please explain why I would want to train a voice? Voice cloning seems to work fine from just an input voice file (at least for English). So what is the benefit of the Training tab?
Question 2: can you document the prompt syntax, if any? How can we make it speak different lines with different emotions, and emphasize specific words?
My biggest issue for basic usage is that it seems to ignore question marks: questions like "What's up? You called me?" are pronounced as "What's up. You called me."
Question 3: my gens with an input voice that's just a 5-sec clip of Joe Rogan came out sounding perfectly like him, I was amazed! But when I tried a 10-sec clip of my own voice blandly reading Wikipedia into my mic, it didn't sound as good. I think it's mainly because the output had a British accent, whereas my real voice when I speak English has a French accent. Can anything be done to have the English output use my own French accent?
Question 4: related to Q2, how do foreign languages work? I tried inputting a French audio sample and writing a prompt in French, but the output is wrong: it's clearly trying to (incorrectly) pronounce those words as if they were English words.
Question 5: could you document what Voice Chunk does? It's not mentioned in the Glossary. On a 2m15s reference clip, setting it to 1 requires 5.9GB of VRAM; setting it to 4 requires 3.1GB. I guess this is splitting the audio input into multiple smaller/easier tasks. Does this have an effect on quality?
I'll most definitely read this later, but since I just woke up (and might fall back asleep), I'll just say:
This really highlights how neglected the wiki has been, as I've been dumping more and more of my focus into the web UI itself rather than into it being just a document on how to voice clone. Just keep that in mind.
It needs some cleanup, yes, but my priorities (and free time) have been disheveled for the past two weeks or so.
Alright, I'm relatively awake now and should have some time.
I've combed through the wiki and cleaned it up to address the feedback; I guess there was a lot more outdated info than I remembered. I'll reiterate that those errors are remnants from when:
Now for the questions:
As a straightforward (English) speech synthesis tool, it works perfectly fine, even with a random voice. Some voices I ran through it also give very decent output. However, there's a seemingly clear point where some voices don't "synergize" well with the base model, as the base model was (seemingly) mostly trained on male audiobook narrators.
For other voices outside of that (non-English voices, a majority of female voices), finetuning comes in. If you can spare the compute time, you can most definitely get better output for a specific voice (if using a narrow, specific dataset) or for a better range of voices (if using a very, very large dataset). I suppose this is more of an analog to finetuning the Stable Diffusion model itself into one of the many variants out there.
In short, if the base model works fine with what you got, great, stick with it. If you're using a voice that's above and beyond what the base model likes, or in another language, then finetune the model for the voice.
I honestly don't have any outside of what the base tortoise-tts offers.
If anything, it's like 11.AI where you just have to try and see what works for a prompt, and keep editing. A lot of the nuances aren't really able to be directly influenced, just suggested.
But unlike 11.AI, you can use prompt editing/"engineering" with words in brackets. I genuinely can't come up with any examples, as it's just something you need to toy with, but it will generate the sentence while redacting what's in brackets from the output, keeping how the bracketed text shapes the delivery.
If I remember right, in some of my testing, I was able to get some emphasis with quotation marks. I feel I've had pretty good consistency with question marks, given that a lot of my testing is based on James Sunderland of Silent Hill 2 saying "Is that really you, Mary?", and it's reproduced relatively right.
Funny, that's similar to what I felt with Joe Rogan: a conveniently perfect output, while a rather mundane but very normal American-English voice clip turned out rather British.
My cope is that it's how the voice latents are computed, as there's some magic value for a given dataset (the voice, the sentences, how they're structured, etc.; too many variables to really figure out a "good" value for given inputs) on what to chunk them at. I've tried playing around with it, since my underwhelming output came after I added the voice chunk method.
In my very early testing with TorToiSe, I tried some very basic katakana and it was able to pronounce a bit of it right, but it immediately fell apart after a short while.
Finetuning, though, with a small dataset of Japanese tremendously helped it reproduce actual Japanese, albeit with some quirks (which I have briefly documented in the Training section).
I imagine for even different accents, not just different languages, you'll need to finetune against those accents.
Right, I could have sworn I did, but I think it was on the rentry/old README and it managed to get overwritten with an older copy of either, so it probably got lost and I didn't realize.
The glossary was written well before I dove in and made a lot of improvements. And desu the glossary more so covers technical terms for voice generation itself, rather than concepts I've added on top of TorToiSe.
Voice Chunking refers to how a voice's conditional latents (a brief model that captures a voice's traits) are computed. The original TorToiSe only used the first ~four seconds of a voice file. I've played around with other ways to remedy this, as the way it originally did it left some room to play with. Fast forward to how it ended up: letting the user decide how many pieces the combined voice clip is sliced into. This originally was to work around exhausting remaining VRAM, but some other results emerged.
Larger chunks (a smaller chunk count) tend to process faster at the cost of higher VRAM, yes, similar to batch sizes.
Depending on how you slice the combined voice clip, you'll get better (or worse) generated output. It most definitely is a value you play around with, as it noticeably does affect the output, and there's always a "sweet" spot value. I assume this is entirely dependent on the content of the provided audio inputs (desu I'm having trouble trying to express the concept, I'll clarify it better later in the documentation).
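(As a rough illustration only, and assuming the slices come out roughly equal, which may not be exactly how the splitting works: the 2m15s clip from Question 5 at a chunk count of 4 would be cut into ~34-second pieces, each conditioned separately, which lines up with the VRAM dropping from 5.9GB to 3.1GB.)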
Apologies if anything sounds off, my brain's been fried the past two weeks, and I feel like I'm getting close to frying it again today.
Thanks for getting back to me, especially since you seem so busy.
If you don't mind giving me wiki edit access, I'll help out on that front. I won't write about anything I don't understand (like anything related to the science or tools). Just beginner-oriented stuff. If you're worried about vandalism, can you give me write permissions to a single Beginner Walkthrough page and I'll stick to that?
Btw, while helping a friend who also has a 6GB card set it up, I found that it's essential to apply your suggested changes from #25 to be able to generate without running out of VRAM. So the Quick Test guide I wrote in the OP is incorrect; it won't work on a 6GB card without those changes.
Doesn't seem I can give specific access to the wiki, I believe I'd have to grant you access as a collaborator for this repo entirely.
I'll get around to rewriting most of it eventually.
Yeah, it's what reduces VRAM significantly enough to get it to train on my 2060. Recent installs should be able to use it seamlessly, but any setups before that would need to set it up manually, as per the instructions there.
In that case, how about pasting what I wrote here into a new Beginner Walkthrough page in the short term? When you have free time you can improve the wiki. I promise it will help beginners: I'm a techie but new to AI, and I had no clue what to do (plus it didn't help that everything I tried caused a VRAM error and I didn't know if there were minimum requirements).
Attach the .wav file from the OP to the page.
So do we need 22kHz or not? Am I fine to use 44 and 48? Also, does mono/stereo matter?
Thank you for this repo, btw. I was chugging along with tortoise fast and this is much much better.
I was curious about this too.. is the /voices directory meant to be empty?
It's fine if so, I just wanted to check as I'm concerned other things might be missing too!