Assorted feedback and some questions #41

Closed
opened 2023-02-26 11:01:16 +07:00 by ramirez3 · 7 comments

First, thank you for making this. I know you didn't write the engine, but you're trying to make it "foolproof", and it's a huge help. Today was my first time using voice generation and I'm blown away.

I signed up so I could help improve the wiki from the perspective of a beginner like me, but I don't seem to be able to edit the wiki, so I'll post my feedback here and hopefully you can apply it. I also have a few questions.

Feedback

Feedback 1: not clear how to get started
You cover installation and collecting samples, but there's nothing about the next step. As a new user, I had no idea whether I was supposed to collect audio samples, write a prompt, and click Generate (like img2img), or whether I was supposed to train a new voice first from the Training tab. It didn't help that almost anything I did resulted in a crash due to my low VRAM. I figured it out by trial and error, but basically there's a step missing from the wiki that would help beginners. Call it "Basic usage and testing". I would include this:

Quick test of your setup:
-Create a 5-second voice clip, or use the attached clip, and place it under ai-voice-cloning/voices/MyTest/clip.wav
-Click Refresh Voice List in the bottom left of the UI, then select MyTest in the Voices dropdown
-Under Prompt, write the text you would like to produce, such as "Chimpanzees are truly a beautiful animal"
-Set preset to Ultra Fast
-Click Generate
-Wait for the results to appear in the UI, click Play to listen to them

The above is confirmed to work on a 1060 with 6GB VRAM, and will probably work on a 4GB card. Note that VRAM requirements increase with the length of your data samples. If the above test works for you, but future generations with other voice files fail with an error, you likely ran out of VRAM. Use shorter audio clips. You may also reduce VRAM usage by increasing Voice Chunks. Experiment as necessary.

The lengthiest inference attempted on a 6GB card was a 2min15sec clip, on Ultra Fast, with 1 Voice Chunk. It required exactly 5.9GB VRAM.
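By the way, here's a small sanity check I run on clips before testing; it could go on that page too. This is purely my own helper, not part of the repo, and it assumes the `soundfile` Python package is installed:

```
# Hypothetical helper, not part of ai-voice-cloning: print a clip's
# sample rate, channel count, and duration before dropping it into
# ai-voice-cloning/voices/MyTest/.
import soundfile as sf

def check_clip(path):
    info = sf.info(path)
    print(f"{path}: {info.samplerate} Hz, {info.channels} channel(s), {info.duration:.1f} s")
    if info.duration > 10:
        # Longer clips need more VRAM; trim them or raise Voice Chunks.
        print("Warning: clip is fairly long, VRAM usage will be higher.")

check_clip("ai-voice-cloning/voices/MyTest/clip.wav")
```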

(The "attached clip" is a 5-second sample of JRE I attached to this issue, you could link that on the wiki)

Feedback 2: Voice directory is confusing/wrong
There's this part in the wiki under Collecting Samples: "After preparing your clips as WAV files at a sample rate of 22050 Hz, open up the tortoise-tts folder you're working in, navigate to the voices folder, create a new folder in whatever name you want, then dump your clips into that folder. While you're in the voice folder, you can take a look at the other provided voices. !NOTE!: Before 2023.02.10, voices used to be stored under .\tortoise\voices, but has been moved up one folder. Compatibily is maintained with the old voice folder, but will take priority."

This is a lot of confusing text. You're actually giving wrong info: "navigate to tortoise-tts/voices", but that folder isn't actually checked, only /voices/ is. Replace this whole paragraph with "Place your .wav files in ai-voice-cloning/voices/my-new-voice-1".

Also, it says "take a look at the other provided voices": there are no provided voices. I guess you deleted them and forgot to update the wiki.

Feedback 3: 22050Hz sample rate info is messy
First, I'm not sure if we actually need 22050 Hz. I created a voice folder with a 48 kHz WAV and it worked fine (i.e. the output sounds identical to the 22 kHz sample to me).

If it has a not-so-obvious effect on quality, then note that the advice on the Collecting Samples page saying to use the convert.bat script in /tortoise/convert/ is wrong.

That script (just a basic "ffmpeg -i input.mp3 -ac 1 output.wav") keeps the same sample rate as the input file. At least for the Joe Rogan sample from the Mega you linked, I got a .wav with a 48 kHz sample rate. The proper command for a 22 kHz WAV is:
ffmpeg -i input.ogg -ar 22050 output.wav

An alternative command which uses the correct sample rate AND cuts a 5-second clip from between seconds 2 and 7:
ffmpeg -i input.ogg -ar 22050 -ss 00:00:02.000 -to 00:00:07.000 output.wav
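And if 22050 Hz does end up mattering, a batch version of the above might be handy for the wiki. This is just a rough sketch of mine (the folder names are made up; it assumes ffmpeg is on the PATH and Python 3 is available):

```
# Hypothetical batch converter: resample every clip in a folder to a
# 22050 Hz mono WAV using ffmpeg, writing the results into a voice folder.
import subprocess
from pathlib import Path

src = Path("raw_clips")                         # made-up input folder
dst = Path("ai-voice-cloning/voices/MyVoice")   # target voice folder
dst.mkdir(parents=True, exist_ok=True)

for f in sorted(src.iterdir()):
    if f.suffix.lower() in {".mp3", ".ogg", ".wav", ".flac"}:
        out = dst / (f.stem + ".wav")
        # -ar sets the output sample rate, -ac 1 downmixes to mono
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(f), "-ar", "22050", "-ac", "1", str(out)],
            check=True,
        )
```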

So if 22k doesn't matter, remove any mention of sample rate from the wiki.

Questions

Question 1: can you please explain why I would want to train a voice? Voice cloning seems to work fine from just an input voice file (at least for English). So what is the benefit of the Training tab?

Question 2: can you document the prompt syntax, if any? How can we make it speak different lines with different emotions, and emphasize specific words?
My biggest issue for basic usage is that it seems to ignore question marks: questions like "What's up? You called me?" are pronounced as "What's up. You called me."

Question 3: my gens with an input voice that's just a 5-second clip of Joe Rogan came out sounding perfectly like him, I was amazed! But when I tried a 10-second clip of my own voice blandly reading Wikipedia into my mic, it didn't sound as good. I think it's mainly because the output had a British accent, but my real voice when I speak English has a French accent. Can anything be done to have English output use my own French accent?

Question 4: related to Q2, how do foreign languages work? I tried inputting a French audio sample and writing a prompt in French, but the output is wrong; it's clearly trying to (incorrectly) pronounce those words as if they were English words.

Question 5: could you document what Voice Chunk does? It's not mentioned in the Glossary. On a 2m15s reference clip, setting it to 1 requires 5.9GB VRAM. Setting it to 4 required 3.1GB. I guess this is splitting the audio input into multiple smaller/easier tasks. Does this have an effect on quality?


I'll most definitely read this later, but since I just woke up (and might fall back asleep), I'll just say:

> this rentry may appear a little disheveled as I note my new findings with TorToiSe. Please keep this in mind if the guide seems to shift a bit or sound confusing.

really highlights how neglected the wiki has been, as I've been dumping more and more of my focus into the web UI itself rather than it being just a document on how to voice clone. Just keep that in mind.

It needs some cleanup, yes, but my priorities (and free time) have been disheveled for the past two weeks or so.


Alright, I'm relatively awake now and should have some time.

I've combed through the wiki and cleaned it up to address the feedback; I guess there was a lot more outdated info than I remembered. I'll reiterate that those errors are remnants from when:

  • the documentation was just a rentry with info on using tortoise-tts, not docs for a full-blown software suite
  • assumptions were made that it required WAVs that favored a specific sample rate and encoding due to cursory glances at the original code (which were wrong assumptions, as later found out by my further digging and playing around)
  • TorToiSe didn't have any working utilities for training/finetuning new models, so the only generation was with the base model, as tortoise is a zero-shot speech synthesis tool.

Now for the questions:

> Question 1: can you please explain why I would want to train a voice? Voice cloning seems to work fine from just an input voice file (at least for English). So what is the benefit of the Training tab?

As a straightforward (English) speech synthesis tool, it works perfectly fine, even with a random voice. Some voices I ran through it also give very decent output. However, there's a seemingly clear point where some voices don't "synergize" well with the base model, as the base model was (seemingly) mostly trained on male audiobook narrators.

For other voices outside of that (non-English voices, a majority of female voices), finetuning comes in. If you can spare the compute time, you can most definitely get better output for a specific voice (if using a narrow, specific dataset) or for a better range in voices (if using a very, very large dataset). I suppose this is more of an analog to finetuning the Stable Diffusion model itself to one of the many variants out there.

In short, if the base model works fine with what you got, great, stick with it. If you're using a voice that's above and beyond what the base model likes, or in another language, then finetune the model for the voice.

> Question 2: can you document the prompt syntax, if any?

I honestly don't have any outside of what the [base tortoise-tts](https://github.com/neonbjb/tortoise-tts) offers.

If anything, it's like 11.AI where you just have to try and see what works for a prompt, and keep editing. A lot of the nuances aren't really able to be directly influenced, just suggested.

> How can we make it speak different lines with different emotions,

But unlike 11.AI, you can use prompt ~~editing~~ "engineering" with words in brackets. I genuinely can't come up with any examples, as it's just something you need to toy with: it will generate the sentence, redact what's in brackets from the output, but keep how the bracketed text shapes the delivery of the rest of the prompt.

> and emphasize specific words?
> My biggest issue for basic usage is that it seems to ignore question marks

If I remember right, in some of my testing I was able to get some emphasis with quotation marks. I feel I've had pretty good consistency with question marks, given that a lot of my testing is based on James Sunderland of Silent Hill 2 saying "Is that really you, Mary?", and it's reproduced relatively right.

> Question 3: my gens with an input voice that's just a 5-second clip of Joe Rogan came out sounding perfectly like him, I was amazed! But when I tried a 10-second clip of my own voice blandly reading Wikipedia into my mic, it didn't sound as good.

Funny, that's similar to what I felt with Joe Rogan: a conveniently perfect output there, but when I used a rather mundane yet very normal American-English voice clip, it turned out rather British.

My cope is that it's how the voice latents are computed, as there's some magic value for a given dataset (the voice, the sentences, how they're structured, etc., too many variables to really figure out a "good" value for given inputs) on what to chunk them at. I've tried playing around with it, since my underwhelming output happened after I added the voice chunk method.

> Question 4: related to Q2, how do foreign languages work?

In my very early testing with TorToiSe, I tried some very basic katakana and it was able to pronounce a bit of it right, but it immediately falls apart after a short while.

Finetuning though, with a small dataset of Japanese tremendously helped with it being able to reproduce actual Japanese, but with some quirks (that I have briefly documented in the Training section).

I imagine for even different accents, not just different languages, you'll need to finetune against those accents.

> Question 5: could you document what Voice Chunk does?

Right, I could have sworn I did, but I think it was on the rentry/old README and it managed to get overwritten with an older copy of either, so it probably got lost and I didn't realize.

> It's not mentioned in the Glossary.

The glossary was written well before I dove in and did a lot of improvements. And desu the glossary moreso covers technical terms for voice generation itself, rather than concepts I've added on top of TorToiSe.

`Voice Chunking` refers to how a voice's conditional latents (a brief model to capture a voice's traits) are computed. The original TorToiSe only used the first ~four seconds of a voice file. I've played around with other ways to remedy this, as the way it originally did it left some room to play with. Fast forward to how it ended up: letting the user decide how many pieces the combined voice clip is sliced into. This originally was to work around exhausting remaining VRAM, but some other results emerged.
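Roughly, the combined voice clip gets sliced into N pieces and the latents get computed per piece. A toy sketch of the idea (illustrative only, this is not the actual code in the repo; it assumes the `soundfile` package):

```
# Toy illustration of voice chunking, not the repo's implementation:
# slice the combined clip into N equal pieces so each piece is processed
# on its own (shorter pieces -> less VRAM per pass).
import soundfile as sf

def slice_into_chunks(path, num_chunks):
    audio, sr = sf.read(path)
    chunk_len = len(audio) // num_chunks  # any trailing remainder is ignored here
    chunks = [audio[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]
    print(f"{len(audio) / sr:.0f}s total -> {num_chunks} chunk(s) of ~{chunk_len / sr:.0f}s each")
    return chunks

# e.g. a 2m15s clip at 4 chunks gives ~34s per chunk instead of one 135s pass
slice_into_chunks("voices/MyTest/combined.wav", 4)
```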

> On a 2m15s reference clip, setting it to 1 requires 5.9GB VRAM. Setting it to 4 required 3.1GB. I guess this is splitting the audio input into multiple smaller/easier tasks.

Larger chunks (a smaller chunk count) tend to process faster at the cost of higher VRAM, yes, similar to batch sizes.

> Does this have an effect on quality?

Depending on how you slice the combined voice clip, you'll get better (or worse) generated output. It most definitely is a value you play around with, as it noticeably does affect the output, and there's always a "sweet" spot value. I assume this is entirely dependent on the content of the provided audio inputs (desu I'm having trouble trying to express the concept, I'll clarify it better later in the documentation).


Apologies if anything sounds off, my brain's been fried the past two weeks, and I feel like I'm getting close to frying it again today.


Thanks for getting back to me, especially since you seem so busy.

If you don't mind giving me wiki edit access, I'll help out on that front. I won't write about anything I don't understand (like anything related to the science or tools). Just beginner-oriented stuff. If you're worried about vandalism, can you give me write permissions to a single Beginner Walkthrough page and I'll stick to that?

Btw, while helping a friend who also has a 6GB card set it up, I found that it's essential to apply your suggested changes from #25 to be able to generate without running out of VRAM. So the Quick Test guide I wrote in the OP is incorrect; it won't work on a 6GB card without those changes.


Doesn't seem I can give specific access to the wiki; I believe I'd have to grant you access as a collaborator for this repo entirely.

I'll get around to rewriting most of it eventually.

> Btw, while helping a friend who also has a 6GB card set it up, I found that it's essential to apply your suggested changes from #25 to be able to generate without running out of VRAM.

Yeah, it's what reduces VRAM significantly enough to get it to train on my 2060. Recent installs should be able to use it seamlessly, but any setups from before that would need to set it up manually, as per the instructions there.


In that case, how about pasting what I wrote here into a new Beginner Walkthrough page in the short term? When you have free time you can improve the wiki. I promise it will help beginners; I'm a techie but new to AI, and I had no clue what to do (plus it didn't help that everything I tried caused a VRAM error and I didn't know if there were minimum requirements).

1. Follow the installation instructions as documented on https://git.ecker.tech/mrq/ai-voice-cloning/wiki/Installation
2. Launch start.bat and browse to http://127.0.0.1:7860 to access the web UI
3. Place the attached voice clip, or your own 5-second voice clip, under ai-voice-cloning/voices/MyTest/whatever.wav
4. Click Refresh Voice List in the bottom left of the UI, then select MyTest in the Voices dropdown
5. Under Prompt, write the text you would like to produce, such as "Chimpanzees are truly a beautiful animal"
6. Set preset to Ultra Fast
7. Click Generate
8. Wait for the results to appear in the UI, click Play to listen to them

The above is confirmed to work on a 1060 with 6GB VRAM, and may also work on a 4GB card. Note that VRAM requirements increase with the length of your data samples.

If the above test works for you, but future generations with other voice files fail with a CUDA memory error, you ran out of VRAM. Use shorter audio clips, or buy a better video card. For larger audio clips you may also reduce VRAM usage by increasing the number of Voice Chunks in the UI. Experiment as necessary. On Windows you can monitor your GPU's VRAM usage in Task Manager under the Performance tab; look for "Dedicated GPU memory".
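You could also mention a quick way to check from Python (just a suggestion; this assumes the CUDA build of PyTorch that the install already uses):

```
# Quick VRAM check from a Python prompt (assumes a CUDA-enabled PyTorch install).
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes of free/total device memory
    print(f"free: {free / 1024**3:.1f} GiB / total: {total / 1024**3:.1f} GiB")
    print(f"allocated by this process: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
else:
    print("No CUDA device detected")
```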

Attach the .wav attachment in the OP to the page.


So do we need 22 kHz or not? Am I fine to use 44 and 48? Also, does mono/stereo matter?

Thank you for this repo, btw. I was chugging along with tortoise fast and this is much much better.


> Feedback 2: Voice directory is confusing/wrong
> There's this part in the wiki under Collecting Samples: "After preparing your clips as WAV files at a sample rate of 22050 Hz, open up the tortoise-tts folder you're working in, navigate to the voices folder, create a new folder in whatever name you want, then dump your clips into that folder. While you're in the voice folder, you can take a look at the other provided voices. !NOTE!: Before 2023.02.10, voices used to be stored under .\tortoise\voices, but has been moved up one folder. Compatibily is maintained with the old voice folder, but will take priority."

I was curious about this too... is the /voices directory meant to be empty?

It's fine if so, I just wanted to check as I'm concerned other things might be missing too!

mrq closed this issue 2023-03-13 17:38:42 +07:00