Random voices appear at random when specific voice selected #42

Closed
opened 2023-03-13 20:37:33 +00:00 by thangalin · 3 comments

Running ./scripts/tortoise_tts.py -O /tmp/clips -v chloe < file.txt will sometimes produce audio clips using a randomly selected voice. Happens about 3 times out of every 75 clips.

Is it possible to eliminate the random voice selection altogether, to force users to select a voice?

Owner

> Running ./scripts/tortoise_tts.py

Beyond my scope, as:

  • I haven't touched that file at all.
  • I'm not supporting manually invoking tortoise-tts.
  • my priority lies with mrq/ai-voice-cloning for using this tortoise-tts as a dependency.
    • CLI support isn't implemented at the moment, as that's very low priority.

I can only assume that what you're hearing as a random voice is just how vanilla TorToiSe behaves (given the default model, simple variance can produce something that doesn't resemble the base model). If that's the case, you might want to lower the temperature to reduce variance.

I imagine you're trying to use it like 152334H/tortoise-tts-fast. Don't.

mrq closed this issue 2023-03-13 22:01:22 +00:00
Author

For context, I'm writing a novel with three narrators. The following script runs TorToiSe TTS on each plain-text chapter:

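function txt_to_tts() {
  pushd "$HOME/archives/tortoise-tts"

  for i in "$HOME"/dev/novel/chapter/??-*.txt; do
    F=$(basename "$i")
    # File names look like NN-voice.txt: characters 1-2 are the chapter,
    # characters 4 onward are the voice.
    VOICE=$(echo "${F%.*}" | cut -c 4-)
    CHAPTER=$(echo "${F%.*}" | cut -c -2)
    D=$(dirname "$i")
    CLIPS="$D/../audio/$CHAPTER-$VOICE"
    mkdir -p "$CLIPS"
    # Quote every expansion so paths containing spaces don't split into extra arguments.
    "$HOME/dev/tts/scripts/tortoise_tts.py" -p high_quality -O "$CLIPS" -v "$VOICE" < "$i"
  done

  popd
}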

In effect, I'm trying to instruct TorToiSe TTS to use a specific narrator for each chapter. Chapter 9 generates 121 WAV files. Of those, 28 are narrated in an unexpected random male voice and the remainder are in the desired narrator's female voice. IMO, this breaks the principle of least astonishment. (As a user, I've explicitly set the voice to use, but TorToiSe seems to pick a different voice at random.)
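The chapter-to-voice mapping in the script above depends entirely on the file name. As a minimal sketch, the same split can be done with shell parameter expansion instead of `cut`. The function name `parse_chapter_file` and the sample file name `09-chloe.txt` are illustrative assumptions, not part of the original script:

```bash
# parse_chapter_file PATH
# Prints "CHAPTER VOICE" for a file named like NN-voice.txt.
# Hypothetical helper for illustration only; runs in a subshell so
# its variable does not leak into the caller.
parse_chapter_file() (
  name=$(basename "$1")   # e.g. 09-chloe.txt
  name=${name%.*}         # strip the extension      -> 09-chloe
  # Everything before the first dash is the chapter,
  # everything after it is the voice.
  printf '%s %s\n' "${name%%-*}" "${name#*-}"
)
```

For example, `parse_chapter_file "$HOME/dev/novel/chapter/09-chloe.txt"` yields the chapter and voice that the loop passes to `-O` and `-v`. Unlike `cut -c`, this split does not assume the chapter number is exactly two characters wide.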

It looks like --cvvp-amount CVVP_AMOUNT can reduce the likelihood of multiple speakers, which is what I think is happening, but there's no documentation on the max value (0 being default). Is the float's range from 0 to 1?

I looked at the docs for temperature, and there are a few issues:

  • Which temperature controls whether a random voice is used versus the voice specified on the command-line? --temperature or --diffusion-temperature?
  • If --temperature controls the variance, what are its min/max/default values? The help only states "The softmax temperature of the autoregressive model."
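For readers unfamiliar with the term in that help text, here is a generic sketch of what a softmax temperature does: logits are divided by the temperature before the softmax, so values below 1 concentrate probability on the most likely choice and values above 1 flatten the distribution. This illustrates the general technique only. It is not TorToiSe's code, the function name `softmax_t` and the sample logits are made up, and it says nothing about the valid range or default of `--temperature`:

```bash
# softmax_t TEMPERATURE LOGIT...
# Prints softmax(logit / TEMPERATURE), one probability per line.
# Generic illustration; not taken from TorToiSe.
softmax_t() {
  t=$1
  shift
  printf '%s\n' "$@" | awk -v t="$t" '
    # Scale each logit by the temperature and track the maximum.
    { z[NR] = $1 / t; if (NR == 1 || z[NR] > max) max = z[NR] }
    END {
      # Subtract the maximum before exponentiating for numerical stability.
      for (i = 1; i <= NR; i++) { e[i] = exp(z[i] - max); sum += e[i] }
      for (i = 1; i <= NR; i++) printf "%.4f\n", e[i] / sum
    }'
}
```

Comparing `softmax_t 1.0 2 1 0` with `softmax_t 0.5 2 1 0` shows the top choice gaining probability as the temperature drops, which is the sense in which a lower temperature reduces variance.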
For context, I'm writing a novel having three narrators. The following script runs TorToiSe TTS on each plain text chapter: ``` function txt_to_tts() { pushd $HOME/archives/tortoise-tts for i in $HOME/dev/novel/chapter/??-*.txt; do F=$(basename $i); VOICE=$(echo ${F%.*} | cut -c 4-); CHAPTER=$(echo ${F%.*} | cut -c -2); D=$(dirname $i); CLIPS="$D/../audio/$CHAPTER-$VOICE"; mkdir -p $CLIPS; $HOME/dev/tts/scripts/tortoise_tts.py -p high_quality -O $CLIPS -v $VOICE < $i; done popd } ``` In effect, I'm trying to instruct TorToiSe TTS to use a specific narrator for each chapter. Chapter 9 generates 121 WAV files. Of those, 28 are narrated in an unexpected random male voice and the remainder are in the desired narrator's female voice. IMO, this breaks the principle of least astonishment. (As a user, I've explicitly set the voice to use, but TorToiSe seems to pick a different voice at random.) It looks like `--cvvp-amount CVVP_AMOUNT` can reduce the likelihood of multiple speakers, which is what I think is happening, but there's no documentation on the max value (0 being default). Is the float's range from 0 to 1? I looked at the docs for temperature, and there are a few issues: * Which temperature controls whether a random voice is used versus the voice specified on the command-line? `--temperature` or `--diffusion-temperature`? * If `--temperature` controls the variance, what are its min/max/default values? The help only states "The softmax temperature of the autoregressive model."
Author

After re-sampling the voice from a higher-quality audio source and tweaking the command-line arguments, the same voice is used consistently throughout the narration:

./scripts/tortoise_tts.py \
  --cvvp-amount 0.5 \
  -p high_quality \
  -O /tmp/audio/directory \
  -v cassandra < chapter/03.txt
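To reuse those arguments across chapters, they could be wrapped in a small function. This is a sketch under assumptions: `run_tts` and the `TTS_CMD` override are hypothetical names introduced here, and the `0.5` value is simply the one that worked in this case, not a documented recommendation:

```bash
# run_tts VOICE OUT_DIR   (reads the text to narrate from stdin)
# TTS_CMD is a hypothetical override so the call can be dry-run,
# e.g. with `echo`, without invoking the real script.
run_tts() {
  "${TTS_CMD:-./scripts/tortoise_tts.py}" \
    --cvvp-amount 0.5 \
    -p high_quality \
    -O "$2" \
    -v "$1"
}
```

A call such as `run_tts cassandra /tmp/audio/directory < chapter/03.txt` is then equivalent to the command above, and `TTS_CMD=echo run_tts cassandra /tmp/out < /dev/null` prints the arguments instead of running the synthesis.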
Reference: mrq/tortoise-tts#42