Is it possible to introduce voice from file instead of mic? #315


tortoise · 2023-07-26T15:30:35Z

tortoise commented

2023-07-26 15:30:35 +00:00

This is really a great contribution.

Is there any way to take the voice sample from an MP3 or WAV file?


xasima commented

2023-07-27 08:00:37 +00:00

Sorry, I can only comment from a user's perspective, but it should presumably work like this:

- run ./start.sh with the default backend (tortoise)
- open http://127.0.0.1:7860/
- go to the Utilities tab, specify a voice name, load a WAV file of the voice sample, and click Import Voice
- go back to the Generate tab and click Refresh Voice List
- use the new voice that appears, OR leave the voice field empty and specify a voice at the start of the line with JSON, e.g. {"voice": "random"}

The WAV seems to need a sample rate of 22050 Hz (see https://git.ecker.tech/lightmare/tortoise-tts) and at least around 10 seconds of data. It should be clearly spoken, with no background noise, only one speaker, and audio that ends where a sentence ends. Different speeds and variety help if you later train / finetune (https://git.ecker.tech/mrq/ai-voice-cloning/issues/307) or perform emotion transfer from other voices (https://github.com/neonbjb/tortoise-tts/issues/16). If different emotions are to be used, you might need to express them with explicit word mentions so they can later be mapped to the emotion prompts.
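Not something from the project itself, but a quick way to sanity-check a sample before importing it could look like the sketch below. It uses only Python's standard `wave` module, so it reads uncompressed PCM WAV only (an MP3 would have to be converted first). The function name `check_voice_sample` is made up for illustration, the 22050 Hz and 10 second targets are just the numbers mentioned above, and the mono check is my own assumption, not a documented requirement.

```python
import wave

# Targets taken from the comment above; treat them as assumptions, not hard limits.
EXPECTED_RATE_HZ = 22050
MIN_SECONDS = 10.0


def check_voice_sample(path, expected_rate=EXPECTED_RATE_HZ, min_seconds=MIN_SECONDS):
    """Return a list of problems found in a PCM WAV file (empty list if none)."""
    problems = []
    # Read only the header information we need; no audio data is decoded.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)

    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        # Assumption: a single-speaker sample is stored as mono.
        problems.append(f"{channels} channels found, mono is assumed here")
    if duration < min_seconds:
        problems.append(f"only {duration:.1f} s of audio, want at least {min_seconds:.0f} s")
    return problems
```

Calling `check_voice_sample("my_voice.wav")` (the file name is a placeholder) returns an empty list when nothing looks off. It cannot judge the things that matter most by ear, such as background noise or whether the clip ends on a sentence boundary.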
