Is it possible to introduce voice from file instead of mic? #315

Closed
opened 2023-07-26 15:30:35 +00:00 by tortoise · 2 comments

This is really a great contribution.

Is there any way to take a voice sample from an MP3 or WAV file instead?

Sorry, I can only comment from user experience, but it can presumably be done as follows:

  • run ./start.sh with the default backend (tortoise)
  • open http://127.0.0.1:7860/
  • go to the utilities tab, specify a voice name, load a WAV file of the voice example, then click import voice
  • return to the Generate tab and click Refresh Voice List
  • use the new voice that appears, OR leave the voice field empty and specify a voice at the start of the line with JSON, e.g. {"voice": "random"}

The WAV file seems to need a sample rate of 22050 Hz (see https://git.ecker.tech/lightmare/tortoise-tts) and at least around 10 seconds of audio. It should be clearly spoken, with no background noise and only one speaker, and the audio should end where a sentence ends. If different emotions are to be used, you might need to express them with an explicit word mention in the sample so they can later be mapped to the emotion prompts. Different speed and variety may matter if you later train / finetune (#307) or perform emotion transfer from other voices (https://github.com/neonbjb/tortoise-tts/issues/16).
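Before importing, it may help to check a file against the two numeric guidelines above. The following is only a minimal sketch using Python's standard wave module: the helper name check_voice_sample and the exact thresholds are assumptions taken from this comment, not anything the web UI itself enforces, and the wave module only reads uncompressed PCM WAV files (an MP3 would have to be converted first).

```python
import wave

# Guideline values taken from the comment above; treat them as assumptions,
# not as limits enforced by the tool.
TARGET_RATE_HZ = 22050
MIN_DURATION_S = 10.0


def check_voice_sample(path):
    """Return a list of warnings for a PCM WAV file (hypothetical helper)."""
    warnings = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        # Duration in seconds = number of frames / frames per second.
        duration = wav.getnframes() / rate
    if rate != TARGET_RATE_HZ:
        warnings.append(f"sample rate is {rate} Hz, expected {TARGET_RATE_HZ} Hz")
    if duration < MIN_DURATION_S:
        warnings.append(
            f"duration is {duration:.1f} s, expected at least {MIN_DURATION_S:.0f} s"
        )
    return warnings
```

Calling check_voice_sample("my_voice.wav") (a placeholder filename) returns an empty list when both guidelines are met. The other points, such as a single speaker and no background noise, still have to be judged by ear.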

Author

Thank you so much. Great help. So nice of you.

Reference: mrq/ai-voice-cloning#315