web interface #397

Closed
opened 2023-09-21 15:26:59 +00:00 by tortoise · 3 comments

This is indeed a very nice module, and I have really enjoyed using it.

As far as I understand, it also adds emotions via the web interface. Could I have a diagram of how the emotion workflow works? Or which model is the main source for adding emotions: is it CLIP, CLAP, or something else?
The major cloning takes place with CLVP and diffusion.

If you can guide me, it would help me understand the workflow easily.

Owner

The "emotion" control simply prepends `[I am really {emotion}], ` to the text prompt. If I remember right from the original [neonbjb/tortoise-tts](https://github.com/neonbjb/tortoise-tts) repo, it "leverages" the AR model's ability to derive emotion from the text prompt, and then redacts that text with wav2vec2 alignment at the end of the inference call.

I never found it useful enough in my testing, but it was a feature touted in the original, and when porting all the available features from `do_tts.py` into a web UI, it was carried over for feature completeness.

Additionally, you can make use of the text redaction to try and influence the output yourself by wrapping text in `[]`: it will influence the final output but will be removed from the final clip.
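The mechanism described above can be sketched in a few lines. This is a minimal illustration of the prompt-prefixing behavior, assuming the bracket format quoted above; the function name is hypothetical and not part of the actual repo's API:

```python
def apply_emotion(text: str, emotion: str = None) -> str:
    """Prepend the bracketed emotion cue to the prompt.

    Bracketed text steers the AR model's delivery, but is later
    stripped from the generated audio via wav2vec2 alignment,
    so only `text` is actually spoken in the final clip.
    """
    if emotion:
        return f"[I am really {emotion}], {text}"
    return text

# Example: the AR model sees the cue, the final clip contains only the text.
prompt = apply_emotion("Hello there.", "angry")
# prompt == "[I am really angry], Hello there."
```

The same redaction path is what lets manually wrapping any phrase in `[]` influence delivery without it appearing in the output audio.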

Author

Thank you. Really nice explanation.

Author

Indeed a very helpful reply. Not only that: it's a great structure and a nice interface overall.

Reference: mrq/ai-voice-cloning#397