web interface #397

Closed
opened 2023-09-21 15:26:59 +00:00 by tortoise · 3 comments

This is indeed a very nice module, and I have really enjoyed using it.

As far as I understand, it also adds emotions via the web interface. Could I have a diagram of how the emotion workflow works? Or which model is the main source for adding emotions: is it CLIP, CLAP, or something else?
The major cloning takes place with CLVP and diffusion.

If you can guide me, it would help me understand the workflow easily.

Owner

The "emotion" control simply prepends `[I am really {emotion}], ` to the text prompt. If I remember right from the original [neonbjb/tortoise-tts](https://github.com/neonbjb/tortoise-tts) repo, it "leverages" the AR model's ability to derive emotion from the text prompt, and then redacts that text with wav2vec2 alignment at the end of the inference call.

I never found it useful enough in my testing, but it was a feature touted in the original, and when porting all the available features from `do_tts.py` into a web UI, it was carried over for feature completeness.

Additionally, you can make use of the text redaction to try and influence the output yourself by wrapping text in `[]`: it will influence the final output but will be removed from the final clip.
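The mechanism described above can be sketched in a few lines. This is a minimal illustration of the prompt-prefixing behavior, assuming the bracket format quoted above; the function name is hypothetical and not part of the actual repo's API:

```python
def apply_emotion(text: str, emotion: str = None) -> str:
    """Prepend the bracketed emotion cue to the prompt.

    Bracketed text steers the AR model's delivery, but is later
    stripped from the generated audio via wav2vec2 alignment,
    so only `text` is actually spoken in the final clip.
    """
    if emotion:
        return f"[I am really {emotion}], {text}"
    return text

# Example: the AR model sees the cue, the final clip contains only the text.
prompt = apply_emotion("Hello there.", "angry")
# prompt == "[I am really angry], Hello there."
```

The same redaction path is what lets manually wrapping any phrase in `[]` influence delivery without it appearing in the output audio.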

Author

Thank you. Really nice explanation.

Author

Indeed a very helpful reply. Not only that: it's a great structure and a nice interface overall.

Reference: mrq/ai-voice-cloning#397