Google Translatotron 3: Speech to Speech Translation with Monolingual Data #277

Open
opened 2023-06-21 19:33:53 +00:00 by helloitsme · 1 comment

Abstract - https://arxiv.org/abs/2305.17547
Paper - https://arxiv.org/pdf/2305.17547.pdf
Website with examples - https://google-research.github.io/lingvo-lab/translatotron3/

Translatotron 3 is an unsupervised direct speech-to-speech translation model. It uses an unsupervised word-embedding mapping technique and a back-translation training procedure. Unlike previous approaches, the proposed approach can implicitly preserve some para-/non-linguistic characteristics of the source speech. We demonstrate that the proposed approach improves upon an unsupervised cascade baseline (up to a 10.51 BLEU increase) and approaches the performance of supervised systems on the CVSS dataset (a 1.95 BLEU gap). This suggests that Translatotron 3 is an effective approach for unsupervised S2ST that is able to retain important information from the source speech in the target translation.
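Back-translation is the core trick that lets the model train on monolingual data only: for each monolingual target-language sentence, a synthetic source-language version is generated by the reverse-direction model, yielding pseudo-parallel pairs. A toy word-level sketch of the idea (tiny hand-written lexicons stand in for the paper's learned embedding mappings; every name here is illustrative, not the paper's code):

```python
# Hypothetical tiny lexicons standing in for learned unsupervised
# embedding mappings between language A and language B.
a_to_b = {"hello": "hola", "world": "mundo"}
b_to_a = {v: k for k, v in a_to_b.items()}

def translate(tokens, lexicon):
    """Word-by-word 'translation' via a lexicon; unknown words pass through."""
    return [lexicon.get(t, t) for t in tokens]

def back_translate(monolingual_b):
    """Build synthetic (A, B) training pairs from monolingual B sentences."""
    pairs = []
    for sentence in monolingual_b:
        pseudo_a = translate(sentence, b_to_a)  # synthetic source side
        pairs.append((pseudo_a, sentence))      # real target side
    return pairs

pairs = back_translate([["hola", "mundo"]])
# pairs can now train an A -> B model even though no parallel data existed
```

In the real system both directions are trained jointly and the "translation" happens on speech/embedding representations rather than word lists, but the pseudo-pair construction is the same shape.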

=============

Essentially, a sample in one language is transcribed, translated, and cloned into the target language in the same speaker's voice. This is very cool! I wonder if it could be molded into tortoise somehow? I tested an implementation via clonedub.com, but the service seems to only support short clips. However, the output was very clean and usable.
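The pipeline described here (and the "cascade baseline" the abstract compares against) chains three separate stages: ASR, text translation, then voice-cloning TTS. A minimal stub sketch of that chaining, where every stage is a placeholder (none of these are real APIs; a tortoise-based version would swap in the actual models):

```python
def transcribe(audio):
    """ASR stub: real code would run a speech recognizer here."""
    return "hello world"

def translate_text(text):
    """MT stub: real code would call a text translation model."""
    return {"hello world": "hola mundo"}.get(text, text)

def synthesize(text, speaker_ref):
    """Voice-cloning TTS stub (e.g. tortoise-style, conditioned on a reference clip)."""
    return f"<audio of '{text}' in voice {speaker_ref}>"

def cascade_s2st(audio, speaker_ref):
    """Cascade speech-to-speech translation: ASR -> MT -> voice-cloning TTS."""
    return synthesize(translate_text(transcribe(audio)), speaker_ref)
```

Translatotron 3's point is to replace this three-stage chain with one direct model, which is why it can preserve para-linguistic cues that a text bottleneck would discard.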

Comment by helloitsme (author):

Actually, I'm not sure what tech is underneath clonedub. I guess Meta and ElevenLabs both have similar capabilities now.

Reference: mrq/ai-voice-cloning#277