Google Translatotron 3: Speech to Speech Translation with Monolingual Data #277

helloitsme · 2023-06-21T19:33:53Z

helloitsme commented

2023-06-21 19:33:53 +00:00

Abstract - https://arxiv.org/abs/2305.17547
Paper - https://arxiv.org/pdf/2305.17547.pdf
Website with examples - https://google-research.github.io/lingvo-lab/translatotron3/

Translatotron 3 is an unsupervised direct speech-to-speech translation model. It
uses the unsupervised embedding word mapping technique and a back-translation training procedure.
Unlike the previous approaches, the proposed approach can implicitly preserve some elements of
para-/non-linguistic characteristics in the source speech. We demonstrated that the proposed approach
improved upon the unsupervised cascade baseline (up to 10.51 increase in BLEU) and approached
the performance of supervised systems on the CVSS dataset (by 1.95 gap in BLEU). This suggests
that Translatotron 3 is an effective approach for unsupervised S2ST that is able to retain important
information from the source speech in the target translation.
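
For anyone wondering what "embedding mapping" means in practice, here is a minimal sketch of the orthogonal Procrustes step that embedding-alignment methods commonly use. This is not taken from the paper: the function name `procrustes_mapping`, the toy data, and the assumption that matched embedding pairs are already known are all illustrative. A fully unsupervised method would first have to induce those pairs.

```python
import numpy as np

def procrustes_mapping(src, tgt):
    """Return the orthogonal matrix W that minimises ||src @ W - tgt||_F.

    src and tgt are (n_words, dim) arrays whose rows are assumed to be
    matched pairs (an assumption of this sketch, not of the paper).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy data: the "source" embeddings are a hidden rotation of the "target" ones.
rng = np.random.default_rng(0)
dim = 8
tgt_emb = rng.normal(size=(50, dim))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_emb = tgt_emb @ hidden_rotation.T  # so src_emb @ hidden_rotation == tgt_emb

w = procrustes_mapping(src_emb, tgt_emb)
mapped = src_emb @ w  # source embeddings moved into the target space
```

In this noise-free toy case the recovered `w` matches the hidden rotation, so `mapped` lines up with `tgt_emb`. Real embeddings would only align approximately.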

=============

Essentially, speech in one language is translated directly into speech in the target language, keeping some characteristics of the original speaker's voice. This is very cool! I wonder if it could be integrated into tortoise somehow? I tested an implementation via clonedub.com, but the service seems to only support short clips. However, the output was very clean and usable.

helloitsme commented

2023-06-23 06:35:54 +00:00

Actually, I'm not sure what tech clonedub uses underneath. I guess Meta and 11labs both have similar capabilities now.