Google Translatotron 3: Speech to Speech Translation with Monolingual Data #277
Reference: mrq/ai-voice-cloning#277
Abstract - https://arxiv.org/abs/2305.17547
Paper - https://arxiv.org/pdf/2305.17547.pdf
Website with examples - https://google-research.github.io/lingvo-lab/translatotron3/
The paper presents Translatotron 3, an unsupervised direct speech-to-speech translation (S2ST) model. It uses an unsupervised embedding word mapping technique and a back-translation training procedure. Unlike previous approaches, it can implicitly preserve some para-/non-linguistic characteristics of the source speech. The authors report that it improved upon the unsupervised cascade baseline (by up to 10.51 BLEU) and approached the performance of supervised systems on the CVSS dataset (a gap of 1.95 BLEU). This suggests that Translatotron 3 is an effective approach for unsupervised S2ST that is able to retain important information from the source speech in the target translation.
=============
Essentially, a speech sample in one language is translated directly into speech in the target language, and the output keeps some characteristics of the original speaker's voice. This is very cool! I wonder if it could be integrated into tortoise somehow? I tested what seems to be a similar capability via clonedub.com, but the service appears to support only short clips. The output was very clean and usable, though.
Actually, I'm not sure what tech clonedub uses underneath. I guess Meta and ElevenLabs (11labs) both have similar capabilities now.