Feature Request: Use WhisperX instead of Whisper for preparing dataset #45

Closed
opened 2023-02-27 08:42:13 +00:00 by hman360 · 2 comments

I tried using the Prepare Dataset option, and it does a somewhat poor job with timestamps on the generated dataset: the text outputs don't quite match the audio once it's split up. I tried modifying the code to use WhisperX instead, and it seemed to do a much better job, although I still had to add a window of about 0.1s on either side of each split audio clip for more accuracy. It still misses a little of the audio, but the majority of the audio/text pairs are much more accurate time-wise.
The WhisperX repo is here: https://github.com/m-bain/whisperX
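
For anyone wanting to try the 0.1s window idea, here is a rough sketch of the padding step only. It is not the code from this repo or from WhisperX; `pad_segment` and `slice_samples` are hypothetical helper names, and it assumes the audio is already loaded as a flat sequence of samples with segment start/end times in seconds.

```python
def pad_segment(start, end, total_duration, pad=0.1):
    """Widen [start, end] by `pad` seconds on each side, clamped to the clip length.

    Hypothetical helper for illustration; the 0.1s default mirrors the window
    described above and is not a tuned value.
    """
    return max(0.0, start - pad), min(total_duration, end + pad)


def slice_samples(samples, sample_rate, start, end, pad=0.1):
    """Return the samples covering the padded segment."""
    total_duration = len(samples) / sample_rate
    padded_start, padded_end = pad_segment(start, end, total_duration, pad)
    # round() rather than int() so float error does not drop a sample
    first = round(padded_start * sample_rate)
    last = round(padded_end * sample_rate)
    return samples[first:last]
```

Clamping matters for the first and last segments, where the padded window would otherwise run past the start or end of the source audio.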


Very impressed with WhisperX. Added a PR: https://git.ecker.tech/mrq/ai-voice-cloning/pulls/67
mrq closed this issue 2023-03-07 02:49:55 +00:00
Owner

It's implemented, but with headaches.

Reference: mrq/ai-voice-cloning#45