Feature Request: Use WhisperX instead of Whisper for preparing dataset #45


hman360 · 2023-02-27T08:42:13Z

hman360 commented

2023-02-27 08:42:13 +00:00

I tried using the Prepare Dataset option, and it does a somewhat poor job with timestamps on the generated dataset: the text outputs don't quite match the audio once it's split up. I tried modifying the code to use WhisperX instead, and it seemed to do a much better job, although I still had to add a window of about 0.1s on either side of each split audio clip for more accuracy. It's still slightly off on some clips, but the majority of the audio/text pairs are much more accurate time-wise.
The WhisperX repo is here: https://github.com/m-bain/whisperX
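
The 0.1s window described above could be sketched roughly like this. It is a minimal illustration, not the code from the actual modification: it assumes segments arrive as dicts with `start`, `end`, and `text` keys (times in seconds), and the names `pad_segments` and `to_sample_range` are hypothetical helpers made up for this example.

```python
def pad_segments(segments, audio_duration, pad=0.1):
    """Widen each segment by `pad` seconds on both sides.

    Assumed input shape (hypothetical): a list of dicts with "start" and
    "end" in seconds plus any other keys such as "text".
    Padded times are clamped to [0, audio_duration].
    """
    padded = []
    for seg in segments:
        start = max(0.0, seg["start"] - pad)
        end = min(audio_duration, seg["end"] + pad)
        # Copy the segment so the caller's data is left untouched.
        padded.append({**seg, "start": start, "end": end})
    return padded


def to_sample_range(start, end, sample_rate):
    """Convert a time range in seconds to integer sample indices for slicing."""
    return int(round(start * sample_rate)), int(round(end * sample_rate))


if __name__ == "__main__":
    segments = [
        {"start": 0.05, "end": 1.0, "text": "first line"},
        {"start": 1.2, "end": 2.95, "text": "second line"},
    ]
    for seg in pad_segments(segments, audio_duration=3.0):
        lo, hi = to_sample_range(seg["start"], seg["end"], sample_rate=22050)
        print(seg["text"], lo, hi)
```

Note that neighbouring padded segments can overlap slightly when the gap between them is shorter than twice the padding; whether that matters depends on how the clips are used afterwards.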


yqxtqymn commented

2023-03-06 02:02:56 +00:00

Very impressed with WhisperX. Added a PR: https://git.ecker.tech/mrq/ai-voice-cloning/pulls/67


mrq closed this issue

2023-03-07 02:49:55 +00:00

mrq commented

2023-03-07 02:50:10 +00:00

It's implemented, but with headaches.
