Transcription and diarization (speaker identification) - Easy dataset building? #319
Reference: mrq/ai-voice-cloning#319
Forgive my ignorance, but I've been looking into how it might be possible to separate speakers from one audio file and came across this: https://github.com/openai/whisper/discussions/264
I'm not sure if this is what I think it is, and whether it can be implemented for "easy" dataset building rather than manual editing of audio datasets... which is, let's face it, a big pet peeve.
It's supported by whisperx, see notes on Speaker Diarization in the README.
Thanks, but as far as I can understand, this just labels speakers on the transcript rather than slicing the audio into different speakers.
Do you know how to code? It's relatively simple to run whisperx and then loop through the results, grabbing only the desired speaker, e.g.:
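A minimal sketch of that loop, assuming the diarized result has whisperx's usual shape (a dict whose `"segments"` list carries `"start"`, `"end"`, `"speaker"`, and `"text"` keys); the hand-made `result` below is only illustrative:

```python
# Keep only one speaker's segments from a whisperx-style diarized result.
def segments_for_speaker(result, speaker):
    return [
        seg for seg in result["segments"]
        if seg.get("speaker") == speaker
    ]

# Example with a hand-made result in the expected shape:
result = {
    "segments": [
        {"start": 0.0, "end": 2.1, "speaker": "SPEAKER_00", "text": "Hello."},
        {"start": 2.1, "end": 4.0, "speaker": "SPEAKER_01", "text": "Hi there."},
        {"start": 4.0, "end": 6.5, "speaker": "SPEAKER_00", "text": "How are you?"},
    ]
}
wanted = segments_for_speaker(result, "SPEAKER_00")
for seg in wanted:
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f}: {seg["text"]}')
```

The start/end times of the kept segments are then what you'd use to slice the audio.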
Unfortunately not. I've been trying for hours to get ChatGPT to create a script for me, but it's one issue after another.
```python
import whisperx
import torch
import soundfile as sf
import numpy as np

def move_to_device(obj, device):
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    elif isinstance(obj, torch.nn.Module):
        return obj.to(device)
    else:
        return obj

def split_audio_by_speakers(audio_file, result, output_folder):
    # whisperx.load_audio returns a 16 kHz numpy array, not an (audio, sr) pair
    audio = whisperx.load_audio(audio_file)
    # ... (rest of the function not included in the original post)

if __name__ == "__main__":
    device = "cuda"
    audio_file = "sample.wav"
    batch_size = 16
    compute_type = "float16"
```
This is the closest I got to doing it, but it always gives me this error:
```
Traceback (most recent call last):
  File "H:\whisperX\diarization_script.py", line 63, in <module>
    diarize_model = whisperx.DiarizationPipeline(device="cpu")
  File "H:\whisperX\whisperx\diarize.py", line 16, in __init__
    self.model = Pipeline.from_pretrained(model_name, use_auth_token=use_auth_token).to(device)
  File "H:\anaconda3\envs\whisperx\lib\site-packages\pyannote\pipeline\pipeline.py", line 100, in __getattr__
    raise AttributeError(msg)
AttributeError: 'SpeakerDiarization' object has no attribute 'to'
```
Long story short, I seem to have an issue with diarization not working that I need to sort out first.
I've been considering releasing a full dataset preparation pipeline for tortoise but for now you can borrow this.
https://github.com/Ado012/TTSDataPrep/blob/main/SpeechRecognizerWXDiarizerExample.py
I recommend learning some basic python so you actually know how to deal with error messages. The machines haven't yet fully replaced programmers and hopefully never will for my sake.
Thanks. Btw I managed to get ChatGPT to get me something that works :
https://github.com/rikabi89/diarization_script/blob/main/diarization_script.py - The issue is that it doesn't work well: the speakers overlap, so it's not separating them properly. Not sure why.
Anyway thanks for yours, I will test it out.
Keep in mind this extracts only the primary speaker from the clip; it doesn't segment out all the speakers, although it can with minor modification. That should work well enough if you're simply looking to create a dataset of a particular person.
That would be awesome if you could suggest the modification I could try.
Did you successfully get a transcript from it? If not, do so, so you have a baseline for the output to expect. If so, then simply replace primarySpeaker with your desired speaker where the script loops through and grabs lines for the given speaker. I believe whisperx labels them something to the effect of SPEAKER_01.
But an even easier option is just to leave it as is and find audio where your desired character is the main speaker.
It seems this iteration of the code didn't produce/dump a transcript, but it does segment the audio into Speaker 1 and Speaker 2. I think this works best for a large dataset rather than a small one. Anyway, thanks for your help. I'll try to mess around with this further, but I think for a large audio set it does a decent job.
The code I provided produces a transcript into the RawTranscripts folder from a wav file in the PreProcessedAudio folder. You can then use the transcript to extract the relevant audio. I assumed you just needed help with the diarization part, so this file alone just produces a diarized transcript and isn't a complete dataset preparation pipeline. But if you don't mind getting your feet a little wet, you can use pydub and a little Python to extract the segments. This will be enough to get started on a small dataset.
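The comment above suggests pydub; as an alternative sketch that avoids extra installs, the same extraction can be done with only the standard-library `wave` module (the function name and file paths here are made up):

```python
import wave

def extract_segment(in_path, out_path, start_s, end_s):
    """Copy the [start_s, end_s] slice of a WAV file to a new WAV file."""
    with wave.open(in_path, "rb") as src:
        rate = src.getframerate()
        # seek to the start frame, then read just the slice we want
        src.setpos(int(start_s * rate))
        frames = src.readframes(int((end_s - start_s) * rate))
        params = src.getparams()
    with wave.open(out_path, "wb") as dst:
        # reuse channel count / sample width / rate; the frame count
        # in the header is patched automatically when the file closes
        dst.setparams(params)
        dst.writeframes(frames)
```

With pydub the equivalent is a simple millisecond slice, e.g. `AudioSegment.from_wav(path)[start_ms:end_ms]`.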
I was also looking for something like this, and thanks to you guys, I was able to achieve what I was going for. I slightly modified Fresh's code, and also modified snippets from the colab linked here: https://github.com/openai/whisper/discussions/264
I ended up with two scripts. The first one is a modification of Fresh's script that outputs every speaker's dialogue into a txt file, in this format:
16.483.17.003. SPEAKER_00: "This is a test message."
31.745.33.066. SPEAKER_01: "This is a message from another speaker."
The numbers separated by the periods are meant to represent:
`start_time_seconds.start_time_milliseconds.end_time_seconds.end_time_milliseconds.`
I outputted the data this way so that it's easier for me to parse with the second script, which in turn parses the txt file and strips out all of the audio for the specified speaker. It separates snippets by 500 ms (you may want to adjust it for your needs) and exports the results to one wav file. Using the scripts should be pretty simple, but please let me know if there's anything I can clarify.
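A line in that format can be parsed with a small helper like the sketch below; the regex and function name are assumptions based only on the two sample lines above:

```python
import re

# Matches "start_s.start_ms.end_s.end_ms. SPEAKER_XX: "text""
LINE = re.compile(r'^(\d+)\.(\d+)\.(\d+)\.(\d+)\.\s+(\S+):\s+"(.*)"$')

def parse_line(line):
    """Return (start_seconds, end_seconds, speaker, text), or None if no match."""
    m = LINE.match(line.strip())
    if not m:
        return None
    s_s, s_ms, e_s, e_ms, speaker, text = m.groups()
    start = int(s_s) + int(s_ms) / 1000
    end = int(e_s) + int(e_ms) / 1000
    return start, end, speaker, text

print(parse_line('16.483.17.003. SPEAKER_00: "This is a test message."'))
```

Filtering the parsed lines by speaker then gives the (start, end) pairs to cut from the audio.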
Edit: I'm not sure what went wrong with the formatting. This is my first time using this site, sorry about that.
I've uploaded them to pastebin:
Script 1: https://pastebin.com/2JfY4PWr
Script 2: https://pastebin.com/ipWw8YCD
You will have to install pydub if you don't already have it. Hope this helps!