Transcription and diarization (speaker identification) - Easy dataset building? #319

Open
opened 2023-07-30 15:58:39 +00:00 by SyntheticVoices · 14 comments

Forgive my ignorance, but I've been looking into how it might be possible to separate speakers from one audio file and came across this: https://github.com/openai/whisper/discussions/264

I am not sure if this is what I think it is, and whether it can be implemented for "easy" dataset building rather than manual editing of audio datasets... which is, let's face it, a big pet peeve.



It's supported by whisperx, see notes on [Speaker Diarization](https://github.com/m-bain/whisperX/blob/main/README.md#speaker-diarization) in the README.


Thanks, but as far as I can understand, this just labels speakers on the transcript rather than slicing the audio into separate speakers.



Do you know how to code? It's relatively simple to run whisperx and then loop through the results, grabbing only the desired speaker, e.g.:



```python
diarize_model = whisperx.DiarizationPipeline(use_auth_token='YOUR_TOKEN', device=device)
diarize_segments = diarize_model(audioFile)
diarizedData = whisperx.assign_word_speakers(diarize_segments, result)

# badly formatted, incomplete loop
while count < dataSize:
    try:
        currentSpeaker = diarizedData["segments"][count]["speaker"]

        if currentSpeaker == desiredSpeaker:
```


Unfortunately not. I've been trying for hours to get ChatGPT to create a script for me, but it's one issue after another.


```python
import whisperx
import torch
import soundfile as sf
import numpy as np

def move_to_device(obj, device):
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    elif isinstance(obj, torch.nn.Module):
        return obj.to(device)
    else:
        return obj

def split_audio_by_speakers(audio_file, result, output_folder):
    audio, sr = whisperx.load_audio(audio_file)

    # Initialize dictionary to store audio for each speaker
    speaker_audio = {}

    for segment in result["segments"]:
        start_time = segment["start"]
        end_time = segment["end"]
        speaker_id = segment["speaker_id"]

        # Convert start and end times from seconds to samples
        start_sample = int(start_time * sr)
        end_sample = int(end_time * sr)

        # Get the audio segment for the current speaker
        if speaker_id in speaker_audio:
            speaker_audio[speaker_id] = np.concatenate(
                (speaker_audio[speaker_id], audio[start_sample:end_sample])
            )
        else:
            speaker_audio[speaker_id] = audio[start_sample:end_sample]

    # Save each speaker's audio to separate files
    for speaker_id, speaker_audio_data in speaker_audio.items():
        output_file = f"{output_folder}/speaker_{speaker_id}.wav"
        sf.write(output_file, speaker_audio_data, sr)

    print("Audio splitting completed.")

if __name__ == "__main__":
    device = "cuda"
    audio_file = "sample.wav"
    batch_size = 16
    compute_type = "float16"

    # 1. Transcribe with original whisper (batched)
    model = whisperx.load_model("large-v2", device, compute_type=compute_type)

    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=batch_size)
    print(result["segments"])  # before alignment

    # 2. Align whisper output
    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
    print(result["segments"])  # after alignment

    # 3. Diarization (Speaker Segmentation)
    diarize_model = whisperx.DiarizationPipeline(device="cpu")
    diarize_model = move_to_device(diarize_model, device)
    diarize_segments = diarize_model(audio_file)
    result = whisperx.assign_word_speakers(diarize_segments.segments, result)

    # Split the audio based on speaker IDs
    output_folder = "output_speaker_audio"
    split_audio_by_speakers(audio_file, result, output_folder)
```

This is the closest I got to doing it, but it always gives me this error:

```
Traceback (most recent call last):
  File "H:\whisperX\diarization_script.py", line 63, in <module>
    diarize_model = whisperx.DiarizationPipeline(device="cpu")
  File "H:\whisperX\whisperx\diarize.py", line 16, in __init__
    self.model = Pipeline.from_pretrained(model_name, use_auth_token=use_auth_token).to(device)
  File "H:\anaconda3\envs\whisperx\lib\site-packages\pyannote\pipeline\pipeline.py", line 100, in __getattr__
    raise AttributeError(msg)
AttributeError: 'SpeakerDiarization' object has no attribute 'to'
```
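For what it's worth, that AttributeError usually points at the pyannote.audio install rather than the script logic: older pyannote releases did not provide `Pipeline.to(device)`, which is exactly the call whisperx makes in `diarize.py`. A minimal sketch of how the pipeline is meant to be built, assuming a current whisperx/pyannote install and a valid Hugging Face token (`YOUR_HF_TOKEN` is a placeholder):

```python
import torch
import whisperx

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the pipeline on the target device directly; the custom move_to_device()
# helper above is not needed. The token is only required so pyannote can
# download its models (YOUR_HF_TOKEN is a placeholder, not a real token).
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model("sample.wav")
```

If the same AttributeError still appears with that call, reinstalling whisperx (which pins a compatible pyannote.audio version) is probably the next thing to try.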


Long story short, I seem to have an issue with diarization not working that I need to sort out first.


I've been considering releasing a full dataset preparation pipeline for tortoise but for now you can borrow this.

https://github.com/Ado012/TTSDataPrep/blob/main/SpeechRecognizerWXDiarizerExample.py

I recommend learning some basic Python so you actually know how to deal with error messages. The machines haven't fully replaced programmers yet, and hopefully never will, for my sake.



Thanks. Btw, I managed to get ChatGPT to give me something that works:

https://github.com/rikabi89/diarization_script/blob/main/diarization_script.py - The issue is that it doesn't work well: there is overlap between the speakers, so it's not separating them properly. Not sure why.

Anyway, thanks for yours; I will test it out.



Keep in mind this extracts only the primary speaker of the clip; it doesn't segment out all the speakers, although it can with a minor modification. But that should work well enough if you're simply looking to create a dataset of a particular person.



That would be awesome if you could suggest the modification I could try.


Did you successfully get a transcript from it? If not, do that first so you have a baseline for what output to expect. If so, then simply replace primarySpeaker with your desired speaker where the script loops through and grabs lines for the given speaker. I believe whisperx labels them something to the effect of SPEAKER_01.

But an even easier option is to just leave it as is and find audio where your desired character is the main speaker.
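Concretely, whisperx tags each segment with labels like SPEAKER_00, SPEAKER_01, and so on, so the filter really is about one line. A tiny self-contained sketch with made-up data (the variable names in the linked script will differ):

```python
# Hypothetical diarized output in whisperx's usual segment format
diarized = {
    "segments": [
        {"start": 0.00, "end": 2.10, "speaker": "SPEAKER_00", "text": "Hello."},
        {"start": 2.10, "end": 4.00, "speaker": "SPEAKER_01", "text": "Hi there."},
    ]
}

desired_speaker = "SPEAKER_01"  # whichever label the diarizer gave your target voice
wanted = [seg for seg in diarized["segments"] if seg.get("speaker") == desired_speaker]
print(wanted)  # only SPEAKER_01's lines
```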


It seems that with this iteration of the code it didn't produce/dump a transcript, but it does segment the audio into Speaker 1 and Speaker 2. I think this works best for large datasets rather than small ones. Anyway, thanks for your help. I'll try to mess around with this further, but I think it does a decent job for a large audio set.



The code I provided produces a transcript in the RawTranscripts folder from a wav file in the PreProcessedAudio folder. You can then use the transcript to extract the relevant audio. I assumed you just needed help with the diarization part, so this file alone only produces a diarized transcript and isn't a complete dataset preparation pipeline. But if you don't mind getting your feet a little wet, you can use pydub and a little Python to extract the segments. That will be enough to get started on a small dataset.
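Something along these lines, assuming you have (or have parsed out) segment start/end times in seconds plus a speaker label for each one; the paths and the hard-coded segments below are placeholders, not what the script actually emits:

```python
import os
from pydub import AudioSegment

# Placeholder segments: (start_seconds, end_seconds, speaker_label)
segments = [
    (16.483, 17.003, "SPEAKER_00"),
    (31.745, 33.066, "SPEAKER_01"),
]

audio = AudioSegment.from_wav("PreProcessedAudio/sample.wav")  # path is an assumption
os.makedirs("Clips", exist_ok=True)

for i, (start, end, speaker) in enumerate(segments):
    if speaker != "SPEAKER_00":  # keep only the speaker you care about
        continue
    clip = audio[int(start * 1000):int(end * 1000)]  # pydub slices in milliseconds
    clip.export(f"Clips/{speaker}_{i:04d}.wav", format="wav")
```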


I was also looking for something like this, and thanks to you guys, I was able to achieve what I was going for. I slightly modified Fresh's code, and also modified snippets from the colab linked here: https://github.com/openai/whisper/discussions/264

I ended up with two scripts. The first one is a modification of Fresh's script that outputs every speaker's dialogue into a txt file, in this format:

16.483.17.003. SPEAKER_00: "This is a test message."
31.745.33.066. SPEAKER_01: "This is a message from another speaker."

The numbers separated by the periods are meant to represent:

starttimeseconds.starttimemilliseconds.endtimeseconds.endtimemilliseconds.

I output the data this way so that it's easier to parse with the second script, which in turn reads the txt file and strips out all of the audio for the specified speaker. It separates snippets by 500 ms (you may want to adjust that for your needs) and exports the results to one wav file. Using the scripts should be pretty simple, but please let me know if there's anything I can clarify.
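For anyone who wants to roll their own parser for that line format, here's a rough sketch. It's untested against the actual pastebin scripts, so treat the regex and the field order as an assumption based on the description above:

```python
import re

# Matches lines like: 16.483.17.003. SPEAKER_00: "This is a test message."
LINE_RE = re.compile(r'^(\d+)\.(\d+)\.(\d+)\.(\d+)\.\s*(SPEAKER_\d+):\s*"(.*)"$')

def parse_line(line):
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    s_sec, s_ms, e_sec, e_ms, speaker, text = m.groups()
    start_ms = int(s_sec) * 1000 + int(s_ms)
    end_ms = int(e_sec) * 1000 + int(e_ms)
    return start_ms, end_ms, speaker, text

print(parse_line('16.483.17.003. SPEAKER_00: "This is a test message."'))
# -> (16483, 17003, 'SPEAKER_00', 'This is a test message.')
```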

Edit: I'm not sure what went wrong with the formatting. This is my first time using this site, sorry about that.

I've uploaded them to pastebin:

Script 1: https://pastebin.com/2JfY4PWr
Script 2: https://pastebin.com/ipWw8YCD

You will have to install pydub if you don't already have it. Hope this helps!

Reference: mrq/ai-voice-cloning#319