Adding faster-whisper backend #222

Open
opened 2023-04-29 21:47:30 +00:00 by nanonomad · 2 comments

Hi,
There's yet another implementation of Whisper out there called faster-whisper, and I've found it to be ~2-3x faster when I've been doing transcriptions for training VITS models.

I'm not a coder, but I tried implementing it in the AI Voice Cloning UI, in case anyone else wants to replicate it.

It'll probably whine about some CUDA library not being found in the path, so find where that library actually lives on your system, then:
source ./venv/bin/activate
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:path_to_missing_libs_here
deactivate

Code changes:
Change compute_type to load the model as int8_float16, float32, or float16, whichever suits your hardware.
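For reference, this is roughly how compute_type is passed when constructing the model; the model size "large-v2" below is only a placeholder, and the float16 variants are meant for a CUDA device:

	from faster_whisper import WhisperModel

	# Pick one compute_type for your hardware; these are names CTranslate2 accepts.
	model = WhisperModel("large-v2", device="cuda", compute_type="float16")         # full fp16 on GPU
	# model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")  # int8 weights, fp16 compute
	# model = WhisperModel("large-v2", device="cpu", compute_type="float32")        # plain fp32 on CPU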

Line 53 in src/utils.py:
WHISPER_BACKENDS = ["openai/whisper", "lightmare/whispercpp", "m-bain/whisperx", "guillaumekln/faster-whisper"]

~line 2020 in src/utils.py:
if args.whisper_backend == "guillaumekln/faster-whisper":
	from faster_whisper import WhisperModel

	device = "cuda" if get_device_name() == "cuda" else "cpu"
	segments, info = whisper_model.transcribe(file, language=language, beam_size=5)
	result = {'text': [], 'language': {}, 'segments': []}
	for segment in segments:
		reparsed = {
			'start': segment[2] / 100.0,
			'end': segment[3] / 100.0,
			'text': segment[4],
			'id': len(result['segments'])}
		#result['text'].append( segment[2] )
		result['text'].append(segment[4])
		result['segments'].append(reparsed)
		result['language'] = language
	result['text'] = " ".join(result['text'])
	return result

~line 3696 in src/utils.py:
elif args.whisper_backend == "guillaumekln/faster-whisper":
	# or run on GPU with INT8:
	# whisper_model = WhisperModel(model_name, device="cuda", compute_type="int8_float16")
	# or run on CPU with INT8:
	# whisper_model = WhisperModel(model_name, device="cpu", compute_type="int8")
	from faster_whisper import WhisperModel
	try:
		# is it possible for the model to fit in VRAM but go OOM later while executing on data?
		whisper_model = WhisperModel(model_name, device="cuda", compute_type="float32")
	except:
		print("Out of VRAM, falling back to loading Whisper on CPU.")
		whisper_model = WhisperModel(model_name, device="cpu", compute_type="float32")

Reqs:
cd text-generation-webui
source ./venv/bin/activate
git clone https://github.com/guillaumekln/faster-whisper.git
pip install -e faster-whisper/
deactivate
./start.sh
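After that, a quick sanity check outside the UI (assuming some test.wav is lying around; "tiny" keeps the model download small) would look something like:

	from faster_whisper import WhisperModel

	model = WhisperModel("tiny", device="cpu", compute_type="int8")
	segments, info = model.transcribe("test.wav", beam_size=5)
	print("Detected language:", info.language, "with probability", info.language_probability)
	for segment in segments:
		print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")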

Author

Well, that formatting looks like shit, and I pasted the wrong code. Let's try again: the start/end values are already in seconds, so no division by 100 is needed.
~Line 2020 in src/utils.py:

	from faster_whisper import WhisperModel
	device = "cuda" if get_device_name() == "cuda" else "cpu"
	segments, info = whisper_model.transcribe(file, language=language, beam_size=5)
	result = {'text': [], 'language':{}, 'segments': []}
	for segment in segments:
		reparsed = {
			'start': segment[2],
			'end': segment[3],
			'text': segment[4],
			'id': len(result['segments'])}
		result['text'].append(segment[4])
		result['segments'].append(reparsed)
		result['language'] = language
	result['text'] = " ".join(result['text'])
	return result
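For what it's worth, the Segment objects faster-whisper yields also expose named fields, so the reparse loop can skip positional indexing entirely. A sketch of the same body, under the same assumptions about file, language, and whisper_model as above:

	segments, info = whisper_model.transcribe(file, language=language, beam_size=5)
	result = {'text': [], 'language': language, 'segments': []}
	for segment in segments:
		result['segments'].append({
			'start': segment.start,  # seconds
			'end': segment.end,      # seconds
			'text': segment.text,
			'id': len(result['segments']),
		})
		result['text'].append(segment.text)
	result['text'] = " ".join(result['text'])
	return result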
Owner

So:

  • whisperX is already implementing it in v3 (https://github.com/m-bain/whisperX/tree/v3)
    • but you need to manually clone it and edit its requirements to drop the hard pin on an ancient torch version
    • I have not tested it myself yet
  • implementing faster-whisper on its own would avoid a dependency on whisperX (which has been a bit of a dependency pain before)
    • but it won't have whisperX's VAD filtering for better timestamps, or diarization (whenever it actually works well enough)

For now, I'm going to see how whisperX's integration of faster-whisper fares, and depending on how it goes then I'll add in faster-whisper itself (which realistically can either wow my socks off or make me want to scream and shit myself in anger, so either way it'll get added as a backend, eventually).


I'm very mixed about it. As usual, I had a decent writeup about my issues, but I figured I'd scrap it, since it's mostly based on whisperX's implementation (which has really soured me with how hard it was to get working).

Ignoring the headaches of getting it to work after trashing several venvs, the main crux is that I do not trust its timestamps. The bane of whisperX for me before was its terrible timestamps, and what made me turn around and accept it again was the VAD filter it uses redeeming those timestamps. Faster-whisper (and by extension whisperX v3) doesn't have that cushion.

Maybe when I'm not so fried I'll give raw faster-whisper a shot rather than seeing the intermediate transcript in whisperX.

Reference: mrq/ai-voice-cloning#222