It's supported by whisperx, see notes on Speaker Diarization in the README.
Activate the virtual environment with venv\Scripts\activate
then pip install -r .\requirements.txt
raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening './training/tboiNarrator/audio/see 4ever 1_00000.wav': System error.
Mea…
Might be better to re-clone the whole repo and run setup again.
How much VRAM do you have? If it's 8 GB or less, knock the number of training elements down to 96 and try a batch size of 32.
"Error no kernel image is available for execution on the device" likely indicates that you don't have CUDA installed correctly, or that the installed PyTorch build wasn't compiled for your GPU's compute capability.
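One quick way to narrow this down is to compare what the installed PyTorch build was compiled for against what the GPU reports. This is only a rough sketch, assuming PyTorch is what's throwing the error; `cuda_report` is a made-up helper name, not something from the repo.

```python
def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup as a plain dict."""
    try:
        import torch
    except ImportError:
        # PyTorch isn't installed in this environment at all.
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only build of PyTorch.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        major, minor = torch.cuda.get_device_capability(0)
        report["device_name"] = torch.cuda.get_device_name(0)
        # Architecture of the first GPU, e.g. "sm_86".
        report["device_arch"] = f"sm_{major}{minor}"
        # Architectures this PyTorch build ships kernels for.
        report["compiled_archs"] = torch.cuda.get_arch_list()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `device_arch` isn't covered by anything in `compiled_archs`, a different PyTorch build is probably needed rather than a CUDA reinstall.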
There's one for vanilla Tortoise-TTS which would probably make a good starting point.
Changing the sample rate may not have any noticeable effect other than increasing training times, see notes regarding RVQ bins in #152.
Both low and high-pitched voices come out closer to the median. Might improve with more training cycles but I usually just pitch-shift it with ffmpeg.
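For the ffmpeg pitch shift, one common approach is chaining the `asetrate`, `aresample` and `atempo` audio filters so the pitch moves but the duration stays the same. A minimal sketch that just builds the command; `pitch_shift_args` is a hypothetical helper, the file names are placeholders, and it assumes you pass the real sample rate of the input file.

```python
def pitch_shift_args(src, dst, semitones, sample_rate):
    """Build an ffmpeg command that shifts pitch by `semitones`, keeping duration.

    `sample_rate` must be the actual sample rate of `src` (an assumption
    of this sketch; it is not probed from the file).
    """
    if abs(semitones) > 12:
        # Keeps the atempo factor inside the 0.5-2.0 range.
        raise ValueError("semitones must be between -12 and 12")

    factor = 2.0 ** (semitones / 12.0)
    filters = (
        # Reinterpret the samples at a new rate: changes pitch and speed.
        f"asetrate={round(sample_rate * factor)},"
        # Resample back to the original rate.
        f"aresample={sample_rate},"
        # Undo the speed change so only the pitch shift remains.
        f"atempo={1.0 / factor:.6f}"
    )
    return ["ffmpeg", "-y", "-i", src, "-filter:a", filters, dst]


# Usage (needs ffmpeg on PATH):
#   import subprocess
#   subprocess.run(pitch_shift_args("in.wav", "out.wav", -2, 22050), check=True)
```

Negative values lower the voice, positive values raise it. Small shifts of a semitone or two tend to be enough to pull a voice back from the median.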