Generated voices from training data always garbled.... but works fine using tortoise-tts-fast ... (?) #113

Closed
opened 2023-03-11 01:42:40 +00:00 by nirurin · 60 comments

http://sndup.net/dxm4 = Mrq version

This is what my training data (trained with the mrq repository) sounds like when I generate speech using the mrq-based generation UI. This is pretty much the best outcome I've had. It's from a model trained for 200 epochs, but I've tried everything up to 9000 epochs and the result is the same garbled output.

Someone suggested trying tortoise-tts-fast to generate instead.

Using the same training data .pth file.

https://sndup.net/jkhc/ = tts-fast generator (https://github.com/152334H/tortoise-tts-fast)

... and it actually sounds like it's supposed to! By far my best result. It even seems to have the right accent.

Which makes me now ask... what's the issue with the tts generator in this repo? I can only assume I'm doing something wrong, but I've tried every temperature from 0.1 up to 1.0, and I've tried ultra-fast, fast, standard and high quality.

Owner

That's strange. Can you send over that model for me to test against?

I can't imagine I broke anything. The only thing I can think of would be it somehow not actually using the model for generating.

nirurin changed title from Generated voices from training data always garbled.... but words fine using tortoise-tts-fast ... (?) to Generated voices from training data always garbled.... but works fine using tortoise-tts-fast ... (?) 2023-03-11 01:47:13 +00:00
Author

> That's strange. Can you send over that model for me to test against?
>
> I can't imagine I broke anything. The only thing I can think of would be it somehow not actually using the model for generating.

The .pth file? Sure, I can.

https://nirin.synology.me:10003/d/s/shjzhza68dqHiOpbkfV7N1fQmtcCXkg4/-RMnQPcvC_fJhQEhSL2fxTV4MlYdIHfb-b7GAd1L5Rgo

and for the sake of sanity I have also added screenshots of the generator page and my settings page, showing that I do (I think?) have everything set up correctly

Owner

Oh right, I can actually check the settings for the one generated through AIVC: [attached screenshot of the generation settings]

The only thing I can think of right now is that it's not using Half-P and Cond. Free (ironically, the settings that I forgot to fix last night). I doubt those are the culprits. I just did a test on a model and it sounded terrible, but that's probably just the model I tested being terrible anyway; later generations sound better, though still not right.


Oh you beat me to the punch. I'll test the model. I doubt I caused a regression between yesterday and today.

Author

I have done training on various versions of the mrq repo over the last few days (I tend to update whenever you say you've fixed something haha) but all of my tests have been garbled like this (usually even worse than the example I gave above).

I don't know what's different with the fast-tts generator, but it seems to be picking up the accent (which seems to be the main issue with untrained voices) and isn't at all garbled etc.

But yeah, it's not that the mrq repo broke today; it has literally never worked for me (across the last few days of various versions and updates) for generating a legible/understandable voice.

Owner

Oh, there's one other difference between the two tortoise-tts forks: how the latents are calculated. I don't recall seeing the other tortoise saving latents, so there won't be an easy way to transplant from fast. I think you can load them fine in fast, though. I'll need to check the code again.

I'll return with instructions on how to go about it in a few minutes.

Author

Thanks for looking into it. I would much prefer it if I could do the entire process within your repo version, as the gui and everything is so much nicer to use (it took me ages of struggling to get tts-fast working earlier for testing, as whoever runs the repo has a bunch of broken instructions -.- )

Owner

Yeah, it looks like you can just copy the `./voices/patrick/cond_latents_f90a07a1.pth` file and paste it into the fast repo's voices folder (I'd suggest a new voice folder, just to make sure it loads only that). If it sounds terrible, then the culprit is how the latents are being computed.

Contributor

Regarding latents and "Leveraging LJSpeech dataset for computing latents," does that mean that the latents are sensitive to the quality of the generated dataset by whisper/whisperx/whispercpp? So theoretically better processing with larger whisper model (even on cpu) would produce higher quality latents? Whisper in its various iterations doesn't take too long to run on reasonably sized datasets, I'm sure most users wouldn't mind spending the extra few minutes generating data with whisper if it means the latent quality increases.

Regarding latents and "Leveraging LJSpeech dataset for computing latents," does that mean that the latents are sensitive to the quality of the generated dataset by whisper/whisperx/whispercpp? So theoretically better processing with larger whisper model (even on cpu) would produce higher quality latents? Whisper in its various iterations doesn't take too long to run on reasonably sized datasets, I'm sure most users wouldn't mind spending the extra few minutes generating data with whisper if it means the latent quality increases.
Owner

Regarding latents and "Leveraging LJSpeech dataset for computing latents," does that mean that the latents are sensitive to the quality of the generated dataset by whisper/whisperx/whispercpp?

Not so much the quality of the transcription itself, but how accurate the segments are. Naturally, smaller whisper models will not segment the audio as nicely as the larger models.

So theoretically better processing with larger whisper model (even on cpu) would produce higher quality latents?

In theory, yeah. I haven't so much tested it with lesser quality segmentings.

Whisper in its various iterations doesn't take too long to run on reasonably sized datasets, I'm sure most users wouldn't mind spending the extra few minutes generating data with whisper if it means the latent quality increases.

The only caveat is VRAM consumption for whisper/whisperx for the large models, or normal RAM consumption if you manage to set them to use the CPU only. whispercpp has less egregious RAM requirements; I think even its large model consumes ~5GiB, I think half the size of the large models for whisper/whisperx. For a dataset of about ~200.

Although, it's just in theory. As I said I haven't tested transcriptions on anything less than large-v2 for the dataset-hinting.

Author

> Yeah, it looks like you can just copy the `./voices/patrick/cond_latents_f90a07a1.pth` file and paste it into the fast repo's voices folder (I'd suggest a new voice folder, just to make sure it loads only that). If it sounds terrible, then the culprit is how the latents are being computed.

I deleted all the voice folders and put the conditional latent file in the voices folder, and the generated speech was garbled (much like what the mrq version does).

I then kept everything the same, but changed the voice file pointer to the patrick source voices folder in the mrq repo (still using the same AR model path, the 2836.pth file), and it output proper speech with the accent.

So I guess the mrq latents are the issue (?)

Owner

I suppose so. I guess my fancy magic super duper accuracy boost is flawed in some cases.

If that's the case, you can always manually set the voice chunk size to something to keep it from leveraging the LJ dataset. I don't have a good number, but something like 8 never hurts.

Owner

That is also to say, I might need to have it use the default behavior too, if it works better in some cases, since I believe the fast repo by default uses the old behavior (use the first 4 seconds of each sound file). I'll assume you haven't touched that setting for the other fork.

Author

> That is to also say, I might need to have it use the default behavior too, if it works better in some cases, since I believe the fast repo by default uses the old behavior (use the first 4 seconds of each sound file). I'll assume you haven't touched that setting for the other fork.

I'm actually using this repo for it - https://github.com/152334H/DL-Art-School

but I think it just uses the default fast-tts repo, just with a gui on top.

I'll try yours with voice-chunk = 8 (instead of the default 0)

Should I be using half-precision and conditioning-free in the experimental settings?

edit - Tried voice-chunk 8 and 12, but both ended up with

"CUDA out of memory. Tried to allocate 1.19 GiB (GPU 0; 24.00 GiB total capacity; 22.03 GiB already allocated; 0 bytes free; 22.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

Owner

For sure enable cond-free; half-precision I'm still iffy on. Ever since BitsAndBytes was crammed in, I feel it sometimes boosts quality a bit, but it's probably placebo.

Which reminds me, I need to fix cond-free not defaulting to being on.

Owner

I guess you have too large of a dataset. Bump it to 16 or 24, or however high it needs to be.

I just tested a model that had terrible output, and I manually set it to 24 chunks and it started to sound better, so I guess it is how latents are being calculated. Oops.

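For intuition on why the chunk count matters: the voice samples get split into that many batches before latents are computed, so a higher count means smaller batches and a lower peak allocation. A rough sketch of the idea (not the actual code in the repo; `compute_latents_for_batch` is just a placeholder):

```python
# Rough sketch only: more chunks -> smaller batches -> lower peak VRAM.
import torch

def compute_latents_for_batch(batch):
    # Placeholder for the real model call; here it just averages the (equal-length) waveforms.
    return torch.stack(batch).mean(dim=0)

def chunked_latents(samples, chunks):
    """Compute latents chunk by chunk and average them, instead of one giant batch."""
    chunk_size = max(1, len(samples) // chunks)
    partials = []
    for i in range(0, len(samples), chunk_size):
        batch = samples[i:i + chunk_size]                  # smaller batch, smaller allocation
        partials.append(compute_latents_for_batch(batch))
        if torch.cuda.is_available():
            torch.cuda.empty_cache()                       # free the batch before the next one
    return torch.stack(partials).mean(dim=0)
```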
Author

Had to put chunks up to 32 to avoid out-of-memory errors (seems even 24GB of VRAM isn't enough these days lol).

But I am now getting actual accented voices out of the generator!

I wonder why your fancy accuracy boost doesn't work for me?!

Owner

It might have to do with sound file length. If there's even one really long sound file compared to the rest, it'll cause all the other sounds to be padded to that length so they match. It could be from that: a lot of the latents would just be dead air.

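To illustrate what I mean, here's a minimal sketch of the padding (illustrative only, not the repo's actual code): every clip gets right-padded with silence up to the longest clip's length, so one very long file turns everything else into mostly dead air.

```python
# Minimal sketch: pad every waveform with silence up to the longest one's length.
import torch
import torch.nn.functional as F

def pad_to_longest(waveforms):
    longest = max(w.shape[-1] for w in waveforms)
    return torch.stack([F.pad(w, (0, longest - w.shape[-1])) for w in waveforms])

# e.g. one 1-hour clip at 22.05kHz forces ~79M samples of padding onto a 5-second clip.
```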
Contributor

If you're OOMing while generating, try lowering your Sample Batch Size in settings.

It's possible your fancy magic super duper accuracy boost is correct, but it is the whisper transcribe and dataset that is wrong. Personally I am getting audio chunks that are cut off too early and I don't even know how to manually correct all of it. I can manually edit wav files, but without remaking whisper.json, wouldn't it make the latents even more broken? Does whisper.json even matter for this?

Ideally whisper would get it perfect every time. But if it can't, then whisper -> human -> latents generation based off dataset would be ideal.

Owner

> If you're ooming while generating try lowering your Sample Batch Size in settings.

It's OOMing when generating the latents. Low latent chunk counts with a very large dataset will OOM since it's using larger batches.

> Personally I am getting audio chunks that are cut off too early and I don't even know how to manually correct all of it.

WhisperX is better at segmenting since it does forced alignment blahblahblah, if you aren't already using that.

> but without remaking whisper.json, wouldn't it make the latents even more broken? Does whisper.json even matter for this?

`whisper.json` doesn't get used at all; it's just an output file that I may use later in the future (probably never, but it doesn't hurt to retain it).

You're free to manually edit and re-trim them, even after transcription, for training. The only magic it does for latents is to pad every segment to the largest segment's size and set the chunk size to the dataset count.

Contributor

The thing is, I have very little to complain about with the whisperx transcription process; it's the splitting of the audio into chunks that is sus. I've seen on https://github.com/m-bain/whisperX that they have a lot of features we aren't using right now. Sadly some of the models require a huggingface token to be functional, but perhaps some of the extra features could help with splitting the audio.

Author

> It might have to do with sound file length. If there's even one really long sound file compared to the rest, it'll cause all other sounds to pad to the largest length to easily match. It could be from that, and a lot of latents are using dead air.

ahhhh... I dooooo have a sound file which is 1hour+ because it's a narrated audiobook file (with the bad bits snipped out, but still kept as one single file).

I was going to try just using the whisper-cut-up files as the voice files, but someone suggested not to do that because they might cause issues from being too short!

Owner

> sadly some of the models require a huggingface token to be functional

desu I don't think either VAD or diarization is necessary; if your samples are that dirty, I don't think you should be using them for generating latents or for finetuning anyway:

  • VAD only really seems useful if there's substantial background noise and whisper is picking up on it as cues for starts/ends; but in that case you still have noise, and you should at least try to denoise/isolate the vocals first
  • diarization would be nice when sourcing samples that don't trim between two speakers (namely, stuff from that Silent Hill resource website; it's a bit of a pain since about half of the voice samples there have two speakers)

I could really, really be wrong though, as I am not an expert.

Owner

> ahhhh... I dooooo have a sound file which is 1hour+ because it's a narrated audiobook file (with the bad bits snipped out, but still kept as one single file).

That'll do it. If it's in `./training/patrick/audio/`, then it'll try and pad every other segment to an hour+.

> I was going to try just using the whisper-cut-up files as the voice files, but someone suggested not to do that because they might cause issues from being too short!

Which I managed to finally catch and work around. DLAS has a hidden minimum sound length that, for some reason, it doesn't mention. The transcription process should throw out anything that would make the training process unhappy, so you're free to segment them with the transcription process.

Author

> > ahhhh... I dooooo have a sound file which is 1hour+ because it's a narrated audiobook file (with the bad bits snipped out, but still kept as one single file).
>
> That'll do it. If it's in `./training/patrick/audio/`, then it'll try and pad every other segment to an hour+.

Oh, no, that file is in the voices/patrick folder. In the training/patrick/audio folder it's been cut up by whisper into a bunch of short files.

Maybe one of the supposedly 'short' cut-up files is iffy and longer than it should be... I will check. If I delete an audio file, do I just need to remove it from the train.txt file, or will it need to be removed from other places too?

> > I was going to try just using the whisper-cut-up files as the voice files, but someone suggested not to do that because they might cause issues from being too short!
>
> Which I managed to finally catch and work around. DLAS has a hidden minimum sound length that, for some reason, it doesn't mention. The transcription process should throw out anything that would make the training process unhappy, so you're free to segment them with the transcription process.

Contributor

Can we make the cuts for whisper?
Say I take my 1-minute source and cut it into sentences. Can whisper then take each sentence and not cut it further? That's what comes to mind when thinking about how to stop whisper from cutting off mid-sentence.

Owner

> Oh, no, that file is in the voices/patrick folder. In the training/patrick/audio folder it's been cut up by whisper into a bunch of short files.

Ah, then it shouldn't affect it, as the "leveraging of the LJ-formatted dataset" thing will source its sound inputs from the `./training/{voice}/` folder itself, rather than what's in the voice folder.

> If I delete an audio file, do I just need to remove it from the train.txt file, or will it need to be removed from other places too?

You can simply remove the line from the text file itself. The process reads from the `train.txt` file and loads the voice samples from there.

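For reference, each line in `train.txt` is just an LJSpeech-style `path|transcription` pair, so deleting a clip really is just deleting its line. The filenames and text below are made up, purely to show the shape (the exact relative paths depend on your dataset folder):

```
audio/00001.wav|And that, I'm afraid, was the end of it.
audio/00002.wav|He never spoke of it again.
```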
Owner

> Say I take my 1 minute source, and cut it into sentences. Can whisper then take each sentence and not cut it further? That's what comes to mind when thinking how to stop whisper from cuttin off mid sent

I'll need to read up on whether whisper has a flag for it. If not, then I suppose I can implement it myself by stitching fragments together to form sentences instead.

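Something along these lines is what I have in mind; a rough sketch, assuming the usual whisper segment dicts with `start`/`end`/`text` keys (untested against the repo):

```python
# Sketch: merge consecutive whisper segments until one ends with sentence-final
# punctuation, keeping the first start time and the last end time.
def merge_into_sentences(segments):
    merged, current = [], None
    for seg in segments:
        if current is None:
            current = {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        else:
            current["text"] += " " + seg["text"].strip()
            current["end"] = seg["end"]
        if current["text"].endswith((".", "!", "?")):
            merged.append(current)
            current = None
    if current is not None:
        merged.append(current)  # trailing fragment without closing punctuation
    return merged
```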
Contributor

I don't know if it's good or not, just brainstorming. This will be awful for long files. Although, theoretically, if whisper can play nice with what it is given, then another tool can do the cutting.

Owner

It should be good; my main issue with whisper is that I'm seeing a lot of single words that get segmented off. I just need to also evaluate how intrusive it'll be to implement.

Author

> > Oh, no, that file is in the voices/patrick folder. In the training/patrick/audio folder it's been cut up by whisper into a bunch of short files.
>
> Ah, then it shouldn't affect it, as the "leveraging of the LJ-formatted dataset" thing will source its sound inputs from the `./training/{voice}/` folder itself, rather than what's in the voice folder.
>
> > If I delete an audio file, do I just need to remove it from the train.txt file, or will it need to be removed from other places too?
>
> You can simply remove the line from the text file itself. The process reads from the `train.txt` file and loads the voice samples from there.

So I just did a fresh transcribe using whisperX and LargeV2.

The longest file is 15s, with a couple 14s and a bunch of 9s and 10s.

They then ramp down all the way to 1s.

But there are also a lot (158 out of 1343 files) which are 0s. Some still have a single word or two in there, but there's a bunch which seem to be completely empty (they probably have a single word in the transcription, but I can't hear anything).

Author

Oh, most of those files seem to be in validation.txt, not in train.txt. Not sure what that file does.

Contributor

I see https://github.com/m-bain/whisperX/blob/main/whisperx/transcribe.py has `no_speech_threshold: Optional[float] = 0.6`:

> no_speech_threshold: float
> If the no_speech probability is higher than this value AND the average log probability over sampled tokens is below `logprob_threshold`, consider the segment as silent

Maybe needs tweaking or to be user customizable?

Why does whisper even need to splice the audio up? If you're making a TV series you want one line per subtitle, but we don't care about any of that; we just want it to fit in memory. Hell, computing latents should be an infrequent event. We already have an option to do it on the CPU, so it doesn't even have to fit in VRAM, just RAM. I bet in many users' use cases the entire voice sample would fit in one go, along with the whisper model.

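For what it's worth, stock whisper takes both thresholds as arguments to `transcribe()` (whisperx's copy, quoted above, looks the same), so exposing them would presumably just mean passing them through. Something like this, with made-up values:

```python
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe(
    "patrick.wav",
    no_speech_threshold=0.8,  # default 0.6; higher = fewer segments written off as silence
    logprob_threshold=-1.0,   # both conditions must hold for a segment to be treated as silent
)
print(len(result["segments"]))
```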
Author

I do think it could only be an improvement if there was an automated way to remove any transcribed clips that are below a certain length, as most of those are half-words or weirdly cut off. Not sure how easy/possible that is to implement in the mrq gui.

Is the length difference in my files (0s up to 15s) enough to be the cause of my garbled latents, do you think?

Contributor

I can't imagine 0s files being anything other than poorly cut off. If you have enough data then I'd drop the worst part of it.

Author

> I can't imagine 0s files being anything other than poorly cut off. If you have enough data then I'd drop the worst part of it.

Yeah, I agree; that's why I mentioned it would be nice if this could be automated... as I'll have to manually remove ~150 entries from the text file lol.

Owner

> But there is also a lot (158 out of 1343 files) which are 0s. Some still have a single word or two in there, but there's a bunch which seem to be completely empty (they probably have a single-word in the transcription but I can't hear anything)

~11% might well be enough to significantly throw off the latents.

> oh most of those files seem to be in the 'validation.txt' not in train.txt. Not sure what that file does.

The validation dataset is a list of samples that get culled out from the training dataset for being under a certain length (default 12 characters), and can be recycled to run validation if requested during training.

You'll probably want to re-generate your dataset after making sure AIVC is up to date, as it'll check for short sound files (and not short transcriptions) to cull completely.

> Why does whisper even need to splice the audio up?

Having one giant long text string and one giant long voice file is a very, very bad thing for training.

> Hell, computing latents should be an infrequent event. We already have an option to do it on cpu

I'm frequently regenerating them when I'm comparing between checkpoints for a model, then scrapping the model entirely to retune from scratch.

Before having it source the segmented transcriptions, I was also constantly toying with the chunk size to find a close-to-the-best setting for a model, finetuned or not.

> We already have an option to do it on cpu. It doesn't even have to fit in vram, just ram.

It's slow on CPU, believe me. When I was running it on DirectML on my 6800XT instead of CUDA on my 2060, it was a huge pain even on smaller counts. It adds up a lot too when it's tiny segments.

> I bet many users' use case the entire voice sample would fit in one go, along with the whisper model.

No way. From what caught wind back to me from /g/, there was a substantial number of posts about OOMing while generating latents because of how I was handling the default. It was a naive approach of just using the file count, and people tended to have one giant voice file; they didn't know better, and my original documentation from back when it was just a rentry OK'd it, since no one knew exactly how latents were handled. Even after modifying it, I still OK'd it by just having a chunk size cap.

> I do think it could only be an improvement if there was an automated way to remove any transcribed clips that are below a certain length, as most of those are half-words or weirdly cut off. Not sure how easy/possible that is to implement in the mrq gui.

It did for the past few days, if you used the culling-to-the-validation-dataset-based-on-text-length. As of a few hours ago, though, it'll ignore very short clips entirely (I think the cutoff is 0.6 seconds, as DLAS will throw errors for anything below that).

Owner

> Yeh I agree, though that's why I mentioned it would be nice if this was able to be automated... as I'll have to manually remove ~150 entries from the text file lol.

Just run `Prepare validation` in the `Train` > `Prepare Dataset` tab. Play around with the cutoff value to see how many lines get removed.

If I have the brain juices left I could probably whip up a setting to do it by sound length.

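In the meantime, a throwaway script can do the by-length cull outside the UI. A rough sketch, assuming LJSpeech-style `path|transcription` lines and plain WAV clips that Python's `wave` module can open (paths are illustrative):

```python
# Sketch: drop train.txt lines whose audio clip is shorter than a cutoff.
import wave
from pathlib import Path

dataset = Path("./training/patrick")   # illustrative path
cutoff_seconds = 0.6                   # roughly the minimum DLAS seems happy with

kept = []
for line in (dataset / "train.txt").read_text(encoding="utf-8").splitlines():
    rel_path = line.split("|", 1)[0]
    with wave.open(str(dataset / rel_path), "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration >= cutoff_seconds:
        kept.append(line)

# Write to a new file so the original stays untouched.
(dataset / "train_culled.txt").write_text("\n".join(kept) + "\n", encoding="utf-8")
```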
Author

> > Yeh I agree, though that's why I mentioned it would be nice if this was able to be automated... as I'll have to manually remove ~150 entries from the text file lol.
>
> Just run `Prepare validation` in the `Train` > `Prepare Dataset` tab. Play around with the cutoff value to see how many lines get removed.
>
> If I have the brain juices left I could probably whip up a setting to do it by sound length.

Ahh I see, yes, but that still leaves the short audio files in the folder. It would be nice if the short files were deleted as part of the culling, as I assume that, because they remain in the folder, they will still be used for the latent calculations (which may be part of the issue I'm having).

Though it seems that with 1000 files in the dataset, with only 1 being 15s and the majority being 5s or less, there will always be a lot more 'empty space' in the padded files than actual words.

Author

So I'm trying again, with a fresh training session, but now I seem to be getting the groany/garbled generated voices on both mrq (using manual voice chunks) AND in fast-tts lol. So this time I assume the issue is the training...

I'm using a pretty small dataset, which probably doesn't help.

Screenshotted the graphs from it. It was a pretty quick training run, only 100 epochs at 0.000085.

So now I need to figure out... did I train too fast? Or did I not train for long enough? Or... did I train for -too long- (if training for too long is even a thing?)

I went back and tried one of the saved-state pth files from about epoch 40 and it was still garbled, so it doesn't seem to be a too-long issue.

I'll try it again tomorrow, and double check all the training files it's using.


The biggest problem I'm having when it comes to transcription, even when using WhisperX, is the transcribed text having a full sentence, but the audio having the first or last word or two cut off. I don't think the transcription timestamps are accurate enough to cut the audio apart properly.

I'm using clean audio files too; they're roughly 5-minute long compiled clips of ripped voices with virtually no background noise.


> The biggest problem I'm having when it comes to transcription, even when using WhisperX, is the transcribed text having a full sentence, but the audio having the first or last word or two cut off. I don't think the transcription timestamps are accurate enough to cut the audio apart properly.
>
> I'm using clean audio files too; they're roughly 5-minute long compiled clips of ripped voices with virtually no background noise.

Same here. I usually run the dataset preparation, then check the files manually, then fix the ones where the audio starts late or cuts off short with Audacity (by re-cutting the original file).
That only works on small datasets.
It doesn't look like a consistent cut-off either; sometimes it cuts the start of the audio, sometimes the end, and sometimes it's fine.

Owner

Well shit.

I just transcribed some more datasets with whisperx+large-v2 and they're consistently cut off too soon at the end. I compared them against whisper and whisper does a better job at that. I genuinely don't know how my processed datasets were fine earlier, unless I actually wasn't using whisperx for those.

I'll see what I can do to fix it. I added a button to disable slicing, and a setting to cull too-short sound files and recycle them for validation purposes during training. I'll push those in a moment.

Owner

Pushed commit 2424c455cb. It seems every passing day I regret adding whisperx more and more.

~~I'm very, very tempted to just remove it. It caused nothing but trouble.~~ WhisperX is gone. I'm done with it.

Author

> Pushed commit 2424c455cb. It seems every passing day I regret more and more adding whisperx.
>
> ~~I'm very, very tempted to just remove it. It caused nothing but trouble.~~ WhisperX is gone. I'm done with it.

I had this same issue from day 1, with some of my transcribed files ending a word earlier than they should... but I didn't think it was a whisperx problem; I thought it also happened on the original whisper? Maybe I was mistaken.

What's the best setup to use now?


It's the same on whisper for me

Owner

> I thought it also happened on the original whisper?
>
> It's the same on whisper for me

Based on going through most of my voice samples, normal whisper's timestamps do have minor accuracy problems some of the time, but nowhere near as consistently or as egregiously as whisperx's.

desu there's no real point in keeping whisperx around when the main point of it was better timestamps for segmenting, and it fails at that so, so badly. It's just not worth the headaches it keeps bringing.

I think it'd be neat to use for diarization, but that requires an HF token, and given how terrible the timestamps are, I don't trust it all that much.

> What's the best setup to use now?

Ideally:

  • if you can, have your voice samples already segmented, and only really use whisper for transcription
    • unfortunately, most (if not almost all) of the samples I linked the rentry/mega for on the wiki are one single file (I imagine it was easier for people to share them as one file, and 11.AI doesn't really care about segments).
  • if you can't, base whisper is still serviceable at it, but definitely curate the results.
  • the whisper model doesn't matter too much if you're only transcribing English on decent audio. I imagine larger models are better for dirtier audio, but you really shouldn't be using dirty audio to train against.
    • it definitely does seem to matter for non-Latin languages, like Japanese, as it'll improve the accuracy of determining the right kana/kanji (lower models will coerce something that's de facto katakana into a kanji that sounds right)
    • I'd like to imagine timestamps get better with better models, but at this point there's no guarantee of perfect timestamps
  • I vaguely remember whispercpp providing pleasing timestamps, but that was from cursory tests while trying to make it work
    • I won't suggest it for Windows users, as it's a huge pain in the ass to get it to compile nicely with MSVC
    • it being CPU-only is a bit of a pill too

In short, if you don't need to segment, don't. I have it not slicing by default now, but you still can by either ticking the checkbox when transcribing, or clicking the slice button.

  • which means you can easily go and edit the timestamps by editing the `whisper.json`, finally giving it a use.
Author

https://github.com/openai/whisper/discussions/435

This seems to be the most recent discussion involving a fix for the inaccurate timestamps in whisper.

Otherwise, as you suggest, I may just DIY it.

I assume I can just load the file into an audio program, and find the exact timecode in seconds, and edit the whisper.json. Then re-slice.

The issue does just seem to be that whisper only slices in whole-integer seconds. Even though all the entries are to a decimal place (13.0, etc) they are never "12.3".

edit - https://github.com/openai/whisper/discussions/139

This discussion seems to confirm this, but also mentions it might be a bug with the large dataset... though other people also weigh in with reasons (and fixes, but I don't understand most of those)

edit2 -
Somehow I have managed to get one of my datasets to transcribe with actual decimal timestamps (down to like "14.233s" so it gets pretty accurate) and it seems at first glance to be pretty accurate to the text. This was using a medium whisper.

However I also tried medium on another voice, and it still only does it in whole integers. So ... no idea wtf.

edit3 -
Even though I got a set of timestamps with decimal precision (e.g. 14.333), they still weren't accurate. I even had some transcriptions which were about 20 words long, but the audio clip only contained 10 of them. The rest got spread into the next few slices, producing a bunch of incorrect clips.
I think I'll do my next test of training using manually-created slices.
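
As a quick way to spot the whole-integer symptom across a dataset, here's a minimal sketch that just checks whether every timestamp in a whisper.json is a whole second. It assumes the file mirrors whisper's usual result dict (a "segments" list with "start"/"end" floats), or a mapping of source file to such a dict, and the path is hypothetical:

```python
import json

def has_only_integer_timestamps(whisper_json_path: str) -> bool:
    """True if every segment start/end in the file is a whole second.

    Assumes the JSON mirrors whisper's result dict ({"segments": [{"start", "end", "text"}, ...]}),
    or a mapping of source file -> such a dict.
    """
    with open(whisper_json_path, encoding="utf-8") as f:
        data = json.load(f)

    results = [data] if "segments" in data else list(data.values())
    for result in results:
        for seg in result.get("segments", []):
            if not (float(seg["start"]).is_integer() and float(seg["end"]).is_integer()):
                return False
    return True

if __name__ == "__main__":
    # Hypothetical path; point it at your voice's whisper.json.
    print(has_only_integer_timestamps("./training/myvoice/whisper.json"))
```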

Owner

> I assume I can just load the file into an audio program, and find the exact timecode in seconds, and edit the whisper.json. Then re-slice.

Yeah, that'd be the path of least resistance when manually slicing them: open it in ~~Audacity~~ Tenacity, manually deduce start and end times, edit the `whisper.json`, then re-slice.

> The issue does just seem to be that whisper only slices in whole-integer seconds. Even though all the entries are to a decimal place (13.0, etc) they are never "12.3".

> This discussion seems to confirm this, but also mentions it might be a bug with the large dataset... though other people also weigh in with reasons (and fixes, but I don't understand most of those)

I see, that might explain some of the inconsistencies I saw with things getting timestamped fine-ish for me, since the stuff I transcribe on my personal system is done with the medium model. I peeped at a transcription from the large model and the timestamps are mostly whole numbers (they're coerced to decimals when stored as JSON, because I suppose that's how Python's serializer works).

> I think I'll do my next test of training using manually-created slices.

Yeah, that's probably for the best.


I've added blanket start/end offset options when slicing, and it seems pretty favorable, as I re-processed my Patrick Bateman samples and they mostly only need the start times offset back by 0.25 seconds.

I might add in a way to play back parsed sample pieces, so I don't have to keep transferring them off my beefy system to my personal system.
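
For anyone wanting to do the re-slice outside the web UI, here's a rough sketch of the idea with ffmpeg, applying blanket start/end offsets like the ones above. The whisper.json layout (a plain "segments" list) and the output naming are assumptions for illustration, not necessarily what this repo writes:

```python
import json
import subprocess
from pathlib import Path

def reslice(source_wav: str, whisper_json: str, out_dir: str,
            start_offset: float = -0.25, end_offset: float = 0.0) -> None:
    """Cut source_wav into per-segment clips using (possibly hand-edited) whisper timestamps."""
    with open(whisper_json, encoding="utf-8") as f:
        # Assumed layout: whisper's result dict with a "segments" list of {"start", "end", "text"}.
        segments = json.load(f)["segments"]

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, seg in enumerate(segments):
        start = max(0.0, seg["start"] + start_offset)
        end = seg["end"] + end_offset
        out_path = Path(out_dir) / f"{Path(source_wav).stem}_{i:04d}.wav"
        # -ss/-to take seconds; ffmpeg re-encodes the cut, which is fine for training clips.
        subprocess.run(["ffmpeg", "-y", "-i", source_wav,
                        "-ss", f"{start:.3f}", "-to", f"{end:.3f}", str(out_path)],
                       check=True, capture_output=True)
```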


How feasible would it be to run sliced audio through Whisper a second time to see if the transcription matches the sliced audio? If the re-transcription doesn't match, you can throw out that slice.

Owner

mm, I suppose that could be one way to automatically check if segments aren't trimmed too much. I could then dump the failures into another text file to narrow down what's needed for manual intervention.

The only qualm is the additional overhead; large datasets already take quite a while even on the medium model, although I'll say that if you're using a large dataset, you'd better damn well have your samples segmented already.
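
A hedged sketch of what that check could look like with openai-whisper and a plain text-similarity ratio; the paths, the 0.8 threshold, and the dict-of-slices input are all illustrative, not anything the web UI does today:

```python
import difflib
import whisper  # openai-whisper

def flag_bad_slices(slices: dict, model_name: str = "medium", min_ratio: float = 0.8) -> list:
    """Return slice paths whose re-transcription diverges from the text the dataset claims.

    `slices` maps a sliced wav path -> expected transcription.
    """
    model = whisper.load_model(model_name)
    failures = []
    for wav_path, expected in slices.items():
        got = model.transcribe(wav_path)["text"]
        ratio = difflib.SequenceMatcher(None, expected.lower().strip(), got.lower().strip()).ratio()
        if ratio < min_ratio:
            failures.append(wav_path)
    return failures
```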

Contributor

I'm struggling to understand the parameters of the problem here.
Assume we have a perfect transcription.

If we start off with a single long dialogue wav, where would we ideally want it cut? Every sentence? As close to but under 30 seconds (internal whisper constraint)? As high as memory permits? Or would it not matter at all? Do we need X number of audio files to train using batch size X?

If audio transitions are a point of failure, would we want as few of them as we can get away with? How few can we get away with, and what are we constrained by?

I'm assuming some information is carried over sentence to sentence, which makes me think every time we cut a consecutive dialogue we lose out on some information.

And alternatively, if we start off with many short wavs of isolated dialogue, perhaps as short as a few words each, would we always want to use them as-is and change nothing, since there's no meaningful way to gain information by stitching them together? At that point we would just want to transcribe, with no cutting or stitching whatsoever.

Whisper's use case was not necessarily model training, and we should be careful about assuming that just because whisper does something a particular way, it is what's best for our purpose.

Owner

> If we start off with a single long dialogue wav, where would we ideally want it cut?

Segment by every sentence, if:

  • it's under 200 characters, as defined in the training YAML
  • when tokenized as text tokens (phonemes, I assume, from `./models/tortoise/bpe_lowercase_asr_256.json`), it's under 402 tokens, as defined in the training YAML
  • the resulting sound file is under ~11 seconds, as defined in the training YAML (the dataset loader has a cap of 255995 frames, which at 22.05kHz is ~11.6 seconds)

and if it's too long, then I'd assume segmenting by clause, which is about what whisper likes to segment by anyways (when it's not outright cutting off the last word).
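
A minimal sketch of checking a candidate (audio, text) pair against those three limits: the frame cap and tokenizer path are the ones quoted above, while the 22,050 Hz sample rate and the plain lowercasing are my assumptions (tortoise's own tokenizer does more text cleaning than this):

```python
import wave
from tokenizers import Tokenizer  # HuggingFace tokenizers

MAX_CHARS = 200
MAX_TEXT_TOKENS = 402
MAX_FRAMES = 255995      # dataset loader cap quoted above
SAMPLE_RATE = 22050      # assumed sample rate (~11.6 s at this rate)

tokenizer = Tokenizer.from_file("./models/tortoise/bpe_lowercase_asr_256.json")

def passes_training_limits(wav_path: str, text: str) -> bool:
    """True if a (wav, transcription) pair fits the limits described above."""
    if len(text) > MAX_CHARS:
        return False
    # Rough token count; tortoise's own tokenizer applies extra text cleaning before encoding.
    if len(tokenizer.encode(text.lower()).ids) > MAX_TEXT_TOKENS:
        return False
    with wave.open(wav_path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return duration <= MAX_FRAMES / SAMPLE_RATE
```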

> As close to but under 30 seconds (internal whisper constraint)?

In terms of pure transcription, I don't think this is actually a problem, as I've had adequate transcriptions from everything-in-one-sound-file voice samples (such as the Stanley Parable narrator, and Gabe Newell) in my cursory tests.

> [...]

You're neglecting how the original LJSpeech dataset is formatted: it's fairly blasé about consistency. Some lines are multi-claused sentences, others split by clause, and some are just short sentences. For example:

```
LJ001-0101|It is discouraging to note that the improvement of the last fifty years is almost wholly confined to Great Britain.|It is discouraging to note that the improvement of the last fifty years is almost wholly confined to Great Britain.
LJ001-0102|Here and there a book is printed in France or Germany with some pretension to good taste,|Here and there a book is printed in France or Germany with some pretension to good taste,
LJ001-0103|but the general revival of the old forms has made no way in those countries.|but the general revival of the old forms has made no way in those countries.
LJ001-0104|Italy is contentedly stagnant.|Italy is contentedly stagnant.
LJ001-0105|America has produced a good many showy books, the typography, paper, and illustrations of which are, however, all wrong,|America has produced a good many showy books, the typography, paper, and illustrations of which are, however, all wrong,
LJ001-0106|oddity rather than rational beauty and meaning being apparently the thing sought for both in the letters and the illustrations.|oddity rather than rational beauty and meaning being apparently the thing sought for both in the letters and the illustrations.
LJ001-0107|To say a few words on the principles of design in typography:|To say a few words on the principles of design in typography:
LJ001-0108|it is obvious that legibility is the first thing to be aimed at in the forms of the letters;|it is obvious that legibility is the first thing to be aimed at in the forms of the letters;
LJ001-0109|this is best furthered by the avoidance of irrational swellings and spiky projections, and by the using of careful purity of line.|this is best furthered by the avoidance of irrational swellings and spiky projections, and by the using of careful purity of line.
LJ001-0110|Even the Caslon type when enlarged shows great shortcomings in this respect:|Even the Caslon type when enlarged shows great shortcomings in this respect:
LJ001-0111|the ends of many of the letters such as the t and e are hooked up in a vulgar and meaningless way,|the ends of many of the letters such as the t and e are hooked up in a vulgar and meaningless way,
```

I don't think sticking by a strict "every line should be the same length" would help at all with training the AR model. And in my experience, I've got decent results so far with base whisper blasting away giant blobs of audio.

It's just that whisperx has had egregious timestamp problems; I'm fairly sure it's been the (other) source of why I've been getting trash results after switching to it (the other being not matching unified_voice2 in DLAS closer to the autoregressive model in tortoise).

As long as clauses are intact, it shouldn't really matter.

Although, LibriTTS (what VALL-E was originally trained with) seems to be split by sentences wholly, and not just clauses, for what it's worth.


I'm actually looking at the transcriptions I've put on my huggingface repo for finetunes and, my God, I must have really been pressed for time, as they're either segmented way too much or not segmented enough. Yet, for James Sunderland at least, those two performed really well. So I'm not too sure if it's *fine* for the text itself to not be up to par, as long as the audio is relatively intact (James v2 has some trimming issues, however).

Contributor

I agree whisperX got segmentation wrong, which undoubtedly led to troubles. It did well on multi-lingual transcribing but that's beside the point.

What I'm wondering about is how to get segmentation *right*. It's surprising to hear that sentence by sentence is okay, since whisper has `condition_on_previous_text`, which according to them:

> condition_on_previous_text: bool
> if True, the previous output of the model is provided as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.

(It is merely used for transcribing, which I find weird.)
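
For reference, that flag is just a keyword argument on whisper's transcribe call; a minimal usage sketch (the file path is hypothetical):

```python
import whisper

model = whisper.load_model("medium")
# Disabling conditioning trades cross-window consistency for fewer repetition/timestamp failure loops.
result = model.transcribe("voice_samples/long_dialogue.wav", condition_on_previous_text=False)
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f} - {seg["end"]:7.2f}  {seg["text"]}')
```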

Human speech carries over sentence by sentence, just like it does token by token. I'm not saying every dialogue needs to be multiple sentences long; sometimes humans say one line and that's okay. Perhaps our ideal data needs to capture a varied sample of multi-line dialogue as well as single sentences. What I'm trying to figure out is what that ideal is. My impression is that single-sentence audio segments have more to do with whisper being focused on transcription and less with single sentences being best for latent generation (Edit: and model training ofc).

Author

As an aside, but it's part of my ongoing journey to clean up my training:

My outputs are now generating some fairly decent speech, even with very small data sets. However the output audio quality is a little... low quality? Basically kinda sounds a bit like old radio sometimes.

I tried voice fixer but weirdly that seemed to make it worse rather than better.

Is there an obvious way to improve the audio quality on a generated voice that I'm missing?

Contributor

I've been curious: voice samples often get put through a background noise removal process. The damage to the voice is typically inaudible or minimal to the human ear, but I wonder if it's substantial enough for the fine-tuning process to pick up on during training. Would it be beneficial to put the voice samples through some vocal regeneration model prior to sending them to tts for training? Of course, naturally clean audio would be ideal, if only we were so lucky.

Owner

I forgot I was too fried over the weekend to respond.

> What I'm wondering about is how to get segmentation right. It's surprising to hear that sentence by sentence is okay, since whisper has condition_on_previous_text which according to them

In the scope of segmenting for training, segmenting is only actually necessary if the main text is over 200 characters or the main audio is over the 11.6s requirement imposed by the training YAML (which I imagine is necessary due to how many mel tokens are within a sequence in the AR model).

> Human speech carries over sentence by sentence, just like it does token by token. I'm not saying every dialogue needs to be multi sentence long, sometimes humans say one line and that's okay. Perhaps our ideal data needs to capture a varied sample of multi line dialogue as well as single sentences. What I'm trying to figure out is what that ideal is. My impression is that single sentence audio segments has more to do with whisper being focused on transcription and less with single sentences

I imagine 2+ sentences per line is a challenge more so from having to keep it within the restrictions the model/training configuration imposes.

If your dataset specifically has a sentence where one depends on the other, then by all means, keep it in one file (as long as it doesn't exceed the limits).

> being best for latent generation

As for that, I ***imagine*** (I don't know for sure, but given how robust GPT-2 is, this is probably true) latents are only really calculated by running mel tokens through the AR and inferencing the latents within a token string.

I'm sure the original issue here was just from an oversight on my end, rather than the AR model itself being trained wrong (which is still a possibility given the over reliance on liberally timestamped segments).

> However the output audio quality is a little... low quality? Basically kinda sounds a bit like old radio sometimes.

You probably were still on the version before I fixed the generation settings not persisting (to keep Condition-Free enabled). I had some test generations that improved a lot after realizing it kept disabling itself.

> I've been curious, often voice samples get put through a background noise removal process. The damage to the voice is typically invisible or minimal to human ear, but I wonder if it's substantial enough for the fine tuning process to pick up on it during training. Would it be beneficial to put the voice samples through some vocal regeneration model prior to sending them to tts for training? Of course naturally clean audio would be ideal, if only we would be so lucky.

desu I'm not too sure if it matters all that much, given how heavily the waveform gets compressed down into a mel spectrogram.

Per [the tortoise design doc](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/):

> speech audio sampled at 22kHz is first downsampled 1024x by first encoding it as a MEL spectrogram (a 256x compression) and then reducing it a further 4x using a VQVAE specifically trained on MEL spectrograms.

I'm sure any (very) minor noise or imperfections will get filtered out by compression, and cleaned up from the diffusion/vocoder.
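
To put rough numbers on that compression, here's the arithmetic, assuming a 256-sample mel hop at 22,050 Hz (consistent with the quoted 256x figure) plus the 4x VQVAE step; it also lines up with the ~255,995-frame / ~11.6 s cap mentioned earlier:

```python
SAMPLE_RATE = 22_050      # Hz, per the design doc
MEL_HOP = 256             # assumed hop length matching the quoted 256x compression
VQVAE_FACTOR = 4          # the further 4x reduction on top of the mel

max_frames = 255_995                       # dataset loader cap from the training YAML
seconds = max_frames / SAMPLE_RATE         # ~11.6 s
mel_frames = max_frames // MEL_HOP         # ~1000 mel frames
mel_tokens = mel_frames // VQVAE_FACTOR    # ~250 mel/VQVAE tokens fed to the AR

print(f"{seconds:.2f} s -> {mel_frames} mel frames -> {mel_tokens} mel tokens (1024x overall)")
```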

My only doubt is that I don't have many datasets that do have background noise. The only one I can think of is Patrick Bateman's (and whisper doesn't like to trim the noise out for segments), but I don't think I had any perceptible quality problems at the end of the chain with inferencing.

However, """imperfections""" (and I say that loosely, I'd rather classify them as quirks) like with Silent Hill voices do get maintained.

Owner

Anyways, the last thing I need to mend is how I'm "leveraging" the dataset by making it actually leverage the dataset, rather than something like "I see you segmented your audio, so I'll make some loose assumptions about the maximum segment length to derive a common chunk size".


Aside from that, the general outline of preparing a dataset is (and I should add this to the wiki):

  • do not segment by default (you can always force it to by ticking the box, if you have the misfortune of having your voice samples all in one file); leverage the original sound files as much as possible to avoid needlessly segmenting.
  • run transcription; the output from whisper gets dumped to `whisper.json` for later reuse.
    • double check the transcription here if you plan on consistently regenerating your dataset text files
  • create your dataset file
    • here, validation is done on each line to ensure it's over 0.6s and under 11.6s, with text lengths under 200 characters (a quick sketch of this check follows the list)
    • if not using segments, and a piece of audio does exceed 11.6s, then segment it.
    • if a line or segment fails validation, ignore it; it can't be used for either training or validation
  • ***please, please, please*** double check both the transcription and the audio each line references in the dataset, especially if you're using segments. My Japanese dataset seemed to be *okay*, even for the lines that had to be segmented, but I imagine a half-second of inaccuracy is better than the entire end getting silently discarded when DLAS loads it.
    • if a segment is wrong, you can manually edit it in the `whisper.json`, then reslice
    • if something is transcribed weirdly, you can edit it in the `whisper.json` (both the overall text and its segment), then recreate the dataset.
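
As a sketch of the validation step above: filter whisper segments by the 0.6–11.6 s and 200-character limits and emit LJSpeech-style `audio|text` lines. The whisper.json layout and the clip naming here are assumptions for illustration; the repo's actual dataset writer may differ:

```python
import json
from pathlib import Path

MIN_SECS, MAX_SECS, MAX_CHARS = 0.6, 11.6, 200

def build_dataset_lines(whisper_json: str, audio_dir: str):
    """Return (kept, rejected) LJSpeech-style lines from a whisper result dict.

    Assumes {"segments": [{"start", "end", "text"}, ...]} and clips named
    <audio_dir>/<index>.wav; both are illustrative, not the repo's exact layout.
    """
    with open(whisper_json, encoding="utf-8") as f:
        segments = json.load(f)["segments"]

    kept, rejected = [], []
    for i, seg in enumerate(segments):
        duration = seg["end"] - seg["start"]
        text = seg["text"].strip()
        line = f"{Path(audio_dir) / f'{i:04d}.wav'}|{text}"
        if MIN_SECS <= duration <= MAX_SECS and len(text) <= MAX_CHARS:
            kept.append(line)
        else:
            rejected.append(line)  # needs manual re-segmenting, or gets dropped
    return kept, rejected
```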

A lot of it should be fairly hand-held, but the biggest point is to double check the end results (which I didn't do much of).

These sanity checks seemed to help my Japanese finetune actually sound almost as nice as it did originally, before I added all the bloat on top of "just transcribe these lines please and thanks".

However, while transcription will also handle the slicing (if requested) and the dataset creation, you can always re-run the other two steps (slicing and dataset creation) after the fact. They usually don't take long, so it's easy to just have them done automatically after transcription.


Just to clear up my understanding: The recommendation now is to not use all-in-one files and instead make sure the audio clips post-transcription are always under 11.6s?

Owner

> The recommendation now is to not use all-in-one files

Correct.

It's not the end of the world if you can't, but Whisper's timestamps aren't to be trusted as much as I had hoped.

> make sure the audio clips post-transcription are always under 11.6s?

Ideally you want them under 11.6s pre-transcription.

Again, it's not the end of the world if you can't. The web UI will handle implicitly segmenting anything too long (>11.6s, >200 characters), but as I keep stressing, double check how well they were segmented.

Anything that gets implicitly segmented, or outright thrown out, will be reported in the output area on the right, but I've yet to have anything egregiously decimated (or at least, not in a way that's outright noticeable).
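
If you want to know ahead of time what will get implicitly segmented, a small pre-flight check along these lines works, assuming your clips are plain WAV files (the folder path is hypothetical):

```python
import wave
from pathlib import Path

MAX_SECS = 11.6  # training ceiling discussed above

def clips_needing_segmentation(voice_dir: str):
    """List (path, duration) for WAV clips longer than the ceiling."""
    too_long = []
    for wav_path in sorted(Path(voice_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as w:
            duration = w.getnframes() / w.getframerate()
        if duration > MAX_SECS:
            too_long.append((str(wav_path), duration))
    return too_long

if __name__ == "__main__":
    # Hypothetical voice folder path.
    for path, secs in clips_needing_segmentation("./voices/myvoice"):
        print(f"{path}: {secs:.1f}s (will be implicitly segmented)")
```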

Owner

I figured I'll (ashamedly) mention it here, since this issue thread was the reason I removed it in the first place: I added WhisperX back. It won't be installed by default, to avoid any more dependency nightmares, and you need a HF token anyways to make the best use of it.

I gave it another shot and I admit I was wrong to axe it. It just took a little elbow grease from the rather rudimentary implementation to really see the uplifts it brought:

  • smarter loading of the alignment model made transcriptions slightly faster (it was naively being loaded on every transcription, rather than just loading it at the initial Whisper load)
  • VAD filtering + using a larger wav2vec2 alignment model gives nearly-great transcription timestamps
    • using `-0.05` and `0.05` for the start/end offsets literally fixes 99% of the remaining problems.
  • there are also bigger batch sizes when using the VAD filtering method, but I believe it benefits more from larger audio clips (I don't see much of an uplift from increasing it from 1 to 8 to 32).
  • I also have it using diarization, but it's disabled for now as it's a performance penalty, as 99% of datasets are single speakers, and you'll have to manually prune out the speakers you don't want.

I'll do additional sanity checks, but it seems with WhisperX, quality slicing isn't a crack dream.


It's not a big enough change to necessitate retraining finetunes, nor to retranscribe datasets, as the principle still applies: if you can have your audio clips trimmed right ahead of time, do so, don't rely on this. However, if you need to, it won't kick you in the balls trying to get accurate timestamps. It's as close to perfect as I can get it.

mrq closed this issue 2023-03-23 03:36:28 +00:00