conditioning_length: 44000 is different to sample rate? #289

Open
opened 2023-07-02 08:28:06 +07:00 by gforce · 2 comments

Why is that, and would it improve quality if the two were made the same, either by using a 44000 sample rate or by dropping the conditioning length to match?

Changing the sample rate may not have any noticeable effect other than increasing training times; see the notes regarding RVQ bins in #152.

If I'm reading DLAS's code right, this strictly governs the number of audio samples to pull for a similar voice clip during training. It's only related to the sampling rate insofar as it determines the duration of the clip (and the sample rate used is pretty much hard-baked into the TorToiSe stack).

  • [./dlas/data/audio/paired_voice_audio_dataset.py#L225](https://git.ecker.tech/mrq/DL-Art-School/src/branch/master/dlas/data/audio/paired_voice_audio_dataset.py#L225) references the `conditioning_length`, which is used for getting "similar voice clips" (which I remember another issue mentioning might actually be broken?), and in turn references [./dlas/data/audio/unsupervised_audio_dataset.py#L50](https://git.ecker.tech/mrq/DL-Art-School/src/branch/master/dlas/data/audio/unsupervised_audio_dataset.py#L50), which seems to govern how big of a reference clip to sample.
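
To illustrate, here's a minimal sketch of what I understand that sampling step to be doing; the function name and signature are hypothetical, not the actual DLAS code:

```python
import torch
import torch.nn.functional as F

def sample_conditioning_clip(wav: torch.Tensor, conditioning_length: int = 44000) -> torch.Tensor:
    """Randomly crop `conditioning_length` raw samples from a reference waveform.

    `wav` is a 1D tensor of audio samples. Hypothetical sketch of the cropping
    I believe unsupervised_audio_dataset.py performs, not the actual DLAS code.
    """
    if wav.shape[-1] <= conditioning_length:
        # Short clips get right-padded with silence so the output size is fixed.
        return F.pad(wav, (0, conditioning_length - wav.shape[-1]))
    # Otherwise pick a random window of exactly `conditioning_length` samples.
    start = torch.randint(0, wav.shape[-1] - conditioning_length, (1,)).item()
    return wav[start:start + conditioning_length]
```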

I imagine this translates to 3.991s of input audio to serve as the input / conditioning latents during training (`conditioning_length` / `sample_rate` * `num_conditioning_candidates`), as the analog to this when inferencing would be using 3.991s of input audio to work with for generating the conditioning latents.

  • Now, I'm *very* sure this number isn't actually reflected in any of the TorToiSe code, as the window for the AR latents is 132300 samples (yielding 6 seconds at 22050Hz), and the window for the diffusion latents is 102400 samples (yielding 4.644 seconds at 22050Hz).
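
For reference, the unit conversions behind the 3.991s figure above and those window sizes; `num_conditioning_candidates = 2` is my assumption, since it's what makes the 3.991s figure work out:

```python
SAMPLE_RATE = 22050  # TorToiSe's working sample rate

conditioning_length = 44000
num_conditioning_candidates = 2  # assumed value, inferred from the 3.991s figure

print(conditioning_length / SAMPLE_RATE)                                # ~1.995s per candidate
print(conditioning_length / SAMPLE_RATE * num_conditioning_candidates)  # ~3.991s total
print(132300 / SAMPLE_RATE)                                             # 6.0s    (AR latent window)
print(102400 / SAMPLE_RATE)                                             # ~4.644s (diffusion latent window)
```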

I don't really have any empirical evidence on how much this will help training (or if it does at all), but I think the performance "tax" from doing so is negligible, since a similar approach is taken in the land of VALL-E (or my forked implementation, at least).


Again, assuming it actually does work, since I vaguely remember an issue being brought up that, under DLAS, this functionality actually doesn't work, or was assumed to not work.

(If I need to reclarify, let me know; I feel this is a bit confusing, since I don't think *any* of this was ever mentioned before for TorToiSe, either from me or in the other )
