conditioning_length: 44000 is different to sample rate? #289
Reference: mrq/ai-voice-cloning#289
Why is that, and would quality improve if the two were made the same, either by using a 44000 sample rate or by dropping the conditioning length?
Changing the sample rate may not have any noticeable effect other than increasing training times; see the notes regarding RVQ bins in #152.
If I'm reading DLAS's code right, conditioning_length strictly governs the number of samples to poll for a similar voice clip during training. It is related to the sampling rate only insofar as the two together determine the duration of the clip (and the sample rate used is pretty much hard-baked into the TorToiSe stack).
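To illustrate what "number of samples to poll" means in practice, here is a minimal sketch of fitting a reference clip to a fixed sample count by padding or random cropping. This is not DLAS's code: the function name fit_to_length and the plain-list audio representation are hypothetical, and the pad-or-crop behaviour is an assumption about how a fixed-length reference clip is typically produced.

```python
import random

def fit_to_length(clip, target_len, rng=random):
    """Crop or zero-pad a 1-D list of samples to exactly target_len samples.

    Hypothetical helper for illustration; not taken from DLAS.
    """
    gap = len(clip) - target_len
    if gap < 0:
        # Clip is too short: pad the end with silence.
        return clip + [0.0] * (-gap)
    if gap > 0:
        # Clip is too long: take a random window of target_len samples.
        start = rng.randint(0, gap)  # randint is inclusive on both ends
        return clip[start:start + target_len]
    return clip

CONDITIONING_LENGTH = 44000  # value from the issue title, counted in samples

short_clip = fit_to_length([0.1] * 30000, CONDITIONING_LENGTH)
long_clip = fit_to_length([0.1] * 100000, CONDITIONING_LENGTH)
print(len(short_clip), len(long_clip))
```

Either way the result has exactly conditioning_length samples, so the value fixes a sample count rather than a sample rate.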
conditioning_length is used for getting "similar voice clips" (which I remember another issue mentioning might actually be broken?); see ./dlas/data/audio/unsupervised_audio_dataset.py#L50, which seems to govern how big of a reference clip to sample.
I imagine this translates to 3.991s of input audio to serve as the input / conditioning latents during training (conditioning_length / sample_rate * num_conditioning_candidates), as the analog to this when inferencing would be using 3.991s of input audio to work with for generating the conditioning latents.
I don't really have any empirical evidence on how much this will help training (or if it does at all), but I think the performance "tax" from doing so is negligible, since a similar approach is taken in the land of VALL-E (or my forked implementation, at least).
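As a quick check of the 3.991s figure, the formula conditioning_length / sample_rate * num_conditioning_candidates can be evaluated directly. The sample rate of 22050 and the candidate count of 2 are assumptions; they are not stated in this thread, but they are the values that reproduce 3.991s from a conditioning_length of 44000.

```python
def conditioning_seconds(conditioning_length, sample_rate, num_conditioning_candidates):
    """Total seconds of reference audio seen as conditioning input."""
    return conditioning_length / sample_rate * num_conditioning_candidates

# conditioning_length comes from the issue title; the other two values are assumed.
total = conditioning_seconds(
    conditioning_length=44000,
    sample_rate=22050,              # assumption
    num_conditioning_candidates=2,  # assumption
)
print(f"{total:.3f}s")  # prints "3.991s"
```

So 44000 is a count of samples, roughly two seconds per candidate at the assumed rate, not a second sample rate that needs to match the first.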
Again, this assumes it actually works, since I vaguely remember an issue being brought up that, under DLAS, this functionality doesn't work, or was assumed not to work.
(If I need to reclarify let me know, I feel this is a bit confusing since I don't think any of this was ever mentioned before for TorToiSe, either from me or in the other )