goodbye nvidia/audio-codec-44khz, crossed fingers for DAC again
The `nemo-smaller-44khz-llama-8` model is a 512-dim, 12-layer, 8-head attention-based transformer with rotary position embeddings. Training was performed on four V100s with AMP+`float16`, a batch size of 8 samples per GPU, and an AdamW optimizer with adequate parameters (`1.0e-4` learning rate, betas of `[0.8, 0.95]`, weight_decay of `0.01`, linear warmup to 5K steps before holding) for 400K steps before introducing training for duration prediction in parallel. The dataloader sorts the dataset by duration, starting from 2-second utterances and ending with 8-second ones. For the first 70% of the epoch (when speech starts to emerge), the loss is computed for one codebook level per sample (with the level randomly assigned per a "normal" distribution), and each level's loss is weighed "normal"ly as well; afterwards, the model was trained to compute the loss for all levels in parallel, without weighing the loss per level (a rough sketch of this scheme follows the notes below). Audio quality was lacking for most speakers, as the model failed to handle all codebook levels adequately. Additional training slowly helps, but by-the-numbers metrics don't show much improvement.

* this model also had ~~some~~ plenty of training on my 7900XTX rig under `bfloat16`, with similar hyperparameters (a batch size of 32 for one GPU, rather than 8 samples * 4 GPUs), as it ironically is at parity for throughput when utilizing `flash_(sdpa)` attention.
* it's reasonable to assume that a lot of the nitty-gritty, like LR warmup and slowly introducing features, is entirely unnecessary
* the model *may* benefit from setting the dataloader to a speaker-based one, so it can balance different speakers.

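For illustration only, here is a minimal sketch of the level-selection scheme described above, not the actual training code: one codebook level is drawn per sample from a clamped "normal"-ish distribution and its loss is weighed with a matching falloff, while the later parallel stage computes an unweighted loss over every level. The tensor layout, distribution parameters, and function names are assumptions for the sketch.

```python
# Minimal sketch of the per-level loss schedule described above; the exact
# distribution, weights, and tensor layout are illustrative assumptions.
import torch
import torch.nn.functional as F

N_LEVELS = 8  # codebook levels of the 44KHz codec used here

def sample_levels(batch_size: int, std: float = 2.0) -> torch.Tensor:
    """Draw one codebook level per sample from a clamped half-normal,
    biasing training towards the earlier (more important) levels."""
    levels = (torch.randn(batch_size) * std).abs().long()
    return levels.clamp(0, N_LEVELS - 1)

def level_weights(n_levels: int = N_LEVELS, std: float = 2.0) -> torch.Tensor:
    """Normal-ish falloff so later levels contribute less to the loss."""
    x = torch.arange(n_levels, dtype=torch.float32)
    w = torch.exp(-0.5 * (x / std) ** 2)
    return w / w.sum()

def codebook_loss(logits: torch.Tensor, targets: torch.Tensor, parallel: bool) -> torch.Tensor:
    """logits: [batch, levels, seq, vocab], targets: [batch, levels, seq]."""
    b, q, t, v = logits.shape
    if parallel:
        # later stage: every level contributes, unweighted
        return F.cross_entropy(logits.reshape(-1, v), targets.reshape(-1))
    # earlier stage: one randomly assigned level per sample, weighed per level
    levels = sample_levels(b).to(logits.device)
    weights = level_weights(q).to(logits.device)
    idx = torch.arange(b, device=logits.device)
    per_sample = F.cross_entropy(
        logits[idx, levels].reshape(-1, v),
        targets[idx, levels].reshape(-1),
        reduction="none",
    ).reshape(b, t).mean(dim=1)
    return (per_sample * weights[levels]).mean()
```
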
The `nemo-larger-44khz-llama-8` model is similar to its immediate predecessor, with 1024 dims, 24 layers, and 16 heads. Training is the same, with the only difference being a learning rate of `3.0e-4`. Speech emerged quicker than with its predecessor, at `?`% of the epoch, but quality remains about the same.

* increasing the de facto batch size and lowering the learning rate seem to be necessary to edge out improvements in speaker similarity (a generic gradient-accumulation sketch follows below)

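As a reference for the "increase the effective batch size" point, a generic gradient-accumulation loop is sketched here; it is not the project's trainer, and `model`, `optimizer`, and `dataloader` are stand-ins.

```python
# Generic sketch: raise the effective batch size via gradient accumulation.
# The training objects are placeholders, not the repo's actual trainer.
import torch
import torch.nn.functional as F

def train_accumulated(model, optimizer, dataloader, accum_steps: int = 4):
    """Effective batch size = per-step batch size * accum_steps (* GPU count)."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        logits = model(inputs)
        # scale so the accumulated gradient matches one large batch
        loss = F.cross_entropy(logits, targets) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```
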
Training of both models periodically experienced degradation in quality, where the loss would rise, spike, then climb back down. It's reasonable to assume duration sorting was the cause, as the model might somehow "overfit" on duration; the problem disappeared when re-initializing the dataloader to instead batch samples by duration, then shuffle the batches (see the sketch below). However, training throughput significantly dropped for the larger model.

* Training should *probably* only have the dataloader duration-ordered until speech does emerge, then train an epoch with shuffled durations. Both models do seem to start overfitting on given durations, and it is a pain to try and train on larger durations (I do not remember the prior implementation having this behavior emerge).

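A minimal sketch of that "batch by duration, then shuffle the batches" ordering, assuming a plain list of per-utterance durations; the function and its arguments are made up for illustration.

```python
# Sketch of batching by duration and then shuffling the batch order, so each
# batch still holds similar lengths but the short-to-long curriculum is broken.
import random

def duration_bucketed_batches(durations, batch_size, shuffle_batches=True):
    # sort sample indices by duration so every batch holds similar lengths
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle_batches:
        random.shuffle(batches)  # avoids "overfitting" on a strict duration order
    return batches
```
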
The differences between the two models start to emerge in how well each can generalize. The smaller model seems to have trouble handling a variety of speakers and has no inherent way of inferencing duration, while the larger model reaches its "capacity" much, much later in training.

Both flavors were trained on the previously used dataset, but with English-only utterances until speech was quasi-consistent.

* Additional languages and the remaining 8-to-12-second utterances were then re-introduced into the dataset. Non-English performance needs to be evaluated, but it seems *fine*.

Additional tasks beyond text-to-speech (such as `ns`, `sr`, `stt`) were not trained for either model, as they're very low priority, and the implementation might have had the logic to train for them gutted.

### Experimental Settings
## Benefits and Caveats
To be evaluated thoroughly.
* The smaller model seems to have hit its capacity limit, while the larger model is slowly improving (although objective metrics are not noted).
* The model seems pretty quick, even for the large model.
* The smaller model seems small enough for CPU-only inferencing.
* Despite its poor zero-shot performance, it could be perfectly fine for finetuning.

At a glance, compared to the prior model setup, this implementation allows the model to better represent speech, as it's able to see the entire signal and account for it in its latent space, rather than only operate on specific levels of it at a time (an illustrative sketch of this input/output scheme follows below).

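For illustration only, a toy module showing what "seeing the entire signal" can look like: per-level codebook embeddings are summed into one token on the input side, and every level is predicted in parallel per timestep on the output side. The class name, sizes, and layout are assumptions, not the actual implementation.

```python
# Toy sketch: fold all codebook levels into one input token and predict every
# level per timestep; names and dimensions are illustrative, not the real code.
import torch
import torch.nn as nn

class ParallelCodebookIO(nn.Module):
    def __init__(self, n_levels: int = 8, codebook_size: int = 1000, d_model: int = 1024):
        super().__init__()
        # one embedding table per codebook level; their sum forms the input token
        self.embs = nn.ModuleList(
            [nn.Embedding(codebook_size, d_model) for _ in range(n_levels)]
        )
        # one classification head per codebook level for the output
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, codebook_size) for _ in range(n_levels)]
        )

    def embed(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: [batch, n_levels, seq_len] -> [batch, seq_len, d_model]
        return sum(emb(codes[:, level]) for level, emb in enumerate(self.embs))

    def predict(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq_len, d_model] -> [batch, n_levels, seq_len, codebook_size]
        return torch.stack([head(hidden) for head in self.heads], dim=1)
```
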
Additionally, this implementation paves the way for a ton of neat features, such as:
* live playback through autoregressive inferencing, as all codebooks are predicted for each step
* could also be "mocked" by doing NAR-len demasking in chunks, although this may require sliding window attention (a rough sketch follows this list)
* inherent audio upscaling, as the model is trained on a 44KHz codec
* some other features I can't recall
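A very rough sketch of the chunked "mocked" live playback idea from the list above; `demask_chunk` and `decode_audio` are hypothetical callables standing in for the model's NAR-len demasking step and the codec decoder, since neither interface is defined here.

```python
# Hypothetical chunked playback loop: demask a fixed number of frames at a
# time, decode them, and hand the audio off as soon as each chunk is ready.
# `demask_chunk(prefix, n_frames)` and `decode_audio(codes)` are stand-ins.
import torch

def stream_chunks(demask_chunk, decode_audio, total_frames: int, chunk_frames: int = 150):
    prefix = None  # codes generated so far: [n_levels, frames] or None
    for start in range(0, total_frames, chunk_frames):
        n_frames = min(chunk_frames, total_frames - start)
        # condition on what has already been generated; a sliding attention
        # window would keep the cost of long prefixes bounded
        codes = demask_chunk(prefix, n_frames)
        yield decode_audio(codes)
        prefix = codes if prefix is None else torch.cat([prefix, codes], dim=-1)
```
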
However, output leaves a lot to be desired:
* despite what the literature suggests, an FSQ codec seems heavily disfavored by the current approach
* each codebook's importance is effectively dependent on the speaker itself, so even having priority be a "learned" parameter is tough
* if DAC succeeds where `nvidia/audio-codec-44khz` failed, then this would prove it's a codec problem
* if DAC does not, then it's simply a 44KHz problem, where the codebooks are too saturated with information for the model to properly utilize
* although I doubt this is true, as the model still performs fine for a subset of the speakers trained against
* if Encodec fails too, then this implementation has an inherent flaw
* both the small and the large model seemed to have hit a "capacity" limit
* the "confidence" problem of the prior implementation seems to have emerged even for typical speakers
* some other quirks and emergent behaviors inherent to the model that I'm not aware of / can't recall