re-added noise dataloader sampler whatever for the old implementation's other tasks that require it

This commit is contained in:
mrq 2025-03-28 15:07:06 -05:00
parent 90b3509404
commit 6ae282e090
5 changed files with 47 additions and 13 deletions

View File

@ -1,6 +1,6 @@
# Model V2 Notes
This section aims to document the `_v2` class of models. Documentation here might be all over the place from having to extract findings from four weeks worth of agonizing experiments.
This section aims to document the `_v2` class of models. Documentation here might be all over the place from having to extract findings from several weeks' worth of agonizing experiments and quirks.
Unlike the original, this implementation strives to operate on *all* codebooks at once with a full 44KHz bandwidth, rather than requiring the model to operate on one codebook level at a time at 24KHz audio.
@ -90,16 +90,18 @@ However, this modality was not trained for either models, as there seems to be s
## Training Regimen
The `nemo-smaller-44khz-llama-8` model is a 512-dim, 12-layer, 8-head attention-based transformer with rotary position embeddings. Training was performed on four V100s with AMP+`float16`, a batch size of 8 samples per GPU, and an AdamW optimizer with adequate parameters (`1.0e-4` learning rate, betas of `[0.8, 0.95]`, `weight_decay` of `0.01`, linear warmup to 5K steps before holding) for 400K steps before introducing duration-prediction training in parallel. The dataloader sorts the dataset by duration, starting from 2-second utterances and ending with 8-second ones. For the first 70% of the epoch (when speech starts to emerge), training computes the loss for one codebook level at a time (a level is randomly assigned to each sample per a "normal" distribution), with each level's loss weighed per that same distribution. Afterwards, the model was trained to compute the loss in parallel (all levels have their loss computed) without weighing the loss per level. Audio quality was lacking for most speakers, as the model failed to handle all codebook levels adequately. Additional training slowly helps, but by-the-numbers metrics don't show much improvement. A sketch of the optimizer setup follows the notes below.
* this model also had some training on my 7900XTX rig under `bfloat16`, with similar hyperparameters (a batch size of 32 for one GPU, rather than 8 samples * 4 GPUs ), as it ironically is at parity when utilizing `flash_(sdpa)` attention.
* this model also had ~~some~~ plenty of training on my 7900XTX rig under `bfloat16`, with similar hyperparameters (a batch size of 32 on one GPU, rather than 8 samples * 4 GPUs), as it is ironically at parity in throughput when utilizing `flash_(sdpa)` attention.
* it's reasonable to assume that a lot of the nitty-gritty, like LR warmup and slowly introducing features, is entirely unnecessary
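For concreteness, a minimal sketch of the optimizer and warmup schedule described above, in plain PyTorch; the stand-in `model` and the exact scheduler shape are assumptions, and only the hyperparameters come from these notes:

```python
import torch

# stand-in module purely so the snippet runs; the real model is the transformer described above
model = torch.nn.Linear(512, 512)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1.0e-4,
    betas=(0.8, 0.95),
    weight_decay=0.01,
)

# linear warmup over the first 5K steps, then hold at the base learning rate
warmup_steps = 5_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)
```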
The `nemo-larger-44khz-llama-8` model is similar to its immediate predecessor, but with 1024-dim, 24 layers, and 16 heads. Training is similar, with the only difference being a learning rate of `3.0e-4`. Speech emerged quicker than in its predecessor, at `?`% of the epoch, but quality remains about the same.
* increasing the de-facto batch size and lowering the learning rate seem to be necessary to edge out improvements in speaker similarity (see the gradient-accumulation sketch below)
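A hedged sketch of what "increasing the de-facto batch size" through gradient accumulation looks like; the stand-in model, data, and accumulation count are purely illustrative and are not the repo's trainer:

```python
import torch

# stand-ins purely for illustration; the real model/optimizer/loss live in the trainer
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=3.0e-4)
batches = [torch.randn(8, 512) for _ in range(16)]

accum_steps = 4  # effective batch size = per-step batch size * accum_steps
optimizer.zero_grad()
for step, batch in enumerate(batches):
    loss = model(batch).pow(2).mean() / accum_steps  # placeholder loss, averaged over the window
    loss.backward()                                  # gradients accumulate across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```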
Training of both models experienced periodic degradation in quality, where the loss would rise, spike, then climb back down. It's reasonable to assume duration sorting was the cause, as the model might somehow "overfit" on duration; the problem disappeared when re-initializing the dataloader to instead batch samples by duration, then shuffle the batches (sketched below). However, training throughput dropped significantly for the larger model.
* Training should *probably* only have the dataloader duration-ordered until speech emerges, then train an epoch with shuffled durations. Both models do seem to start overfitting on given durations, and it is a pain to try and train on longer durations (I do not remember the prior implementation exhibiting this behavior).
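A rough sketch of the "batch by duration, then shuffle the batches" approach; the sample/duration representation here is an assumption and this is not the repo's actual sampler:

```python
import random

def duration_bucketed_batches(samples, batch_size):
    """`samples` is a list of (path, duration_seconds) tuples."""
    ordered = sorted(samples, key=lambda s: s[1])  # group similar durations together
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)  # but do not feed the batches in duration order
    return batches
```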
The differences between the two models ~~suggest there are no outright immediate benefits from scaling up, as it "costs" more to train the larger model. Benefits may be discovered through manual evaluation, which is kind of predicated on the duration predictor (which wasn't added until much later into training out of neglect).~~ start to emerge in how each model can generalize. The smaller model seems to have trouble handling both a variety of durations *and* speakers, while the larger model is starting to behave as expected in comparison to the prior model (where speaker similarity starts to improve with more and more training time *and* increasing the effective batch size through gradient accumulation).
The differences between the two models ~~suggest there are no outright immediate benefits from scaling up, as it "costs" more to train the larger model. Benefits may be discovered through manual evaluation, which is kind of predicated on the duration predictor (which wasn't added until much later into training out of neglect).~~ start to emerge in how each model can generalize. The smaller model seems to have trouble handling a variety of speakers and has no inherent way of inferencing duration, while the larger model is starting to behave as expected in comparison to the prior model (where speaker similarity starts to improve with more and more training time *and* increasing the effective batch size through gradient accumulation).
Both flavors were trained on the previously used dataset, but English only (as I did not want to risk throwing in multiple languages during the initial training session, and my patience was dwindling during the audio processing phase).
Both flavors were trained on the previously used dataset, but with English-only utterances until speech was quasi-consistent.
* Additional languages and the remaining 8-to-12-second utterances were re-introduced into the dataset. Non-English performance still needs to be evaluated, but it seems *fine*.
Additional tasks beyond text-to-speech (`tts`) were not trained for either model, as they're very low priority, and the implementation might have had the logic to train for them gutted.
@ -116,7 +118,7 @@ audio_level_loss_factors: "normal" # distribution of loss weights per codebook (
masking_train_p: 1.0 # pure AR
masking_ratio: 0.8 # fixed mask ratio proves to be better
ignore_inputs_for_loss: True # False is not implemented
use_segmented_attention_mask: True # restricts each section within its own section + prior section (in other words, does not let the text/prom see further into the future outside of its segment)
use_segmented_attention_mask: True # restricts each section within its own section + prior section (in other words, does not let the text/prom see further into the future outside of its segment), also enables parallel duration training
use_streamlined_calc_loss: True # False has no effect now
len_loss_factor: 0.0001 # start with the default for a while to not let duration training overpower the model, then gradually increase this (but this may only be required when introducing duration training on existing weights)
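To illustrate what `use_segmented_attention_mask` describes, here is a minimal sketch (an assumption for illustration, not the repo's implementation) of a mask in which each segment may attend to itself and to earlier segments, but never to later ones:

```python
import torch

def segmented_attention_mask(segment_lens: list[int]) -> torch.Tensor:
    """Boolean (L, L) mask where entry [i, j] is True if position i may attend to position j."""
    seg_ids = torch.cat([
        torch.full((n,), i, dtype=torch.long) for i, n in enumerate(segment_lens)
    ])
    # i may attend to j iff j's segment is not later than i's segment
    return seg_ids.unsqueeze(1) >= seg_ids.unsqueeze(0)

# e.g. text=3 tokens, prom=2, resp=4 -> a 9x9 mask, block lower-triangular at segment granularity
mask = segmented_attention_mask([3, 2, 4])
```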
@ -141,6 +143,7 @@ These settings should be avoided:
* however, this seems to be a detriment to the model; I imagine it's because the model could rely on how something sounds earlier on, even if there shouldn't be a direct causal relationship
* this could be something that might need to be trained from the very beginning rather than early on, but training existing models does not seem to fare well
* `nemo-smaller-llama-8` seemed to have degraded far more than `nemo-larger-llama-8` did. I suppose the head count / size might matter.
* this could also have been caused by a regression in the code due to dumb Python aliasing behaviors
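The sort of aliasing pitfall being alluded to (an illustrative guess, not the actual regression) looks like this:

```python
# "copying" a list of per-level settings only copies the reference,
# so mutating one name silently mutates the other
defaults = [1.0] * 8
per_level = defaults          # alias, not a copy
per_level[0] = 0.5
assert defaults[0] == 0.5     # the "untouched" defaults changed too

per_level = list(defaults)    # an actual copy avoids this
```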
## Benefits and Caveats
@ -157,3 +160,12 @@ Additionally, this implementation paves the way a ton of neat features, such as:
* could also be "mocked" by doing NAR-len demasking in chunks
* inherent audio upscaling, as the model is trained on a 44KHz codec
* some other features I can't recall
However, I'm not sure whether the problem is inherent to the model or lies within the codec:
* the output leaves a lot to be desired when compared to the prior reference model, but that model is too radically different for the comparison to be fair
* training a new model is required for a proper comparison, but that requires compute that could instead go into improving the current model
* some speakers sound fine, while others produce output that suggests a quantization/precision problem, with some form of bandwidth limiting in the output
* speaker similarity is improving, but still rather poor
* but again, this is simply because the prior model had a ton of training applied to it, despite the various features glued on top of it and post-trained afterwards
* the prior model is still much faster to inference, although this could just be a difference in model size
* an RVQ codec is heavily favored by the prior implementation, as the most important level gets the best training, and prior levels are easily inferenced

View File

@ -453,6 +453,13 @@ class Model:
return 512
return 1024
@property
def embed_dim(self):
if isinstance(self.size, dict) and "embed_dim" in self.size:
return self.size['embed_dim']
return self.dim
@property
def heads(self):
if isinstance(self.size, dict) and "heads" in self.size:

View File

@ -899,10 +899,9 @@ class Dataset(_Dataset):
# just interleave
self.paths = [*_interleaved_reorder(self.paths, lambda x: x[0])]
"""
self.noise_paths = _load_paths(cfg.dataset.noise, "noise")
self.noise_paths = list(itertools.chain.from_iterable(self.noise_paths.values()))
"""
self.noise_metadata = _load_dataset_metadata(cfg.dataset.noise, "noise", dataset_hash_key=self.dataset_hash_key)
self.noise_speakers = list(self.noise_metadata.keys())
self.noise_paths = [ (speaker_id, utterance_id) for speaker_id, speaker in enumerate(self.noise_speakers) for utterance_id, utterance in enumerate(self.noise_metadata[speaker].keys()) ]
self.phone_symmap = phone_symmap or self._get_phone_symmap()
self.speaker_symmap = self._get_speaker_symmap()
@ -1042,7 +1041,12 @@ class Dataset(_Dataset):
return get_task_symmap()
def sample_noise(self):
path = random.choice(self.noise_paths)
speaker_id, utterance_id = random.choice(self.noise_paths)
speaker_name = self.noise_speakers[speaker_id]
utterance_name = list(self.noise_metadata[speaker_name].keys())[utterance_id]
path = cfg.data_dir / speaker_name / utterance_name
if cfg.dataset.use_hdf5:
key = _get_hdf5_path(path)

View File

@ -363,14 +363,20 @@ class Attention(nn.Module):
if attention_mask is not None:
x_mask = x_mask[:, :, :, : key_states.shape[-2]]
# pain
# SDPA backends only sometimes allow/disallow certain arguments...
if isinstance( is_causal, list ):
count = sum( is_causal )
count = sum( [ 1 if x else 0 for x in is_causal ] )
if count == 0:
is_causal = False
elif count == len( is_causal ):
is_causal = True
elif x_mask is not None:
is_causal = False
if self.attn_mode in [torch.nn.attention.SDPBackend.FLASH_ATTENTION] or is_causal:
if self.attn_mode in [torch.nn.attention.SDPBackend.FLASH_ATTENTION]:
x_mask = None
elif is_causal == True:
x_mask = None
# SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
@ -513,8 +519,10 @@ class DecoderLayer(nn.Module):
hidden_states = self.input_layernorm(hidden_states)
# ugh
"""
if isinstance( is_causal, list ) and len(is_causal) == 1:
is_causal = is_causal[0]
"""
# Self Attention
if self.config.attn_mode == "sparse":

View File

@ -1289,6 +1289,7 @@ class Base_V2(nn.Module):
scores = None
entropy = None
causal = False
if prev_list is not None:
seq_lens = [ prev.shape[0] for prev in prev_list ]
@ -1296,6 +1297,7 @@ class Base_V2(nn.Module):
seq_lens = len_list
elif self.causal:
seq_lens = [ self.causal_size for _ in range( batch_size) ]
causal = True
logits = [ logit[..., -l:, :] for l, logit in zip(seq_lens, logits) ]
@ -1321,9 +1323,10 @@ class Base_V2(nn.Module):
else:
res = [ Categorical(logits=logit / temperature).sample() for logit in logits ]
# we only need the scores for NAR demasking, but AR breaks and I cannot be assed to handle it right now
scores = [
torch.gather(prob, 2, tokens.unsqueeze(-1)).squeeze(-1)
for prob, tokens in zip(probabilities, res)
]
] if not causal else []
return Sampled(res, logits, scores, entropy)