One or two models, and their training? #13

Open
opened 2024-08-14 10:05:10 +00:00 by kepsilons · 1 comment

Apologies in advance if these are dumb questions! I'm trying to learn the codebase.

The vall-e paper suggests that the AR and NAR model are two separate models ("Both the AR model and the NAR model have the same transformer architecture...", section 5.1). However, my reading of the code seems to indicate that there is only one transformer-like backbone being created, and it's used in both AR and NAR models. Is this deliberate?

The paper also suggests that during training, the AR model is trained on all steps to "maximize the probability of the next token in the first codebook" (section 4.2.1), whereas for the NAR model, "in each training step, we randomly sample a training stage i ∈ [2, 8]" (section 4.2.2). My reading of the code seems to indicate that at each step, we randomly sample a training stage i ∈ [1, 8], such that the AR model is not trained on all steps. Is this also deliberate?
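To spell out the contrast I think I'm seeing (just a sketch with made-up names, not the actual code):

```python
import random

# Paper, as I read it: the AR is trained on codebook 1 at every step,
# while the NAR samples a stage i in [2, 8] each step.
def paper_stages():
    ar_stage = 1
    nar_stage = random.randint(2, 8)
    return ar_stage, nar_stage

# Code, as I read it: a single stage i in [1, 8] is sampled per step,
# so the AR (i == 1) only gets trained on roughly 1/8 of the steps.
def code_stage():
    return random.randint(1, 8)
```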
If the AR model is trained on all steps, and the NAR model is trained on a random stage per step, it might help solve the problem mentioned in your HF model card:

> Naturally, training it on a "next RVQ level is half as likely" distribution introduces some crust as the later RVQ levels are less accurate, introducing noise and artifacts.
> As a fix for the above, naively training it on equally distributed RVQ levels does lobotomize the AR.

Thanks a lot for your patience!

kepsilons changed title from One or two models, and their training sequence? to One or two models, and their training? 2024-08-14 10:05:32 +00:00
Owner

> Apologies in advance if these are dumb questions! I'm trying to learn the codebase.

No worries. I don't have a good outside perspective on how clean (or schizo) the code is.

> However, my reading of the code seems to indicate that there is only one transformer-like backbone being created, and it's used in both AR and NAR models. Is this deliberate?

Correct.

In my early experiments, I did train two models separately, and rigged the engine/trainer backend to be able to handle training both models at the "same" time (or at least feed them the same samples).

I don't recall what tipped me off to just use one model, but having just one model works a little too phenomenally. The theoretical downside is that you need to train longer for each "task" (AR/NAR), but the model can probably "share its knowledge" of each task, and that can help both.

In theory.
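As a rough mental model of what "one model for both" means (a sketch with made-up names and sizes, not the actual module layout), the backbone just gets told which RVQ level it's working on, and only the AR level gets a causal mask:

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    # sketch only: one transformer handles both the AR (level 0) and the NAR (levels 1..7)
    def __init__(self, n_tokens=1024, n_levels=8, d_model=1024):
        super().__init__()
        self.token_embs = nn.ModuleList([nn.Embedding(n_tokens, d_model) for _ in range(n_levels)])
        self.level_emb = nn.Embedding(n_levels, d_model)  # which RVQ level we're predicting
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=12)
        self.head = nn.Linear(d_model, n_tokens)

    def forward(self, codes, target_level):
        # codes: [batch, time, levels_so_far] of RVQ tokens we condition on
        x = sum(self.token_embs[l](codes[..., l]) for l in range(codes.shape[-1]))
        x = x + self.level_emb(torch.tensor(target_level, device=codes.device))
        mask = None
        if target_level == 0:
            # AR: additive causal mask, each position predicts the next level-0 token
            t = codes.shape[1]
            mask = torch.full((t, t), float("-inf"), device=codes.device).triu(diagonal=1)
        # NAR: no mask, every position predicts its token of the target level in parallel
        return self.head(self.blocks(x, mask=mask))
```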

> My reading of the code seems to indicate that at each step, we randomly sample a training stage i ∈ [1, 8], such that the AR model is not trained on all steps. Is this also deliberate?

I'll have to read the paper again just to be sure since it's been a while, but I imagine "all steps" just means the model is fed the entire sequence rather than a truncated one (all timesteps). I feel it's a bit weird to need to state that, since I can't think of a scenario where training a transformer benefits from not feeding it the entire sequence.

But I could be wrong.

> it might help solve the problem mentioned in your HF model card:
>
>> Naturally, training it on a "next RVQ level is half as likely" distribution introduces some crust as the later RVQ levels are less accurate, introducing noise and artifacts.
>> As a fix for the above, naively training it on equally distributed RVQ levels does lobotomize the AR.

desu that observation is muddied up by a bunch of other things, both from me throwing a bunch of other tweaks at the wall and from other factors that could be at play. It's actually a bit hard to untangle it all, even in my head.

Going back to "in theory", one idea I have is that asking a single model both to have each token predict the next one (AR) and to predict a whole batch of tokens in parallel (NAR) is quite a lot to task a model with. I've noticed this several times, between:

  • having the embeddings shared between the two (namely, reusing RVQ level 0 for the AR and NAR, they really do not like to be shared; see the sketch after this list)
  • trying to have the AR predict multiple steps at a time
    • I absolutely cannot get it to work no matter what context, yet in reality it should be possible because the NAR does that.
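What "not sharing them" would look like, roughly (a sketch with made-up names, not the actual code):

```python
import torch.nn as nn

n_tokens, d_model = 1024, 1024  # made-up sizes

# separate tables so the AR's level-0 tokens and the NAR's level-0 inputs
# don't share weights (sharing them is what didn't play nice)
ar_level0_emb = nn.Embedding(n_tokens, d_model)  # AR: level 0 only
nar_level_embs = nn.ModuleList([nn.Embedding(n_tokens, d_model) for _ in range(7)])
# NAR: input levels 0..6, consumed when predicting levels 1..7
```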

I think the weighting of how much of each RVQ level the model "sees" does influence how well it can do one task or the other, at least at the default model size. I think the default "the next level is half as likely" distribution just maps better to how each level contributes maybe half as much as the previous one to the final waveform.
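Concretely, the "next level is half as likely" weighting versus the naive uniform one is roughly this (a sketch, not the exact code):

```python
import random

N_LEVELS = 8  # RVQ levels 0..7: level 0 is the AR's, 1..7 are the NAR's

# "next RVQ level is half as likely": weights 1, 1/2, 1/4, ...
# so the AR (level 0) sees about half of all training steps
half_as_likely = [0.5 ** level for level in range(N_LEVELS)]

# the naive alternative that lobotomizes the AR: every level equally likely,
# so the AR only sees ~1/8 of the steps
uniform = [1.0] * N_LEVELS

def sample_level(weights):
    # pick which RVQ level this training step targets
    return random.choices(range(N_LEVELS), weights=weights, k=1)[0]
```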

Hope it makes sense. I'm having to try and pull out a bunch of crammed observations from what felt like a fever dream of experiments. A lot of it would probably be cleaned up if I did a clean training experiment (but I need the right headspace to do so).
