audio_embedding_sums performance #12
Hello! Thank you for this repository and corresponding checkpoints.
I'm new to vall-e (coming over from styletts2), and am trying to understand the paper and code a bit more. One thing the paper mentions about the NAR layers is that:
"The acoustic tokens from stage 1 to stage i − 1 are embedded and summed up as model input" (section 4.2.2)
Does this correspond to the code controlled by `audio_embedding_sums`? I noticed that `audio_embedding_sums=False` for your llama HF checkpoints. Did you find the summation to be unnecessary, or, worse, that it actually harms performance?
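For context, my understanding of the paper's "embed and sum" scheme is something like the following toy sketch (the names and shapes here are my own, not from this repo):

```python
import numpy as np

# Toy sketch of "embed and sum" NAR input (my own names/shapes, not the
# repo's code): one embedding table per residual codec level; the input for
# predicting level i sums the embeddings of tokens from levels 0..i-1.
rng = np.random.default_rng(0)
n_levels, vocab, d_model = 8, 1024, 16
tables = rng.standard_normal((n_levels, vocab, d_model))

def summed_input(codes: np.ndarray, level: int) -> np.ndarray:
    """codes: [n_levels, frames] token ids -> [frames, d_model] input."""
    return sum(tables[l][codes[l]] for l in range(level))

codes = rng.integers(0, vocab, size=(n_levels, 5))
x = summed_input(codes, level=3)  # input when predicting level 3
print(x.shape)  # (5, 16)
```

Is that roughly what the flag toggles?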
Correct.
I will warn, though, that how I handle the audio embeddings is a bit of a mess, as I have three separate classes for separate versions of things: `MultiEmbedding` was from the original enhuiz/vall-e I used as a starting point, `AudioEmbedding_Old` is what `ar+nar-retnet-8` uses, and `AudioEmbedding` is what everything else will use (and I think I had to try to maintain interoperability between them).

Honestly, I don't think I ever found a concrete answer.
I think my primary reason for not summing the audio embeddings was to perform loss calculation on the `prom` portion of the input sequence, which you can't really do if you're summing the audio embeddings (as they no longer directly map to a token).

An emergent side effect of doing so is that I can do `prom`-less inferencing and get a random voice for free. At least I think it emerged from that: I don't recall the `ar+nar-retnet-8` model being able to do so, and I think I remember the `ar+nar-llama-8` model being able to do so before I started tampering with other things.

I suppose I have enough GPU compute to spin up models that "work" well enough, quickly enough, that I could actually do the comparison; I just need to nudge myself to get around to it.
As a bit of a follow-up: I did sort of "cave" and swapped the model to use summed audio embeddings (through `ar+nar-tts+stt-llama-8`), and I feel like it helped reduce the crust and artifacting I would hear in what I believe is the NAR portion of the final audio. What's strange is that there weren't any issues in doing so: even just enabling it before some corrective post-training, the output was fine, and I still retain `prom`-less prompting, although it felt a little less stable.

However, I do need to do an apples-to-apples comparison to rule out it being some other tweak I've glued on (like an additional task helping to bolster the model overall). Just some observations I noticed a while back.