audio_embedding_sums performance #12

Open
opened 2024-08-14 08:25:20 +00:00 by kepsilons · 2 comments

Hello! Thank you for this repository and corresponding checkpoints.

I'm new to vall-e (coming over from styletts2), and am trying to understand the paper and code a bit more. One thing the paper mentions about the NAR layers is that:
"The acoustic tokens from stage 1 to stage i − 1 are embedded and summed up as model input" (section 4.2.2)

Does this correspond to the code controlled by `audio_embedding_sums`? I noticed that `audio_embedding_sums=False` for your llama HF checkpoints - did you find the summation to be unnecessary, or worse - actually harm performance?
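For anyone else mapping the paper's wording to code: below is a minimal NumPy sketch of what "embedded and summed up" describes - each RVQ level gets its own embedding table, and the token embeddings for levels 1 through i − 1 are summed into one vector per frame. This is a hypothetical illustration, not the repo's actual `AudioEmbedding` classes; all names here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch (not the repo's actual classes): one embedding
# table per RVQ level, with the per-level token embeddings summed into
# a single input vector per frame.
def summed_audio_embedding(codes, tables):
    # codes:  [time, levels] acoustic tokens from the prior stages
    # tables: one [n_tokens, d_model] embedding table per level
    return sum(tables[lvl][codes[:, lvl]] for lvl in range(codes.shape[1]))

n_tokens, d_model = 1024, 512
tables = [rng.standard_normal((n_tokens, d_model)) for _ in range(8)]
codes = rng.integers(0, n_tokens, size=(50, 3))  # e.g. the first 3 levels
x = summed_audio_embedding(codes, tables)
print(x.shape)  # (50, 512)
```

The alternative (non-summed) path would instead embed only a single level's tokens per position, which is what makes the per-token loss discussed below possible.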

Owner

> Does this correspond to the code controlled by `audio_embedding_sums`?

Correct.

I will warn, though, that how I handle the audio embeddings is a bit of a mess, as I have three separate classes for separate versions of things (`MultiEmbedding` comes from the original [enhuiz/vall-e](https://github.com/enhuiz/vall-e) I used as a starting point, `AudioEmbedding_Old` is what `ar+nar-retnet-8` uses, and `AudioEmbedding` is what everything else uses), and I had to try to maintain interoperability between them.

> did you find the summation to be unnecessary, or worse - actually harm performance?

Honestly, I don't think I ever found a concrete answer.

I think my primary reason for not summing the audio embeddings was to be able to perform loss calculation on the `prom` portion of the input sequence; you can't really do that if you're summing the audio embeddings, as they no longer directly map to a single token.
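To illustrate that point with a small NumPy sketch (hypothetical, not the repo's code): without summing, each input position embeds exactly one token, so a cross-entropy target exists for every `prom` position; with summing, each position's vector mixes several tokens, so no single target token maps back to it.

```python
import numpy as np

rng = np.random.default_rng(0)
table = rng.standard_normal((1024, 512))  # illustrative embedding table

# Non-summed: each frame embeds exactly one token, so every position in
# the prom span still has a single token target for a cross-entropy loss.
prom = rng.integers(0, 1024, size=50)        # one token per frame
x_single = table[prom]                       # [50, 512]

# Summed: each frame's input vector mixes tokens from several quantizer
# levels, so no one token maps back to a given input position.
codes = rng.integers(0, 1024, size=(50, 3))  # three tokens per frame
x_summed = table[codes].sum(axis=1)          # [50, 512]
```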

An emergent side effect of doing so is that I can do `prom`-less inferencing and get a random voice, for free. At least, I think it emerged from that: I don't recall the `ar+nar-retnet-8` model being able to do so, and I *think* the `ar+nar-llama-8` model could before I started tampering more with other things.

I suppose I have enough GPU compute to spin out models that "work" well enough, quickly enough, to actually do the comparison; I just need to nudge myself to get around to it.

Owner

As a bit of a follow-up, I did sort of "cave" and swapped the model to use summed audio embeddings (through `ar+nar-tts+stt-llama-8`), and I *feel* like it helped reduce the crust and artifacting I would hear in what-I-believe-is-the-NAR portion of the final audio. What's strange is that there weren't any issues when doing so: even just enabling it before doing some correctional post-training, the output was fine, *and* I still retain `prom`-less prompting, although it felt a little less stable.

However, I do need to do an apples-to-apples comparison to confirm it isn't just some other tweak I've glued on (like an additional task helping to bolster the model overall). These are just some observations I noticed a while back.

Reference: mrq/vall-e#12