Finetuning diffusion model #421

Open
opened 2023-10-20 10:06:03 +07:00 by hippobek · 1 comments

Thank you for all the work you have done!

You haven't provided code for finetuning the diffusion model (GPT latent -> mel). Would there be any benefit in finetuning the diffusion model?

I have noticed that GPT finetuning clones the speaking style, but not the voice itself. How can we achieve voice cloning as well? I was thinking of doing this through diffusion model finetuning.


It's a mixed bag.

[152334H/DL-Art-School](https://github.com/152334H/DL-Art-School#training-the-diffusion-model-wip) *should* have diffusion finetuning covered. If I remember right, it should also be at parity with my DLAS fork. If you insist on using my fork instead, you can refer to how it generates the training YAML for the diffusion model. I just don't know if things are set up to finetune the diffusion model for TorToiSe.

On the other hand, I don't think it's worth the trouble of finetuning the diffusion model, as in theory it only really handles converting mel tokens into an actual mel spectrogram, and it's already decent enough at it.

On the other, other hand, again in theory, the diffusion model *can* help fill in the gaps of replicating a speaker's "acoustics" (for lack of a better analog from VALL-E) that the AR and its latents cannot replicate, but you'll still be restricted to the diffusion latents and how well those latents can capture a speaker's "acoustics" (if they capture that at all). I think it can work, since the AR latents can already re-adapt to the finetuned voice, and can even allow voice mixing.

However, the above is just conjecture. I don't have anything empirical on how much the diffusion model could actually help, as finetuning the AR is usually enough for whatever you need, sans the niche voices that have effects applied over them.
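For context, here's roughly where those two sets of latents show up at inference time with stock TorToiSe (a minimal sketch, assuming the upstream tortoise-tts package; the reference clip paths are placeholders):

```python
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

# A few seconds of reference audio per clip, loaded at 22.05 kHz.
clips = [load_audio(p, 22050) for p in ["voices/ref1.wav", "voices/ref2.wav"]]

# Returns (autoregressive_latent, diffusion_latent); the second tensor is
# what the diffusion model conditions on, i.e. the speaker "acoustics"
# discussed above.
ar_latent, diff_latent = tts.get_conditioning_latents(clips)

# Naive voice mixing (purely illustrative): average latents from two voices.
# ar_b, diff_b = tts.get_conditioning_latents(other_clips)
# ar_latent, diff_latent = (ar_latent + ar_b) / 2, (diff_latent + diff_b) / 2

gen = tts.tts_with_preset(
    "This is a test.",
    conditioning_latents=(ar_latent, diff_latent),
    preset="fast",
)
```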
