Discussion about Multi-Speaker dataset fine tuning. #146

Open
opened 2023-03-17 12:14:49 +07:00 by Brugio96 · 6 comments

Hi guys.
First of all, thanks to mrq for his amazing work on this project.
I've been using this model and would like to explore all its possibilities. I'm planning to open a few issues to discuss three different topics, and I'm hoping to gather information and insights from everyone who is fine-tuning the model, in order to create a thread that can be useful to people who are working with Tortoise.

The topics I'd like to discuss are as follows:

  1. **Fine Tuning on a Different Language**: I know that there have been some discussions on this topic, but I believe it would be helpful to have a dedicated thread to share procedures and findings.

  2. **Multi-Lingual Tortoise**: I'm curious whether a single Tortoise model can generate multiple languages, and possibly even be cross-lingual (e.g. using an English speaker's voice to speak Italian).

  3. **Fine Tuning on a Multi-Speaker Dataset**: this is the topic I would like to discuss in this issue.

I'm planning to open a discussion on these topics, but I'm having trouble finding the appropriate section. If this is not the right approach, please let me know, and I will close the issue.

Let's begin with **fine tuning on a multi-speaker dataset**.

The base Tortoise model is a zero-shot multi-speaker model that changes the speaker identity of the generated speech based on the conditioning voice latents (reference audios). However, the problem with the base model is that it tends to Americanize/Britishize non-native speakers, or speakers who were not well represented in the training set.

To solve this issue, I fine-tuned the model on the speaker I wanted to clone, which worked well for that speaker. However, I discovered that the fine-tuned model lost all the multi-speaker capabilities it had. No matter which speaker (reference audios) I used, the model during inference always generated the voice of the speaker I had fine-tuned on.

So my question is, is it possible to retain the multi-speaker capabilities of the model? My guess is that we would need to add the new speakers' datasets to the entire original training dataset (which is not available) and fine-tune the Tortoise model on this multi-speaker dataset.

I'm curious if anyone has tried fine-tuning on a multi-speaker dataset, such as VCTK, to see if the model can successfully retain all the speakers it has seen during fine-tuning.
Also, what is the right approach? I read on the other repo that the model should be able to discriminate between the different speakers on its own, and that because of this there's no need to separate the dataset into a directory per speaker.
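
To make the "everything in one folder" idea concrete, here is a rough sketch of how I would merge several per-speaker folders into a single LJSpeech-style training list; the `audio|text` line format, the file names, and the folder layout are assumptions on my part, not necessarily what this repo's training scripts expect.

```python
# Rough sketch: merge per-speaker transcript lists into one training list.
# Assumes each speaker folder has an LJSpeech-style "metadata.csv" with
# "relative/path.wav|transcription" lines (format and paths are assumptions).
from pathlib import Path

speakers = ["vctk_p225", "vctk_p226", "my_custom_voice"]  # hypothetical names
dataset_root = Path("training/multi_speaker")

merged = []
for speaker in speakers:
    metadata = dataset_root / speaker / "metadata.csv"
    for line in metadata.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        wav, text = line.split("|", maxsplit=1)
        # Keep the speaker subdir in the path so the audio can still be located.
        merged.append(f"{speaker}/{wav}|{text}")

(dataset_root / "train.txt").write_text("\n".join(merged), encoding="utf-8")
```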

If any of you have successfully fine-tuned on a multi-speaker dataset, what procedure did you use? How much data did you use in total, and per speaker?

Thanks and have a great day.


Has anybody tried yet to fine tune on a large multi-speaker dataset?
I've read on the other repo that we can put all the speakers' wavs in the same folder and the model figures out itself how to deal with a multi-speaker dataset.
To quote the answer in the other repo: 'It's all learned implicitly. There's no fundamental difference between a single-speaker and a multi-speaker dataset apart from the variance of the distribution of conditioning latents && predicted audio'.

But it seems to me, correct me if I'm wrong, that in this repo the conditioning latent is computed only once given the voice folder, so if I put the whole dataset in one folder it would compute a single voice conditioning latent that is the average of all the dataset. Am I right?
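
To illustrate what I mean, here is a hedged sketch of the difference between computing one latent per speaker folder and pooling everything into a single folder (API names are assumed from the upstream tortoise-tts repo, and the folder names are just examples):

```python
# Hedged sketch: per-speaker latents vs. one latent pooled over every speaker.
# API names assumed from upstream tortoise-tts; illustrative only.
import glob

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

def latents_for(folder):
    clips = [load_audio(path, 22050) for path in glob.glob(f"{folder}/*.wav")]
    return tts.get_conditioning_latents(clips)

# What I suspect happens when the whole dataset sits in one "voice" folder:
# every speaker's clips get pooled into a single, blended latent.
blended = latents_for("voices/all_speakers_mixed")

# What you would actually want at inference time: one latent per speaker.
per_speaker = {
    name: latents_for(f"voices/{name}")
    for name in ("speaker_a", "speaker_b")  # hypothetical speaker folders
}
```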

Thanks to anyone who would answer.


Oh right, I forgot to actually test an English-but-varied speaker finetune to see how well it'd work for zero shot. desu it hasn't been something I'd care all that much to see (partially because I just figured it'd get better results just finetuning per voice I want, and partly because of VALL-E eating my brain).

> So my question is, is it possible to retain the multi-speaker capabilities of the model? My guess is that we would need to add the new speakers' datasets to the entire original training dataset (which is not available) and fine-tune the Tortoise model on this multi-speaker dataset.

I'd imagine that way is the only way to "add" a voice to the "list" of possible voices to use for zero-shot.

I'd like to think you can get away with providing a large/varied dataset of every voice you plan to use for zero-shot, and *maybe* getting better results.

> I'm curious if anyone has tried fine-tuning on a multi-speaker dataset, such as vctk, to see if the model can successfully retain all the speakers it has seen during fine-tuning.

My Japanese finetune didn't do so good, but I'm sure it's because there were way, way too many variables I was changing, so its zero-shot ability is pretty much fried.

> Also what is the approach? I was reading on the other repo that the model should be able to discriminate the different speakers and due to this

To my understanding (or what's still left of my understanding of TorToiSe), speakers are classified by their voice latents (acoustic prompts / speech conditioning), as the input token string is defined as `<speech conditioning>:<text tokens>:<mel tokens>`.

For inference, the speech conditioning tokens are your voice latents. I'm not sure how *much* of the voice latents file is the speech conditioning, though.

For training, I *think* the speech conditioning is computed for each line of the dataset, so it better matches its text tokens and mel tokens.
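
Very roughly (a toy sketch of the ordering only, not the actual TorToiSe code, and the sizes here are made up):

```python
# Toy sketch of the <speech conditioning>:<text tokens>:<mel tokens> ordering.
# The real model uses learned embeddings, start/stop tokens, etc.; sizes are made up.
import torch
import torch.nn as nn

d_model = 1024
text_emb = nn.Embedding(256, d_model)    # stand-in text-token embedding table
mel_emb = nn.Embedding(8194, d_model)    # stand-in mel/acoustic-token embedding table

conditioning = torch.randn(1, 1, d_model)        # stands in for the voice latents
text_tokens = torch.randint(0, 256, (1, 20))     # tokenized transcript
mel_tokens = torch.randint(0, 8194, (1, 120))    # acoustic codes being predicted

ar_input = torch.cat(
    [conditioning, text_emb(text_tokens), mel_emb(mel_tokens)], dim=1
)
print(ar_input.shape)  # torch.Size([1, 141, 1024]): conditioning + text + mel
```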

> there's no need to separate the dataset (a dir for each speaker)

Mhm. If you were to train a multi-speaker finetune, you can just dump it all in one folder.

You just need to be careful when you are using it for inference: when it comes to computing the voice latents, it will compute them for *all* voices, effectively creating an average of all the voices.

> If someone of you guys has successfully fine tuned on multi-speaker dataset, what procedure did you use? How much data did you use in total and for each speaker?

desu I'm not too sure if there's anything special you should do about it. You'll always be limited by what exists in your training dataset, so use as much as you can.

I'd say you can "balance" things out by making sure each voice you want to train against has the same number of lines, but there will always be *some* weight favoring one over the other if you have too small of a batch size, as the learning rate will adjust midway through (although adequate training with a good enough epoch count will fix that).
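
Something like this for the balancing step (a quick sketch assuming a merged `train.txt` whose lines look like `speaker/clip.wav|text`; adjust to however your list is actually laid out):

```python
# Quick sketch: trim every speaker down to the smallest speaker's line count.
# Assumes a merged list whose lines look like "speaker/clip.wav|text".
import random
from collections import defaultdict

lines_by_speaker = defaultdict(list)
with open("training/multi_speaker/train.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue
        speaker = line.split("/", 1)[0]
        lines_by_speaker[speaker].append(line)

cap = min(len(lines) for lines in lines_by_speaker.values())
balanced = []
for lines in lines_by_speaker.values():
    random.shuffle(lines)
    balanced.extend(lines[:cap])
random.shuffle(balanced)  # avoid long single-speaker runs within an epoch

with open("training/multi_speaker/train_balanced.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(balanced))
```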


> I've read on the other repo that we can put all the speakers' wavs in the same folder and the model figures out itself how to deal with a multi-speaker dataset.
> To quote the answer in the other repo: 'It's all learned implicitly. There's no fundamental difference between a single-speaker and a multi-speaker dataset apart from the variance of the distribution of conditioning latents && predicted audio'.

Correct. For training, the part that defines voice traits (speech conditioning / voice latents / blah blah) is associated with each line itself.

> But it seems to me, correct me if I'm wrong, that in this repo the conditioning latent is computed only once given the voice folder
> so if I put all the dataset in one folder it would compute a single voice conditioning latent which is the average of all the dataset. Am I right?

Only for inferencing, which is effectively just mixing voices together.


You *should* be fine with just dumping every voice you want into a folder and going from there with similar settings to a normal finetune. I'll need to double-check the code that derives the speech conditioning tokens from a WAV file, but I'm confident it's done per line.

Sorry for not getting around to it. I think everyone flocked to the non-English issue at the time, especially me, since that was what I was focused on.

I don't think anyone has tried a multi-speaker one, since it's a bit silly to do; if you want a specific voice, it's arguably better to just finetune for that specific voice. You *might* be able to leverage voices with similar voice latents by training alongside them, but it's ultimately up to how training goes and how it learns.


Hi @mrq, thanks for the response. I have tried dumping all the speakers into the same folder (I used the LibriTTS dev-clean subset), and after 10 epochs the model is still solid in its outputs (meaning there aren't many artifacts or strange noises).

The problem is that, whichever voice latents I use, the generated speech is always more or less the same. I did what you suggested, so I did not use all the mixed speakers' wavs inside the folder to compute the voice latents at inference.

In theory, if I provide a large and varied enough dataset, the model might 'lose' some of the voices present in the original training dataset, but it should at least be able to clone the new ones seen during fine tuning, so I think there is some problem in the code that prevents it. I'm not great at analyzing code, especially code this complex, so I've tried to figure out what the problem is, but I ended my trip through the code lost and desperate, haha.

I know you are busy with VALL-E right now, but I see very high value in maintaining multi-speaker zero-shot capabilities in Tortoise while being able to 'add' new speakers.

First of all, if the model deals with more speakers, it should be less prone to a number of the artifacts I experience when I fine tune on small single-speaker datasets (maybe because of overfitting, who knows?).

Secondly, we wouldn't need a separate single-speaker model for every new speaker we want. If we have a new speaker, we would simply add it to the mixed dataset and fine tune further for a few epochs.

I don't know how VALL-E will turn out, and I'm really excited about it, but as of now, and I have tried lots of TTS models, Tortoise is the best-sounding and most credible one I have tried, and I think there's still unexpressed potential.

Anyway, thanks for your work and your response.


There's one last thing that I keep forgetting to try and implement myself to see how the results are. I only remembered it earlier for VALL-E uses, but I don't see why it wouldn't also work for TorToiSe AR models: something [like Voldy's Web UI for Stable Diffusion mixing models](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#checkpoint-merger) *might* be a way to go about adding "voices" to the main model.

*Might*, as I'm not sure how good the results would be. I'd like to believe the AR model is large enough that you can actually do this, but it'd take me some luck and elbow grease to reimplement it for TorToiSe use.


Added ability to mix models in commit f66281f10c. I might need to actually lift Voldy's Web UI's implementation and have a third model and do three-way-merging just to make the process a little faster / better results.

I'm not so sure how well it'd work for "adding" voices to the base model, so if you want to play around with it for me, be my guest; it's under `Utilities` > `Model Mixer`. I can't really test it in depth right now, just a cursory test with merging two finetunes together. It yielded impressive results, and *technically* can be called multi-speaker, with a side effect of voices being slightly blended.


Right now it *really* likes to overpower the AR model, even at a 90% AR / 10% finetune mix. I'm sure you could get better results if you mix a multi-voice finetune into the base model.
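
For anyone curious what the mixing actually does under the hood, it's conceptually just a weighted average of the two checkpoints' weights, something like the sketch below (not the actual Model Mixer code, and real TorToiSe checkpoints may nest their state dicts differently):

```python
# Conceptual sketch of checkpoint merging: a weighted average of two state dicts,
# in the spirit of Voldy's checkpoint merger. Not the actual Model Mixer code.
import torch

def merge_checkpoints(path_a, path_b, alpha=0.9, out_path="merged.pth"):
    """alpha is the weight on checkpoint A, e.g. 0.9 for 90% base AR / 10% finetune."""
    a = torch.load(path_a, map_location="cpu")
    b = torch.load(path_b, map_location="cpu")
    merged = {}
    for key, value in a.items():
        other = b.get(key)
        if (
            torch.is_tensor(value)
            and torch.is_tensor(other)
            and value.shape == other.shape
            and value.is_floating_point()
        ):
            merged[key] = alpha * value + (1.0 - alpha) * other
        else:
            merged[key] = value  # keep A's value for anything that can't be blended
    torch.save(merged, out_path)

merge_checkpoints("autoregressive.pth", "my_finetune.pth", alpha=0.9)
```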


Thank you very much for this awesome new feature, I wasn't aware of that possibility. I've tried it, and since I'm trying to add a clone of a heavily accented speaker, the merged model nails the timbre but the pronunciation stays mostly American; as you said, it tends to blend the voices.

I have a request, if it's not too much code and hassle for you: would it be possible to apply the exact same conditioning latent to every wav spoken by a specific character?
This should solve the original problem, or at least I think so, as I believe the problem lies in how the conditioning inputs are passed given a large mixed multi-speaker dataset as it is structured now.

I was thinking: you have a dir containing the whole multi-speaker dataset; this dir is composed of subdirs, one for each speaker in the dataset, and for each subdir you have a specific conditioning latent.
Right now we would have to place all the speakers mixed together in one single directory, and I think that's where the problem arises.
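
To sketch what I have in mind (the loader side is purely hypothetical, and the API names are assumed from the upstream tortoise-tts repo):

```python
# Hypothetical sketch: one conditioning latent per speaker subdir, shared by
# every wav of that speaker. API names assumed from upstream tortoise-tts.
from pathlib import Path

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()
dataset_root = Path("training/multi_speaker")  # one subdir per speaker (assumed layout)

speaker_latents = {}
for speaker_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
    clips = [load_audio(str(wav), 22050) for wav in speaker_dir.glob("*.wav")]
    if not clips:
        continue
    # The same latent would then be reused for every training line of this speaker,
    # instead of deriving conditioning from each individual wav.
    speaker_latents[speaker_dir.name] = tts.get_conditioning_latents(clips)
```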

If it's not a huge thing to implement, I would be happy to experiment with it and let you know.

Thank you again @mrq
