Tortoise Training: 'num_conditioning_inputs' is useless #237
Hi, I'm training an Italian TTS with Tortoise on a very large Italian multi-speaker dataset with a custom Italian tokenizer, and I was able to successfully teach the model an almost perfect Italian pronunciation. I fine-tuned directly on the multi-speaker dataset. However, I had trouble controlling the speaker with the reference clips: most of the time the model would output the most represented speaker in the dataset, or sometimes random voices (of great quality, though). It seemed that no matter which reference clips I passed during inference, I wasn't able to control the speaker. So I tried to balance the number of samples from each speaker, to avoid the bias towards the most represented ones, but even with the resulting model I still wasn't able to control the speaker.
So I did some debugging and found out that the 'num_conditioning_inputs' parameter is effectively ignored during training. The code for loading the reference clips (from which the speech conditioning inputs are then computed) is written so that it always ends up using the training audio sample itself as the conditioning input. The code used to load the conditioning inputs during training is this:
(you can find it here). The problem lies in the 'load_similar_clips' function: it uses a precomputed 'similarities.pth' file to search for similar clips, but since we don't have that file, the fallback mechanism means that, no matter what 'num_conditioning_inputs' is set to, it will always use a single speech conditioning input, and that input will always be the training audio sample itself. So I tried modifying 'load_similar_clips' to search for similar clips in the same folder as the training sample, since I have the training set organized in speaker folders; a rough sketch of that change follows below.
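For illustration, here is a minimal sketch of that same-folder workaround. The function name mirrors DLAS's 'load_similar_clips', but the body is my own simplified version; the defaults and the torchaudio loading/resampling calls are assumptions, not the repository's actual code.

```python
import os
import random
import torchaudio

def load_similar_clips(sample_path, n=1, sample_rate=22050):
    """Pick conditioning clips from the same speaker folder as the training sample.

    Assumes the dataset is laid out as one folder per speaker, so every other
    clip in the folder belongs to the same speaker as `sample_path`.
    """
    speaker_dir = os.path.dirname(sample_path)
    candidates = [
        os.path.join(speaker_dir, f)
        for f in os.listdir(speaker_dir)
        if f.endswith('.wav') and os.path.join(speaker_dir, f) != sample_path
    ]
    # Only fall back to the training sample itself when the speaker has no other
    # clips, instead of always doing so as the original fallback path does.
    if not candidates:
        candidates = [sample_path]
    clips = []
    for path in random.choices(candidates, k=n):
        wav, sr = torchaudio.load(path)
        if sr != sample_rate:
            wav = torchaudio.functional.resample(wav, sr, sample_rate)
        clips.append(wav)
    return clips
```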
Although I have not solved my problem of controlling the output speaker voice at inference, I thought I'd share this finding. If anyone has advice on how to solve it, that would be great!
Thank you and have a great day!
Huh, funny. I believe it was just the other day that it crossed my mind to try and figure out that exact same thing: how DLAS handles its input prompt shuffling for better training. Being in the land of VALL-E has given me a lot of insights. Very nice find.
Bear with me though; it's been a long time since I've actually touched anything TorToiSe/DLAS, but I should still be able to offer some remarks.
That's pretty much how I felt about my generalized Japanese finetunes. No matter how I went about it, its zero-shot-ability suffered, and the only way to go about it was just to finetune it again for a specific speaker, as the model itself would already have "learnt" the language.
That's pretty much what the VALL-E implementation I forked does, and I imagine it works well enough. From what I can glean from 'load_similar_clips', it uses a custom-tailored-to-a-dataset pickled dict that would already have a list of similar candidates (by filename), so it's pretty much out of the question to try and source one yourself. But...
I guess it doesn't matter all that much in the end then, or at least, in terms of finetuning.
I don't believe I ever actually tried the French finetune, but I don't recall anyone mentioning it having issues with zero-shot voices? Although I'm not sure how much help it would be anyways, since that was finetuned back in like, February I believe as one of the first finetunes.
As for my advice, or at least what I can remember: if you're using it for zero-shot voices, it might be to narrow down your source voice clips (like, have it only be a small clip of a given voice), since I'm very, very sure my crack at the conditioning latents generation routine is inconsistent, where it works very well for some voices, but for other voices it just won't go so well. You can use the 152334H/tortoise-tts-fast fork, load your finetuned model and voice(s) into it, and see how it fares. By default, I believe it uses the old-but-probably-more-correct way to generate its conditioning latents.
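In case it helps, here is a minimal sketch of that workflow using the upstream neonbjb/tortoise-tts Python API; the fast fork and this repo expose similar entry points, but how you point them at a finetuned autoregressive checkpoint differs per fork, so that part is omitted, and the clip paths are placeholders.

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()  # load your finetuned AR model per your fork's instructions

# Narrow the reference set down to one or two short, clean clips of the target speaker.
voice_samples = [
    load_audio(p, 22050) for p in [
        'voices/speaker01/clip_a.wav',   # placeholder paths
        'voices/speaker01/clip_b.wav',
    ]
]
conditioning_latents = tts.get_conditioning_latents(voice_samples)

gen = tts.tts_with_preset(
    'Frase di prova in italiano.',
    voice_samples=None,
    conditioning_latents=conditioning_latents,
    preset='standard',
)
torchaudio.save('out.wav', gen.squeeze(0).cpu(), 24000)
```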
Hi @mrq, thank you for the response. I've already tried the tortoise-tts-fast fork, and it behaves the same.
When I talk about not being able to control the output speaker, I'm not expecting zero-shot control over unseen voices; I would at least like to control the speaker identity for the speakers that were present in the training set. It almost seems that the model ignores whichever conditioning inputs I pass at inference. And it's a shame, because I know the model is capable of speaking well in several different speakers' voices, since it outputs all kinds of voices; the problem is that I can't get any control over which one it will actually output. Anyway, I know that some people have managed to get a working multi-speaker Tortoise-TTS model. Of course I still have the option of specializing a new Tortoise-TTS model for each speaker by further fine-tuning the Italian model I have now, but it would be great to have a single model able to deal with different speakers rather than a new model for each new speaker.
As a final consideration: in my opinion Tortoise is still the best open-source TTS to this day; it has exceptional expressivity and is able to learn a new language perfectly. Unfortunately it seems to be a nightmare to generalize good training configs, and it has all kinds of quirks when it comes to getting consistent results from its training.
The code to generate similarities.pth seems to be here:
https://github.com/neonbjb/DL-Art-School/blob/master/codes/scripts/audio/preparation/phase_3_generate_similarities.py#L108
along with the yml config and a pretrained voice CLIP model here:
https://huggingface.co/jbetker/tortoise-filtering-models/tree/main
Maybe you can try it, @Brugio96.
(I am still collecting data and training.)
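For anyone going that route, as far as I can tell the script writes a per-folder 'similarities.pth' that is just a pickled dict mapping each clip's filename to a list of its most similar neighbours; the exact keys depend on how the script is run, so treat the shape below as an assumption rather than a verbatim dump.

```python
import torch

# Assumed structure: clip filename -> filenames of its most similar clips
# in the same folder (what load_similar_clips expects to index into).
similarities = {
    '0001.wav': ['0005.wav', '0012.wav', '0031.wav'],
    '0002.wav': ['0007.wav', '0019.wav', '0044.wav'],
}
torch.save(similarities, 'similarities.pth')

# During training, the dataset would then look up candidates like this:
candidates = torch.load('similarities.pth').get('0001.wav', [])
```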