Tortoise Training: 'num_conditioning_inputs' is useless #237
Hi, I'm training an Italian TTS with Tortoise on a very large Italian multi-speaker dataset with a custom Italian tokenizer, and I was able to successfully teach the model an almost perfect Italian pronunciation. I fine-tuned directly on the multi-speaker dataset. However, I had trouble controlling the speaker with the reference clips: most of the time the model would output the most represented speaker in the dataset, or sometimes random voices (of great quality, though). It seemed that no matter which reference clips I passed during inference, I wasn't able to control the speaker. So I tried to balance the number of samples from each speaker, to avoid the bias towards the most represented ones, but even with the resulting model I still wasn't able to control the speaker.
So I did some debugging and found out that the 'num_conditioning_inputs' parameter is effectively ignored during training. The code for loading the reference clips (from which the speech conditioning inputs are then computed) is written so that it always ends up using the training audio sample itself as the conditioning input. The code used to load the conditioning inputs during training is this:
(you can find it here). The problem lies in the 'load_similar_clips' function: it uses a precomputed 'similarities.pth' file to search for similar clips, but since we don't have that file, the fallback mechanism means that, no matter what 'num_conditioning_inputs' is set to, it will always use a single speech conditioning input, and that input will always be the training audio sample itself. So I tried modifying 'load_similar_clips' to search for similar clips in the same folder as the training sample, since I have the training set organized in speaker folders; a rough sketch of that change follows below.
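For illustration, here is a minimal sketch of that same-folder workaround. The function name mirrors DLAS's 'load_similar_clips', but the body is my own simplified version; the defaults and the torchaudio loading/resampling calls are assumptions, not the repository's actual code.

```python
import os
import random
import torchaudio

def load_similar_clips(sample_path, n=1, sample_rate=22050):
    """Pick conditioning clips from the same speaker folder as the training sample.

    Assumes the dataset is laid out as one folder per speaker, so every other
    clip in the folder belongs to the same speaker as `sample_path`.
    """
    speaker_dir = os.path.dirname(sample_path)
    candidates = [
        os.path.join(speaker_dir, f)
        for f in os.listdir(speaker_dir)
        if f.endswith('.wav') and os.path.join(speaker_dir, f) != sample_path
    ]
    # Only fall back to the training sample itself when the speaker has no other
    # clips, instead of always doing so as the original fallback path does.
    if not candidates:
        candidates = [sample_path]
    clips = []
    for path in random.choices(candidates, k=n):
        wav, sr = torchaudio.load(path)
        if sr != sample_rate:
            wav = torchaudio.functional.resample(wav, sr, sample_rate)
        clips.append(wav)
    return clips
```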
Although I have not solved my problem of controlling the output speaker voice at inference, I thought I'd share this finding. If anyone has advice on how to solve it, that would be great!
Thank you and have a great day!
Huh, funny. I believe it was just the other day that it crossed my mind to try and figure out that exact same thing: how DLAS handles its input prompt shuffling for better training. Being in the land of VALL-E has given me a lot of insights. Very nice find.
Bear with me though; it's been a long time since I've actually touched anything TorToiSe/DLAS, but I should still be able to offer some remarks.
That's pretty much how I felt about my generalized Japanese finetunes. No matter how I went about it, its zero-shot-ability suffered, and the only way to go about it was just to finetune it again for a specific speaker, as the model itself would already have "learnt" the language.
That's pretty much what the VALL-E implementation I forked does, and I imagine it works well enough. From what I can glean from 'load_similar_clips', it uses a custom-tailored-to-a-dataset pickled dict that would already have a list of similar candidates (by filename), so it's pretty much out of the question to try and source one yourself. But...
I guess it doesn't matter all that much in the end then, or at least, in terms of finetuning.
I don't believe I ever actually tried the French finetune, but I don't recall anyone mentioning it having issues with zero-shot voices? Although I'm not sure how much help it would be anyways, since that was finetuned back in like, February I believe as one of the first finetunes.
As for my advice, or at least what I can remember: if you're using it for zero-shot voices, it might be to narrow down your source voice clips (like, have it only be a small clip of a given voice), since I'm very, very sure my crack at the conditioning latents generation routine is inconsistent, where it works very well for some voices, but for other voices it just won't go so well. You can use the 152334H/tortoise-tts-fast fork, load your finetuned model and voice(s) into it, and see how it fares. By default, I believe it uses the old-but-probably-more-correct way to generate its conditioning latents.
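In case it helps, here is a minimal sketch of that workflow using the upstream neonbjb/tortoise-tts Python API; the fast fork and this repo expose similar entry points, but how you point them at a finetuned autoregressive checkpoint differs per fork, so that part is omitted, and the clip paths are placeholders.

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()  # load your finetuned AR model per your fork's instructions

# Narrow the reference set down to one or two short, clean clips of the target speaker.
voice_samples = [
    load_audio(p, 22050) for p in [
        'voices/speaker01/clip_a.wav',   # placeholder paths
        'voices/speaker01/clip_b.wav',
    ]
]
conditioning_latents = tts.get_conditioning_latents(voice_samples)

gen = tts.tts_with_preset(
    'Frase di prova in italiano.',
    voice_samples=None,
    conditioning_latents=conditioning_latents,
    preset='standard',
)
torchaudio.save('out.wav', gen.squeeze(0).cpu(), 24000)
```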
Hi @mrq, thank you for the response. I've already tried the tortoise-tts-fast fork, and it behaves the same.
When I talk about not being able to control the output speaker, I'm not expecting zero-shot control over unseen voices; I would at least like to control the speaker identity for the speakers that were present in the training set. It almost seems that the model ignores whichever conditioning inputs I pass at inference. And it's a shame, because I know the model is capable of speaking well in several different speakers' voices, since it outputs all kinds of voices; the problem is that I can't get any control over which one it will actually output. Anyway, I know that some people have managed to get a working multi-speaker Tortoise-TTS model. Of course I still have the option of specializing a new Tortoise-TTS model for each speaker by further fine-tuning the Italian model I have now, but it would be great to have a single model able to deal with different speakers rather than a new model for each new speaker.
As a final consideration: in my opinion Tortoise is still the best open-source TTS to this day; it has exceptional expressivity and is able to learn a new language perfectly. Unfortunately it seems to be a nightmare to generalize good training configs, and it has all kinds of quirks when it comes to getting consistent results from its training.
The code to generate similarities.pth seems to be here:
https://github.com/neonbjb/DL-Art-School/blob/master/codes/scripts/audio/preparation/phase_3_generate_similarities.py#L108
along with the yml config and a pretrained voice CLIP model here:
https://huggingface.co/jbetker/tortoise-filtering-models/tree/main
Maybe you can try it, @Brugio96.
(I am still collecting data and training.)
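For anyone going that route, as far as I can tell the script writes a per-folder 'similarities.pth' that is just a pickled dict mapping each clip's filename to a list of its most similar neighbours; the exact keys depend on how the script is run, so treat the shape below as an assumption rather than a verbatim dump.

```python
import torch

# Assumed structure: clip filename -> filenames of its most similar clips
# in the same folder (what load_similar_clips expects to index into).
similarities = {
    '0001.wav': ['0005.wav', '0012.wav', '0031.wav'],
    '0002.wav': ['0007.wav', '0019.wav', '0044.wav'],
}
torch.save(similarities, 'similarities.pth')

# During training, the dataset would then look up candidates like this:
candidates = torch.load('similarities.pth').get('0001.wav', [])
```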