Weird sound after most lines #325

Open
opened 2023-08-10 16:35:10 +00:00 by LunarLoomLagoon · 4 comments

I've trained multiple models that are capable of producing pretty good output.

However, both models make a weird sound, like they're having a stroke, at the end of most lines (or at the beginning of a new line; it's hard to distinguish which). It's like a moaning/groaning/robot stroke.

Other than that the output is basically fantastic. What is causing this?

Owner

I don't quite recall hearing that at the beginning of lines. I'm very confident it's an issue at the end of lines, and when everything is getting stitched together, it just seems like it's for the beginning of a new line.

From what I remember with [this previous issue](https://git.ecker.tech/mrq/ai-voice-cloning/issues/226#issuecomment-1843):

  • you should see a warning about your generated output not having any stop tokens, caused by the input text being too long (effectively too big a "context" for TorToiSe). The training configuration caps audio duration at 11.6s, so any output that would end up longer is effectively "undefined behavior".
  • if you aren't getting any warning about the input text being too long, play around with the `Length Penalty` to try and wrangle it in. I honestly don't recall if that ever made a difference, but I suppose if it didn't work it wouldn't be a feature in base TorToiSe.
  • and just to toss it into the air, VoiceFixer will have a weird brief artifact at the end of generations; nothing I would describe as croaking or a stroke, but I remember VoiceFixer consistently having said artifact.
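As a hedged illustration of the first bullet: a generated clip can be checked against that training-time duration cap before trusting its tail. The 24 kHz sample rate here is an assumption about TorToiSe's output format, not something stated above.

```python
# Training config caps audio duration at 11.6s (per the comment above);
# anything longer is "undefined behavior" territory for the AR model.
MAX_TRAIN_SECONDS = 11.6

def exceeds_training_cap(num_samples, sample_rate=24000):
    """True if a generated clip is longer than the duration the model was
    trained on. `sample_rate=24000` is an assumed output rate."""
    return num_samples / sample_rate > MAX_TRAIN_SECONDS
```

A clip of 12 seconds at the assumed rate would be flagged, while a 5-second one would not.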

It seems to have been a bug that's cleared after rebooting. I was not receiving the errors you described.

One other question, as this project for me is related to producing audiobooks from ebooks:

Currently, this sort of output is typical:

```
[1/18] Generating line: An E-rank hunter volunteering to go inside an A-rank dungeon?!
Loading autoregressive model: C:\Repos\ai-voice-cloning\models\tortoise\autoregressive.pth
Loaded autoregressive model
Loading voice: wavenet with model d1f79232
Loading voice: wavenet
Reading from latent: ./voices\wavenet\cond_latents_d1f79232.pth
Loading autoregressive model: C:\Repos\ai-voice-cloning\models\tortoise\autoregressive.pth
Loaded autoregressive model
Generating line took 17.155798196792603 seconds
[2/18] Generating line: The group erupted in pandemonium.
Loading autoregressive model: C:\Repos\ai-voice-cloning\models\tortoise\autoregressive.pth
Loaded autoregressive model
Loading autoregressive model: C:\Repos\ai-voice-cloning\models\tortoise\autoregressive.pth
Loaded autoregressive model
Generating line took 16.486565589904785 seconds
[3/18] Generating line: “You want to go when it’s absolutely swarming with high-level beasts in there?” “What the heck are you thinking, Sung?” “You’re still young with so much to live for!
Loading autoregressive model: C:\Repos\ai-voice-cloning\models\tortoise\autoregressive.pth
Loaded autoregressive model
Loading autoregressive model: C:\Repos\ai-voice-cloning\models\tortoise\autoregressive.pth
Loaded autoregressive model
Generating line took 19.02802038192749 seconds
[4/18] Generating line: Don’t go risking your life just for a little bit of extra cash!” The excavation team swarmed around Jinwoo.
Loading autoregressive model: C:\Repos\ai-voice-cloning\models\tortoise\autoregressive.pth
Loaded autoregressive model
Loading autoregressive model: C:\Repos\ai-voice-cloning\models\tortoise\autoregressive.pth
Loaded autoregressive model
Generating line took 17.855751991271973 seconds
```

But I'm certain the first few times I generated some content, it didn't have a bunch of loading steps in between each line. Is there a way to fix that? All that loading accounts for 90% of the time it takes to generate. The sentences themselves take like 2-6 seconds each (might even be faster if it could use both 4090s).

Owner

> But I'm certain the first few times I generated some content, it didn't have a bunch of loading steps in between each line. Is there a way to fix that? All that loading accounts for 90% of the time it takes to generate. The sentences themselves take like 2-6 seconds each (might even be faster if it could use both 4090s).

mmm, I did some cursory glances and added some more aggressive checks to ensure that the model doesn't get reloaded on the TorToiSe side in commit [`9afa7154`](https://git.ecker.tech/mrq/tortoise-tts/commit/9afa71542bfbf9810bcd533489b5ca0c5b30fdee). A `git pull origin master` in `.\ai-voice-cloning\modules\tortoise-tts\` should definitely update it; I'm not too sure the update script would, as I might have to bump up the submodule commit hash for the AIVC repo.
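A minimal sketch of the kind of reload guard being described: cache the loaded model keyed by the path's canonical form, so two spellings of the same checkpoint path hit one cache entry. The `loader` callable here is hypothetical, standing in for whatever actually reads the checkpoint; this is not the project's real code.

```python
import os

_model_cache = {}

def get_model(path, loader):
    """Return a cached model for `path`, calling `loader` only on first use.

    Keyed by the canonical path, so e.g. "models/ar.pth" and "./models/ar.pth"
    resolve to the same cache entry instead of triggering a reload.
    """
    key = os.path.realpath(path)
    if key not in _model_cache:
        _model_cache[key] = loader(path)
    return _model_cache[key]
```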

I'm not too sure what exactly was causing the problem, but I'll just write it off as "I was being very stupid with assuming a path string would be the same at all times" and instead just opted to rely on `os.path.samefile` to determine if two path strings point to the same file. It seems to work now, at least on whatever's left of my testing environment on Windows for TorToiSe.
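A small demonstration of why comparing path strings is fragile while `os.path.samefile` is not: two spellings of the same file compare unequal as strings but resolve to one file on disk.

```python
import os
import tempfile

def compare_paths():
    """Return (string_equal, same_file) for two spellings of one checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        model = os.path.join(d, "autoregressive.pth")
        open(model, "w").close()
        # Same file reached through a redundant "." path component:
        alias = os.path.join(d, ".", "autoregressive.pth")
        # A naive string check says "different model"; samefile does not.
        return model == alias, os.path.samefile(model, alias)
```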


What seemed to work for me was to reset all generation parameters to default, then progressively re-tweak them back to my previous configuration. We’ll see whether it’s a long-term solution!

EDIT: a month later, this technique seems to always end up working eventually. Also, I've noticed that if you put an empty line between two lines, instead of beginning your new line right under the previous one, you'll get such artefacts.
For instance, if you do:
"The first line

The second line"

Instead of

"The first line
The second line"

You'll likely have artefacts if you use the default `\n` delimiter.

EDIT2: This also means that if you have a space or a line break after one of your lines in your prompt, it will generate an artefact. For instance:

"This is a first line"

You must make sure that there's no line break or trailing space after that line.
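Both edits above boil down to sanitizing the prompt before splitting it on the delimiter. A minimal sketch of that pre-processing (not the project's actual code):

```python
def split_prompt(text):
    """Split a prompt on newlines, dropping blank lines and the stray
    leading/trailing whitespace that tends to produce end-of-line artefacts."""
    return [line.strip() for line in text.split("\n") if line.strip()]
```

For example, a prompt with an empty line between its two lines and a trailing newline comes out as just the two clean lines.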

Reference: mrq/ai-voice-cloning#325