forked from mrq/tortoise-tts
I didn't have to suck off a wizard for DirectML support (courtesy of https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/7600 for leading the way)
parent 50b4e2c458
commit 3f8302a680
README.md (23 changed lines)
@@ -1,13 +1,9 @@
 # AI Voice Cloning for Retards and Savants
 
-This [rentry](https://rentry.org/AI-Voice-Cloning/) aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows (with an Nvidia GPU), as well as a stepping stone for anons that genuinely want to play around with [TorToiSe](https://github.com/neonbjb/tortoise-tts).
+This [rentry](https://rentry.org/AI-Voice-Cloning/) aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows, as well as a stepping stone for anons that genuinely want to play around with [TorToiSe](https://github.com/neonbjb/tortoise-tts).
 
 Similar to my own findings for Stable Diffusion image generation, this rentry may appear a little disheveled as I note my new findings with TorToiSe. Please keep this in mind if the guide seems to shift a bit or sound confusing.
 
->\>B-but what about the colab notebook/hugging space instance??
-
-I link those a bit later on as alternatives for Windows+AMD users. You're free to skip the installation section and jump after that.
-
 >\>Ugh... why bother when I can just abuse 11.AI?
 
 I very much encourage (You) to use 11.AI while it's still viable to use. For the layman, it's easier to go through the hoops of coughing up the $5 or abusing the free trial over actually setting up a TorToiSe environment and dealing with its quirks.
@@ -39,16 +35,15 @@ My fork boasts the following additions, fixes, and optimizations:
 - additionally, regenerating them if the script detects they're out of date
 * uses the entire audio sample instead of the first four seconds of each sound file for better reproducing
 * activated unused DDIM sampler
-* ease of setup for the most inexperienced Windows users
 * use of some optimizations like `kv_cache`ing for the autoregression sample pass, and keeping data on GPU
+* compatibility with DirectML
+* easy install scripts
 * and more!
 
 ## Installing
 
 Outside of the very small prerequisites, everything needed to get TorToiSe working is included in the repo.
 
-For Windows users with an AMD GPU, ~~tough luck, as ROCm drivers are not (easily) available for Windows, and requires inane patches with PyTorch.~~ you're almost in luck, as hardware acceleration for any\* device is possible with PyTorch-DirectML. **!**NOTE**!**: DirectML support is currently being worked on, so for now, consider using the [Colab notebook](https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing), or the [Hugging Face space](https://huggingface.co/spaces/mdnestor/tortoise), for `tortoise-tts`. **!**NOTE**!**: these two do not use this repo's fork.
-
 ### Pre-Requirements
 
 Windows:
@@ -71,16 +66,22 @@ After installing Python, open the Start Menu and search for `Command Prompt`. Ty
 Paste `git clone https://git.ecker.tech/mrq/tortoise-tts` to download TorToiSe and additional scripts, then hit Enter. Inexperienced users can just download the repo as a ZIP, and extract.
 
 Afterwards, run the setup script, depending on your GPU, to automatically set things up.
-* ~~AMD: `setup-directml.bat`~~
+* AMD: `setup-directml.bat`
 * NVIDIA: `setup-cuda.bat`
 
 If you've done everything right, you shouldn't have any errors.
 
 ##### Note on DirectML Support
 
-At first, I thought it was just one simple problem that needed to be fixed, but as I picked at it and did a new install (having CUDA enabled too caused some things to silently "work" despite using DML instead), more problems cropped up, exposing that PyTorch-DirectML isn't quite ready yet.
+PyTorch-DirectML is very, very experimental and is still not production quality. There's some headaches with the need for hairy kludgy patches.
 
-I doubt even if I sucked off a wizard, there'd still be other problems cropping up.
+These patches rely on transferring the tensor between the GPU and CPU as a hotfix, so performance is definitely harmed.
 
+Both the conditional latent computation and the vocoder pass have to be done on the CPU entirely because of some quirks with DirectML.
+
+On my 6800XT, VRAM usage climbs almost the entire 16GiB, so be wary if you OOM somehow. Low VRAM flags may NOT have any additional impact from the constant copying anyways.
+
+For AMD users, I still might suggest using Linux+ROCm as it's (relatively) headache free, but I had stability problems.
+
 #### Linux
 
|
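The DirectML note in the README hunk above says that the conditional latent computation and the vocoder pass run entirely on the CPU. A minimal sketch of that round trip as a context manager follows. It assumes a hypothetical helper named `temporarily_on_cpu` (the commit itself uses inline `if get_device_name() == "dml":` blocks) and a module-like object whose `.to(device)` moves it in place, as `torch.nn.Module.to` does:

```python
from contextlib import contextmanager


@contextmanager
def temporarily_on_cpu(module, home_device):
    """Move `module` to the CPU for the duration of the block, then back.

    Hypothetical helper for illustration; not part of this commit.
    """
    module.to("cpu")
    try:
        yield module
    finally:
        # Restore the module even if the wrapped pass raises.
        module.to(home_device)
```

Any inputs to the wrapped pass would have to be moved to the CPU as well, which is what the commit does by hand for `text_tokens`, `best_results`, and `auto_conditioning`.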
@@ -1,4 +1,4 @@
 call .\tortoise-venv\Scripts\activate.bat
-python .\app.py
+accelerate launch --num_cpu_threads_per_process=6 app.py
 deactivate
 pause
@@ -176,7 +176,10 @@ def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_la
             model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
             verbose=verbose, progress=progress, desc=desc)
 
-        return denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
+        mel = denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
+        if get_device_name() == "dml":
+            mel = mel.cpu()
+        return mel
 
 
 def classify_audio_clip(clip):
@@ -449,6 +452,9 @@ class TextToSpeech:
 :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length.
 Sample rate is 24kHz.
 """
+if get_device_name() == "dml":
+    half_p = False
+
 self.diffusion.enable_fp16 = half_p
 deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)
 
@@ -477,6 +483,8 @@ class TextToSpeech:
 with torch.no_grad():
     samples = []
     num_batches = num_autoregressive_samples // self.autoregressive_batch_size
+    if num_autoregressive_samples < self.autoregressive_batch_size:
+        num_autoregressive_samples = 1
     stop_mel_token = self.autoregressive.stop_mel_token
     calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
 
@@ -553,16 +561,31 @@ class TextToSpeech:
 if not self.minor_optimizations:
     self.autoregressive = self.autoregressive.to(self.device)
 
+if get_device_name() == "dml":
+    text_tokens = text_tokens.cpu()
+    best_results = best_results.cpu()
+    auto_conditioning = auto_conditioning.cpu()
+    self.autoregressive = self.autoregressive.cpu()
+
 best_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
     torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results,
     torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
     return_latent=True, clip_inputs=False)
 
+if get_device_name() == "dml":
+    self.autoregressive = self.autoregressive.to(self.device)
+    best_results = best_results.to(self.device)
+    best_latents = best_latents.to(self.device)
+
 if not self.minor_optimizations:
     self.autoregressive = self.autoregressive.cpu()
     self.diffusion = self.diffusion.to(self.device)
     self.vocoder = self.vocoder.to(self.device)
 
+if get_device_name() == "dml":
+    self.vocoder = self.vocoder.cpu()
+
+del text_tokens
 del auto_conditioning
 
 wav_candidates = []
@@ -584,6 +607,7 @@ class TextToSpeech:
 mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, diffusion_conditioning,
     temperature=diffusion_temperature, verbose=verbose, progress=progress, desc="Transforming autoregressive outputs into audio..", sampler=diffusion_sampler,
     input_sample_rate=self.input_sample_rate, output_sample_rate=self.output_sample_rate)
+
 wav = self.vocoder.inference(mel)
 wav_candidates.append(wav.cpu())
 
|
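The `num_batches` hunk above touches the batch arithmetic: `num_autoregressive_samples // self.autoregressive_batch_size` floors to 0 whenever fewer samples than one batch are requested. The sketch below shows that arithmetic, plus one possible way to split a sample count into batch sizes so that no request yields zero batches. `plan_batches` is a hypothetical helper for illustration only. It is not code from this commit and not necessarily how the fork handles the case:

```python
def plan_batches(num_samples: int, batch_size: int) -> list[int]:
    """Split num_samples into per-batch sizes, keeping any remainder as a final short batch."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    full, rest = divmod(num_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


# Floor division alone drops the request entirely when it is smaller than one batch.
print(2 // 16)              # 0
print(plan_batches(2, 16))  # [2]
print(plan_batches(10, 4))  # [4, 4, 2]
```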
@@ -8,7 +8,7 @@ import torch.nn.functional as F
 from torch import autocast
 
 from tortoise.models.arch_util import normalization, AttentionBlock
-
+from tortoise.utils.device import get_device_name
 
 def is_latent(t):
     return t.dtype == torch.float
@@ -141,7 +141,7 @@ class DiffusionTts(nn.Module):
 in_tokens=8193,
 out_channels=200, # mean and variance
 dropout=0,
-use_fp16=True,
+use_fp16=False,
 num_heads=16,
 # Parameters for regularization.
 layer_drop=.1,
@@ -302,7 +302,8 @@ class DiffusionTts(nn.Module):
         unused_params.extend(list(lyr.parameters()))
     else:
         # First and last blocks will have autocast disabled for improved precision.
-        with autocast(x.device.type, enabled=self.enable_fp16 and i != 0):
+        # x.device.type
+        with autocast(device_type='cuda', enabled=self.enable_fp16 and i != 0):
             x = lyr(x, time_emb)
 
 x = x.float()
tortoise/models/vocoder.py (0 changed lines)
Normal file → Executable file
@@ -1,37 +1,9 @@
 import torch
 
 def has_dml():
-    """
-    # huggingface's transformer/GPT2 model will just lead to a long track of problems
-    # I will suck off a wizard if he gets this remedied somehow
-    """
-    """
-    # Note 1:
-    # self.inference_model.generate will lead to this error in torch.LongTensor.new:
-    # RuntimeError: new(): expected key in DispatchKeySet(CPU, CUDA, HIP, XLA, MPS, IPU, XPU, HPU, Lazy, Meta) but got: PrivateUse1
-    # Patching "./venv/lib/site-packages/transformers/generation_utils.py:1906" with:
-    # unfinished_sequences = input_ids.new_tensor(input_ids.shape[0], device=input_ids.device).fill_(1)
-    # "fixes" it, but meets another error/crash about an unimplemented functions.........
-    """
-    """
-    # Note 2:
-    # torch.load() will gripe about something CUDA not existing
-    # remedy this with passing map_location="cpu"
-    """
-    """
-    # Note 3:
-    # stft requires device='cpu' or it'll crash about some error about an unimplemented function I do not remember
-    """
-    """
-    # Note 4:
-    # 'Tensor.multinominal' and 'Tensor.repeat_interleave' throws errors about being unimplemented and falls back to CPU and crashes
-    """
-    return False
-    """
     import importlib
     loader = importlib.find_loader('torch_directml')
     return loader is not None
-    """
 
 def get_device_name():
     name = 'cpu'
|
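`importlib.find_loader`, which the re-enabled `has_dml()` body above relies on, has been deprecated since Python 3.4 and was removed in Python 3.12. Below is a sketch of the same availability check using `importlib.util.find_spec`. `has_module` is a hypothetical helper name introduced here for illustration, and this variant is not what the commit ships:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def has_dml() -> bool:
    # Same intent as the commit's check: is the torch_directml package installed?
    return has_module("torch_directml")
```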
@@ -69,3 +41,22 @@ def get_device_batch_size():
     elif availableGb > 7:
         return 4
     return 1
+
+if has_dml():
+    _cumsum = torch.cumsum
+    _repeat_interleave = torch.repeat_interleave
+    _multinomial = torch.multinomial
+
+    _Tensor_new = torch.Tensor.new
+    _Tensor_cumsum = torch.Tensor.cumsum
+    _Tensor_repeat_interleave = torch.Tensor.repeat_interleave
+    _Tensor_multinomial = torch.Tensor.multinomial
+
+    torch.cumsum = lambda input, *args, **kwargs: ( _cumsum(input.to("cpu"), *args, **kwargs).to(input.device) )
+    torch.repeat_interleave = lambda input, *args, **kwargs: ( _repeat_interleave(input.to("cpu"), *args, **kwargs).to(input.device) )
+    torch.multinomial = lambda input, *args, **kwargs: ( _multinomial(input.to("cpu"), *args, **kwargs).to(input.device) )
+
+    torch.Tensor.new = lambda self, *args, **kwargs: ( _Tensor_new(self.to("cpu"), *args, **kwargs).to(self.device) )
+    torch.Tensor.cumsum = lambda self, *args, **kwargs: ( _Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device) )
+    torch.Tensor.repeat_interleave = lambda self, *args, **kwargs: ( _Tensor_repeat_interleave(self.to("cpu"), *args, **kwargs).to(self.device) )
+    torch.Tensor.multinomial = lambda self, *args, **kwargs: ( _Tensor_multinomial(self.to("cpu"), *args, **kwargs).to(self.device) )
|
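The seven lambdas added at the end of this file all follow one pattern: copy the input to the CPU, call the saved original, then copy the result back to the device the input came from. A minimal sketch of that pattern as a reusable wrapper follows. `cpu_fallback` is a hypothetical name that does not appear in the commit, and the wrapper only assumes that the first argument has a `.device` attribute and a `.to(device)` method, as torch tensors do:

```python
import functools


def cpu_fallback(fn):
    """Wrap `fn` so that its first argument is processed on the CPU.

    Hypothetical helper for illustration; not part of this commit.
    """
    @functools.wraps(fn)
    def wrapper(first, *args, **kwargs):
        home = first.device                       # remember where the data lives
        result = fn(first.to("cpu"), *args, **kwargs)
        return result.to(home)                    # hand the result back to that device
    return wrapper
```

With PyTorch installed, `torch.cumsum = cpu_fallback(torch.cumsum)` should behave like the first lambda in the hunk. Every call still pays for two device copies, which matches the performance warning in the README note.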