I didn't have to suck off a wizard for DirectML support (courtesy of https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/7600 for leading the way)

2023-02-09 05:05:21 +00:00 · 2023-02-09 05:05:21 +00:00 · 3f8302a680
commit 3f8302a680
parent 50b4e2c458
6 changed files with 62 additions and 45 deletions
--- a/README.md
+++ b/README.md
@ -1,13 +1,9 @@
 # AI Voice Cloning for Retards and Savants
-This [rentry](https://rentry.org/AI-Voice-Cloning/) aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows (with an Nvidia GPU), as well as a stepping stone for anons that genuinely want to play around with [TorToiSe](https://github.com/neonbjb/tortoise-tts).
+This [rentry](https://rentry.org/AI-Voice-Cloning/) aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows, as well as a stepping stone for anons that genuinely want to play around with [TorToiSe](https://github.com/neonbjb/tortoise-tts).
 Similar to my own findings for Stable Diffusion image generation, this rentry may appear a little disheveled as I note my new findings with TorToiSe. Please keep this in mind if the guide seems to shift a bit or sound confusing.
 >\>B-but what about the colab notebook/hugging space instance??
 I link those a bit later on as alternatives for Windows+AMD users. You're free to skip the installation section and jump after that.
 >\>Ugh... why bother when I can just abuse 11.AI?
 I very much encourage (You) to use 11.AI while it's still viable to use. For the layman, it's easier to go through the hoops of coughing up the $5 or abusing the free trial over actually setting up a TorToiSe environment and dealing with its quirks.
@ -39,16 +35,15 @@ My fork boasts the following additions, fixes, and optimizations:
 	- additionally, regenerating them if the script detects they're out of date
 * uses the entire audio sample instead of the first four seconds of each sound file for better reproducing
 * activated unused DDIM sampler
 * ease of setup for the most inexperienced Windows users
 * use of some optimizations like `kv_cache`ing for the autoregression sample pass, and keeping data on GPU 
 * compatability with DirectML
 * easy install scripts
 * and more!
 ## Installing
 Outside of the very small prerequisites, everything needed to get TorToiSe working is included in the repo.
 For Windows users with an AMD GPU, ~~tough luck, as ROCm drivers are not (easily) available for Windows, and requires inane patches with PyTorch.~~ you're almost in luck, as hardware acceleration for any\* device is possible with PyTorch-DirectML. **!**NOTE**!**: DirectML support is currently being worked on, so for now, consider using the [Colab notebook](https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing), or the [Hugging Face space](https://huggingface.co/spaces/mdnestor/tortoise), for `tortoise-tts`. **!**NOTE**!**: these two do not use this repo's fork.
 ### Pre-Requirements
 Windows:
@ -71,16 +66,22 @@ After installing Python, open the Start Menu and search for `Command Prompt`. Ty
 Paste `git clone https://git.ecker.tech/mrq/tortoise-tts` to download TorToiSe and additional scripts, then hit Enter. Inexperienced users can just download the repo as a ZIP, and extract.
 Afterwards, run the setup script, depending on your GPU, to automatically set things up.
-* ~~AMD: `setup-directml.bat`~~
+* AMD: `setup-directml.bat`
 * NVIDIA: `setup-cuda.bat`
 If you've done everything right, you shouldn't have any errors.
 ##### Note on DirectML Support
-At first, I thought it was just one simple problem that needed to be fixed, but as I picked at it and did a new install (having CUDA enabled too caused some things to silently "work" despite using DML instead), more problems cropped up, exposing that PyTorch-DirectML isn't quite ready yet.
+PyTorch-DirectML is very, very experimental and is still not production quality. There's some headaches with the need for hairy kludgy patches.
-I doubt even if I sucked off a wizard, there'd still be other problems cropping up.
+These patches rely on transfering the tensor between the GPU and CPU as a hotfix, so performance is definitely harmed.
 Both the conditional latent computation and the vocoder pass have to be done on the CPU entirely because of some quirks with DirectML.
 On my 6800XT, VRAM usage climbs almost the entire 16GiB, so be wary if you OOM somehow. Low VRAM flags may NOT have any additional impact from the constant copying anyways.
 For AMD users, I still might suggest using Linux+ROCm as it's (relatively) headache free, but I had stability problems.
 #### Linux
--- a/start.bat
+++ b/start.bat
@ -1,4 +1,4 @@
 call .\tortoise-venv\Scripts\activate.bat
-python .\app.py
+accelerate launch --num_cpu_threads_per_process=6 app.py
 deactivate
 pause
--- a/tortoise/api.py
+++ b/tortoise/api.py
@ -176,7 +176,10 @@ def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_la
                                      model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
                                     verbose=verbose, progress=progress, desc=desc)
-        return denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
+        mel = denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
        if get_device_name() == "dml":
            mel = mel.cpu()
        return mel
 def classify_audio_clip(clip):
@ -449,6 +452,9 @@ class TextToSpeech:
        :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length.
                 Sample rate is 24kHz.
        """
        if get_device_name() == "dml":
            half_p = False
        self.diffusion.enable_fp16 = half_p
        deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)
@ -477,6 +483,8 @@ class TextToSpeech:
        with torch.no_grad():
            samples = []
            num_batches = num_autoregressive_samples // self.autoregressive_batch_size
            if num_autoregressive_samples < self.autoregressive_batch_size:
                num_autoregressive_samples = 1
            stop_mel_token = self.autoregressive.stop_mel_token
            calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
@ -553,16 +561,31 @@ class TextToSpeech:
            if not self.minor_optimizations:
                self.autoregressive = self.autoregressive.to(self.device)
            if get_device_name() == "dml":
                text_tokens = text_tokens.cpu()
                best_results = best_results.cpu()
                auto_conditioning = auto_conditioning.cpu()
                self.autoregressive = self.autoregressive.cpu()
            best_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
                                               torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results,
                                               torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
                                               return_latent=True, clip_inputs=False)
            if get_device_name() == "dml":
                self.autoregressive = self.autoregressive.to(self.device)
                best_results = best_results.to(self.device)
                best_latents = best_latents.to(self.device)
            if not self.minor_optimizations:
                self.autoregressive = self.autoregressive.cpu()
                self.diffusion = self.diffusion.to(self.device)
                self.vocoder = self.vocoder.to(self.device)
            if get_device_name() == "dml":
                self.vocoder = self.vocoder.cpu()
            del text_tokens
            del auto_conditioning
            wav_candidates = []
@ -584,6 +607,7 @@ class TextToSpeech:
                mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, diffusion_conditioning,
                                               temperature=diffusion_temperature, verbose=verbose, progress=progress, desc="Transforming autoregressive outputs into audio..", sampler=diffusion_sampler,
                                               input_sample_rate=self.input_sample_rate, output_sample_rate=self.output_sample_rate)
                wav = self.vocoder.inference(mel)
                wav_candidates.append(wav.cpu())
--- a/tortoise/models/diffusion_decoder.py
+++ b/tortoise/models/diffusion_decoder.py
@ -8,7 +8,7 @@ import torch.nn.functional as F
 from torch import autocast
 from tortoise.models.arch_util import normalization, AttentionBlock
-
+from tortoise.utils.device import get_device_name
 def is_latent(t):
    return t.dtype == torch.float
@ -141,7 +141,7 @@ class DiffusionTts(nn.Module):
            in_tokens=8193,
            out_channels=200,  # mean and variance
            dropout=0,
-            use_fp16=True,
+            use_fp16=False,
            num_heads=16,
            # Parameters for regularization.
            layer_drop=.1,
@ -302,7 +302,8 @@ class DiffusionTts(nn.Module):
                unused_params.extend(list(lyr.parameters()))
            else:
                # First and last blocks will have autocast disabled for improved precision.
-                with autocast(x.device.type, enabled=self.enable_fp16 and i != 0):
+                # x.device.type
                with autocast(device_type='cuda', enabled=self.enable_fp16 and i != 0):
                    x = lyr(x, time_emb)
        x = x.float()
--- a/tortoise/models/vocoder.py
+++ b/tortoise/models/vocoder.py
--- a/tortoise/utils/device.py
+++ b/tortoise/utils/device.py
@ -1,37 +1,9 @@
 import torch
 def has_dml():
    """
    # huggingface's transformer/GPT2 model will just lead to a long track of problems
    # I will suck off a wizard if he gets this remedied somehow
    """
    """
    # Note 1:
    # self.inference_model.generate will lead to this error in torch.LongTensor.new:
    #   RuntimeError: new(): expected key in DispatchKeySet(CPU, CUDA, HIP, XLA, MPS, IPU, XPU, HPU, Lazy, Meta) but got: PrivateUse1
    # Patching "./venv/lib/site-packages/transformers/generation_utils.py:1906" with:
    #   unfinished_sequences = input_ids.new_tensor(input_ids.shape[0], device=input_ids.device).fill_(1)
    # "fixes" it, but meets another error/crash about an unimplemented functions.........
    """
    """
    # Note 2:
    # torch.load() will gripe about something CUDA not existing
    # remedy this with passing map_location="cpu"
    """
    """
    # Note 3:
    # stft requires device='cpu' or it'll crash about some error about an unimplemented function I do not remember
    """
    """
    # Note 4:
    # 'Tensor.multinominal' and 'Tensor.repeat_interleave' throws errors about being unimplemented and falls back to CPU and crashes
    """
    return False
    """
    import importlib
    loader = importlib.find_loader('torch_directml')
    return loader is not None
    """
 def get_device_name():
    name = 'cpu'
@ -69,3 +41,22 @@ def get_device_batch_size():
        elif availableGb > 7:
            return 4
    return 1
 if has_dml():
    _cumsum = torch.cumsum
    _repeat_interleave = torch.repeat_interleave
    _multinomial = torch.multinomial
    _Tensor_new = torch.Tensor.new
    _Tensor_cumsum = torch.Tensor.cumsum
    _Tensor_repeat_interleave = torch.Tensor.repeat_interleave
    _Tensor_multinomial = torch.Tensor.multinomial
    torch.cumsum = lambda input, *args, **kwargs: ( _cumsum(input.to("cpu"), *args, **kwargs).to(input.device) )
    torch.repeat_interleave = lambda input, *args, **kwargs: ( _repeat_interleave(input.to("cpu"), *args, **kwargs).to(input.device) )
    torch.multinomial = lambda input, *args, **kwargs: ( _multinomial(input.to("cpu"), *args, **kwargs).to(input.device) )
    torch.Tensor.new = lambda self, *args, **kwargs: ( _Tensor_new(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.cumsum = lambda self, *args, **kwargs: ( _Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.repeat_interleave = lambda self, *args, **kwargs: ( _Tensor_repeat_interleave(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.multinomial = lambda self, *args, **kwargs: ( _Tensor_multinomial(self.to("cpu"), *args, **kwargs).to(self.device) )