I didn't have to suck off a wizard for DirectML support (courtesy of https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/7600 for leading the way)

2023-02-09 05:05:21 +00:00
commit 3f8302a680
parent 50b4e2c458
6 changed files with 62 additions and 45 deletions
--- a/README.md
+++ b/README.md
@@ -1,13 +1,9 @@
 # AI Voice Cloning for Retards and Savants

-This [rentry](https://rentry.org/AI-Voice-Cloning/) aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows (with an Nvidia GPU), as well as a stepping stone for anons that genuinely want to play around with [TorToiSe](https://github.com/neonbjb/tortoise-tts).
+This [rentry](https://rentry.org/AI-Voice-Cloning/) aims to serve as both a foolproof guide for setting up AI voice cloning tools for legitimate, local use on Windows, as well as a stepping stone for anons that genuinely want to play around with [TorToiSe](https://github.com/neonbjb/tortoise-tts).

 Similar to my own findings for Stable Diffusion image generation, this rentry may appear a little disheveled as I note my new findings with TorToiSe. Please keep this in mind if the guide seems to shift a bit or sound confusing.

->\>B-but what about the colab notebook/hugging space instance??
-
-I link those a bit later on as alternatives for Windows+AMD users. You're free to skip the installation section and jump after that.
-
 >\>Ugh... why bother when I can just abuse 11.AI?

 I very much encourage (You) to use 11.AI while it's still viable to use. For the layman, it's easier to go through the hoops of coughing up the $5 or abusing the free trial over actually setting up a TorToiSe environment and dealing with its quirks.
@@ -39,16 +35,15 @@ My fork boasts the following additions, fixes, and optimizations:
 	- additionally, regenerating them if the script detects they're out of date
 * uses the entire audio sample instead of the first four seconds of each sound file for better reproducing
 * activated unused DDIM sampler
-* ease of setup for the most inexperienced Windows users
 * use of some optimizations like `kv_cache`ing for the autoregression sample pass, and keeping data on GPU 
+* compatibility with DirectML
+* easy install scripts
 * and more!

 ## Installing

 Outside of the very small prerequisites, everything needed to get TorToiSe working is included in the repo.

-For Windows users with an AMD GPU, ~~tough luck, as ROCm drivers are not (easily) available for Windows, and requires inane patches with PyTorch.~~ you're almost in luck, as hardware acceleration for any\* device is possible with PyTorch-DirectML. **!**NOTE**!**: DirectML support is currently being worked on, so for now, consider using the [Colab notebook](https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing), or the [Hugging Face space](https://huggingface.co/spaces/mdnestor/tortoise), for `tortoise-tts`. **!**NOTE**!**: these two do not use this repo's fork.
-
 ### Pre-Requirements

 Windows:
@@ -71,16 +66,22 @@ After installing Python, open the Start Menu and search for `Command Prompt`. Ty
 Paste `git clone https://git.ecker.tech/mrq/tortoise-tts` to download TorToiSe and additional scripts, then hit Enter. Inexperienced users can just download the repo as a ZIP, and extract.

 Afterwards, run the setup script, depending on your GPU, to automatically set things up.
-* ~~AMD: `setup-directml.bat`~~
+* AMD: `setup-directml.bat`
 * NVIDIA: `setup-cuda.bat`

 If you've done everything right, you shouldn't have any errors.

 ##### Note on DirectML Support

-At first, I thought it was just one simple problem that needed to be fixed, but as I picked at it and did a new install (having CUDA enabled too caused some things to silently "work" despite using DML instead), more problems cropped up, exposing that PyTorch-DirectML isn't quite ready yet.
+PyTorch-DirectML is very experimental and still not production quality. Getting it to work needs some hairy, kludgy patches.

-I doubt even if I sucked off a wizard, there'd still be other problems cropping up.
+These patches rely on transferring tensors between the GPU and CPU as a hotfix, so performance definitely suffers.
+
+Both the conditional latent computation and the vocoder pass have to be done on the CPU entirely because of some quirks with DirectML.
+
+On my 6800XT, VRAM usage climbs to almost the entire 16GiB, so be wary if you OOM somehow. The low VRAM flags may NOT cost any additional performance, since tensors are constantly being copied anyway.
+
+For AMD users, I'd still suggest Linux+ROCm as it's (relatively) headache-free, though I did have stability problems with it.

 #### Linux

--- a/start.bat
+++ b/start.bat
@@ -1,4 +1,4 @@
 call .\tortoise-venv\Scripts\activate.bat
-python .\app.py
+accelerate launch --num_cpu_threads_per_process=6 app.py
 deactivate
 pause
--- a/tortoise/api.py
+++ b/tortoise/api.py
@@ -176,7 +176,10 @@ def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_la
                                      model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
                                     verbose=verbose, progress=progress, desc=desc)

-        return denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
+        mel = denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
+        if get_device_name() == "dml":
+            mel = mel.cpu()
+        return mel


 def classify_audio_clip(clip):
@@ -449,6 +452,9 @@ class TextToSpeech:
        :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length.
                 Sample rate is 24kHz.
        """
+        if get_device_name() == "dml":
+            half_p = False
+
        self.diffusion.enable_fp16 = half_p
        deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)

@@ -477,6 +483,8 @@ class TextToSpeech:
        with torch.no_grad():
            samples = []
            num_batches = num_autoregressive_samples // self.autoregressive_batch_size
+            if num_autoregressive_samples < self.autoregressive_batch_size:
+                num_autoregressive_samples = 1
            stop_mel_token = self.autoregressive.stop_mel_token
            calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
            
@@ -553,16 +561,31 @@ class TextToSpeech:
            if not self.minor_optimizations:
                self.autoregressive = self.autoregressive.to(self.device)

+            if get_device_name() == "dml":
+                text_tokens = text_tokens.cpu()
+                best_results = best_results.cpu()
+                auto_conditioning = auto_conditioning.cpu()
+                self.autoregressive = self.autoregressive.cpu()
+
            best_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
                                               torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results,
                                               torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
                                               return_latent=True, clip_inputs=False)
            
+            if get_device_name() == "dml":
+                self.autoregressive = self.autoregressive.to(self.device)
+                best_results = best_results.to(self.device)
+                best_latents = best_latents.to(self.device)
+            
            if not self.minor_optimizations:
                self.autoregressive = self.autoregressive.cpu()
                self.diffusion = self.diffusion.to(self.device)
                self.vocoder = self.vocoder.to(self.device)
            
+            if get_device_name() == "dml":
+                self.vocoder = self.vocoder.cpu()
+            
+            del text_tokens
            del auto_conditioning

            wav_candidates = []
@@ -584,6 +607,7 @@ class TextToSpeech:
                mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, diffusion_conditioning,
                                               temperature=diffusion_temperature, verbose=verbose, progress=progress, desc="Transforming autoregressive outputs into audio..", sampler=diffusion_sampler,
                                               input_sample_rate=self.input_sample_rate, output_sample_rate=self.output_sample_rate)
+
                wav = self.vocoder.inference(mel)
                wav_candidates.append(wav.cpu())
            
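A note on the batch arithmetic in the hunk at line 477: `num_batches` comes from floor division, so it is zero whenever fewer samples are requested than the batch size, and any remainder is dropped otherwise. The sketch below is not what this commit does. It is one possible way to plan batch sizes so that no request ends up with zero batches, and `plan_batches` is a hypothetical helper name.

```python
from typing import List


def plan_batches(num_samples: int, batch_size: int) -> List[int]:
    """Return the size of each batch needed to draw num_samples samples.

    Hypothetical helper: full batches first, then one smaller batch
    for whatever is left over.
    """
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    full_batches, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)
    return sizes


# Plain floor division, as in the hunk above: 2 samples at batch size 16.
print(2 // 16)               # 0
print(plan_batches(2, 16))   # [2]
print(plan_batches(40, 16))  # [16, 16, 8]
```

Whether a short final batch is acceptable depends on how the autoregressive loop consumes the batch size, which this sketch does not model.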
--- a/tortoise/models/diffusion_decoder.py
+++ b/tortoise/models/diffusion_decoder.py
@@ -8,7 +8,7 @@ import torch.nn.functional as F
 from torch import autocast

 from tortoise.models.arch_util import normalization, AttentionBlock
-
+from tortoise.utils.device import get_device_name

 def is_latent(t):
    return t.dtype == torch.float
@@ -141,7 +141,7 @@ class DiffusionTts(nn.Module):
            in_tokens=8193,
            out_channels=200,  # mean and variance
            dropout=0,
-            use_fp16=True,
+            use_fp16=False,
            num_heads=16,
            # Parameters for regularization.
            layer_drop=.1,
@@ -302,7 +302,8 @@ class DiffusionTts(nn.Module):
                unused_params.extend(list(lyr.parameters()))
            else:
                # First and last blocks will have autocast disabled for improved precision.
-                with autocast(x.device.type, enabled=self.enable_fp16 and i != 0):
+                # x.device.type
+                with autocast(device_type='cuda', enabled=self.enable_fp16 and i != 0):
                    x = lyr(x, time_emb)

        x = x.float()
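The last hunk above hardcodes `device_type='cuda'` instead of reading `x.device.type`, presumably because DirectML tensors report a device type that `torch.autocast` does not accept. A less hardcoded variant could choose the arguments in a small helper, sketched below in plain Python so it runs without torch. The helper name `autocast_settings`, the `AUTOCAST_DEVICE_TYPES` tuple, and the `"privateuseone"` device-type string are assumptions for illustration, not part of this commit.

```python
from typing import Sequence, Tuple

# Assumption: device types the installed torch.autocast accepts.
AUTOCAST_DEVICE_TYPES = ("cuda", "cpu")


def autocast_settings(
    device_type: str,
    enable_fp16: bool,
    block_index: int,
    supported: Sequence[str] = AUTOCAST_DEVICE_TYPES,
) -> Tuple[str, bool]:
    """Return (device_type, enabled) to hand to torch.autocast.

    Mirrors the `enable_fp16 and i != 0` condition from the diff, and
    keeps autocast disabled for backends it does not know about.
    """
    if device_type in supported:
        return device_type, bool(enable_fp16 and block_index != 0)
    # Unknown backend (e.g. DirectML): pass a type autocast accepts,
    # but never enable mixed precision there.
    return "cuda", False


print(autocast_settings("cuda", True, 2))           # ('cuda', True)
print(autocast_settings("cuda", True, 0))           # ('cuda', False)
print(autocast_settings("privateuseone", True, 3))  # ('cuda', False)
```

In the model this would be used roughly as `device_type, enabled = autocast_settings(x.device.type, self.enable_fp16, i)` followed by `with autocast(device_type=device_type, enabled=enabled):`. That matches the `api.py` change, which already forces `half_p = False` on DirectML.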
--- a/tortoise/models/vocoder.py
+++ b/tortoise/models/vocoder.py
--- a/tortoise/utils/device.py
+++ b/tortoise/utils/device.py
@@ -1,37 +1,9 @@
 import torch

 def has_dml():
-    """
-    # huggingface's transformer/GPT2 model will just lead to a long track of problems
-    # I will suck off a wizard if he gets this remedied somehow
-    """
-    """
-    # Note 1:
-    # self.inference_model.generate will lead to this error in torch.LongTensor.new:
-    #   RuntimeError: new(): expected key in DispatchKeySet(CPU, CUDA, HIP, XLA, MPS, IPU, XPU, HPU, Lazy, Meta) but got: PrivateUse1
-    # Patching "./venv/lib/site-packages/transformers/generation_utils.py:1906" with:
-    #   unfinished_sequences = input_ids.new_tensor(input_ids.shape[0], device=input_ids.device).fill_(1)
-    # "fixes" it, but meets another error/crash about an unimplemented functions.........
-    """
-    """
-    # Note 2:
-    # torch.load() will gripe about something CUDA not existing
-    # remedy this with passing map_location="cpu"
-    """
-    """
-    # Note 3:
-    # stft requires device='cpu' or it'll crash about some error about an unimplemented function I do not remember
-    """
-    """
-    # Note 4:
-    # 'Tensor.multinominal' and 'Tensor.repeat_interleave' throws errors about being unimplemented and falls back to CPU and crashes
-    """
-    return False
-    """
    import importlib
    loader = importlib.find_loader('torch_directml')
    return loader is not None
-    """

 def get_device_name():
    name = 'cpu'
@@ -69,3 +41,22 @@ def get_device_batch_size():
        elif availableGb > 7:
            return 4
    return 1
+
+if has_dml():
+    _cumsum = torch.cumsum
+    _repeat_interleave = torch.repeat_interleave
+    _multinomial = torch.multinomial
+    
+    _Tensor_new = torch.Tensor.new
+    _Tensor_cumsum = torch.Tensor.cumsum
+    _Tensor_repeat_interleave = torch.Tensor.repeat_interleave
+    _Tensor_multinomial = torch.Tensor.multinomial
+
+    torch.cumsum = lambda input, *args, **kwargs: ( _cumsum(input.to("cpu"), *args, **kwargs).to(input.device) )
+    torch.repeat_interleave = lambda input, *args, **kwargs: ( _repeat_interleave(input.to("cpu"), *args, **kwargs).to(input.device) )
+    torch.multinomial = lambda input, *args, **kwargs: ( _multinomial(input.to("cpu"), *args, **kwargs).to(input.device) )
+    
+    torch.Tensor.new = lambda self, *args, **kwargs: ( _Tensor_new(self.to("cpu"), *args, **kwargs).to(self.device) )
+    torch.Tensor.cumsum = lambda self, *args, **kwargs: ( _Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device) )
+    torch.Tensor.repeat_interleave = lambda self, *args, **kwargs: ( _Tensor_repeat_interleave(self.to("cpu"), *args, **kwargs).to(self.device) )
+    torch.Tensor.multinomial = lambda self, *args, **kwargs: ( _Tensor_multinomial(self.to("cpu"), *args, **kwargs).to(self.device) )
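The seven lambdas above all repeat one pattern: move the input to the CPU, run the original op, then move the result back to the input's device. A minimal sketch of that pattern as a single reusable decorator follows. `cpu_fallback`, `has_module`, `FakeTensor`, and `fake_cumsum` are hypothetical names; the stand-in tensor class exists only so the example runs without torch installed. `has_module` uses `importlib.util.find_spec` because `importlib.find_loader`, which `has_dml()` calls, is deprecated and was removed in Python 3.12.

```python
import functools
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def cpu_fallback(fn):
    """Wrap fn so its first argument is moved to the CPU for the call and
    the result is moved back to the argument's original device."""
    @functools.wraps(fn)
    def wrapper(tensor, *args, **kwargs):
        original_device = tensor.device
        result = fn(tensor.to("cpu"), *args, **kwargs)
        return result.to(original_device)
    return wrapper


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor (only .device and .to())."""

    def __init__(self, values, device="cpu"):
        self.values = list(values)
        self.device = device

    def to(self, device):
        return FakeTensor(self.values, device)


def fake_cumsum(tensor):
    """Pretend this op is only implemented on the CPU."""
    if tensor.device != "cpu":
        raise RuntimeError("not implemented on " + tensor.device)
    total, out = 0, []
    for value in tensor.values:
        total += value
        out.append(total)
    return FakeTensor(out, tensor.device)


safe_cumsum = cpu_fallback(fake_cumsum)
result = safe_cumsum(FakeTensor([1, 2, 3], device="dml"))
print(result.values, result.device)  # [1, 3, 6] dml
```

With real torch, the equivalent of the first lambda would be something like `torch.cumsum = cpu_fallback(torch.cumsum)`, guarded by `has_module("torch_directml")`. As in the diff, every wrapped call pays for two device copies, so this is a compatibility patch and not an optimization.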