Compare commits
No commits in common. "main" and "main" have entirely different histories.
@@ -1,7 +1,5 @@
# (QoL improvements for) TorToiSe

This repo is for my modifications to [neonbjb/tortoise-tts](https://github.com/neonbjb/tortoise-tts). If you need the original README, refer to the original repo.
This repo is for my modifications to [neonbjb/tortoise-tts](https://github.com/neonbjb/tortoise-tts).

> w-where'd everything go?

Please migrate to [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning), as that repo is the more cohesive package for voice cloning.
For the original repo, please go to [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning).

283
README_OLD.md
Executable file
@@ -0,0 +1,283 @@
# TorToiSe

Tortoise is a text-to-speech program built with the following priorities:

1. Strong multi-voice capabilities.
2. Highly realistic prosody and intonation.

This repo contains all the code needed to run Tortoise TTS in inference mode.

A (*very*) rough draft of the Tortoise paper is now available in doc format. I would definitely appreciate any comments, suggestions or reviews:
https://docs.google.com/document/d/13O_eyY65i6AkNrN_LdPhpUjGhyTNKYHvDrIvHnHe1GA

### Version history

#### v2.4; 2022/5/17
- Removed CVVP model. Found that it does not, in fact, make an appreciable difference in the output.
- Added better debugging support; existing tools now spit out debug files which can be used to reproduce bad runs.

#### v2.3; 2022/5/12
- New CLVP-large model for further improved decoding guidance.
- Improvements to read.py and do_tts.py (new options)

#### v2.2; 2022/5/5
- Added several new voices from the training set.
- Automated redaction. Wrap the text you want to use to prompt the model, but not have spoken, in brackets.
- Bug fixes

#### v2.1; 2022/5/2
- Added ability to produce totally random voices.
- Added ability to download voice conditioning latent via a script, and then use a user-provided conditioning latent.
- Added ability to use your own pretrained models.
- Refactored directory structures.
- Performance improvements & bug fixes.

## What's in a name?

I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model
is insanely slow. It leverages both an autoregressive decoder **and** a diffusion decoder, both known for their low
sampling rates. On a K80, expect to generate a medium-sized sentence every 2 minutes.

## Demos

See [this page](http://nonint.com/static/tortoise_v2_examples.html) for a large list of example outputs.

Cool application of Tortoise+GPT-3 (not by me): https://twitter.com/lexman_ai

## Usage guide

### Colab

Colab is the easiest way to try this out. I've put together a notebook you can use here:
https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing

### Local Installation

If you want to use this on your own computer, you must have an NVIDIA GPU.

First, install pytorch using these instructions: [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/).
On Windows, I **highly** recommend using the Conda installation path. I have been told that if you do not do this, you
will spend a lot of time chasing dependency problems.

Next, install TorToiSe and its dependencies:

```shell
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python -m pip install -r ./requirements.txt
python setup.py install
```

If you are on Windows, you will also need to install pysoundfile: `conda install -c conda-forge pysoundfile`

### do_tts.py

This script allows you to speak a single phrase with one or more voices.
```shell
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
```

### read.py

This script provides tools for reading large amounts of text.

```shell
python tortoise/read.py --textfile <your text to be read> --voice random
```

This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series
of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and
output that as well.

Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running `read.py` with the --regenerate
argument.

### API

Tortoise can be used programmatically, like so:

```python
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```
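
The call returns PCM audio as a torch tensor at 24 kHz. A minimal sketch for writing it to disk, assuming `torchaudio` is available (it is already a dependency):

```python
import torchaudio

# pcm_audio is shaped (1, 1, num_samples); Tortoise generates 24 kHz audio.
torchaudio.save('generated.wav', pcm_audio.squeeze(0).cpu(), 24000)
```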

## Voice customization guide

Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.

These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.

### Random voice

I've included a feature which randomly generates a voice. These voices don't actually exist and will be random every time you run
it. The results are quite fascinating and I recommend you play around with it!

You can use the random voice by passing in 'random' as the voice name. Tortoise will take care of the rest.

For those in the ML space: this is created by projecting a random vector onto the voice conditioning latent space.

### Provided voices

This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform
far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see
what Tortoise can do for zero-shot mimicking, take a look at the others.

### Adding a new voice

To add new voices to Tortoise, you will need to do the following:

1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
3. Save the clips as WAV files with floating point format and a 22,050 Hz sample rate (see the sketch after this list).
4. Create a subdirectory in voices/
5. Put your clips in that subdirectory.
6. Run tortoise utilities with --voice=<your_subdirectory_name>.
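
As a rough sketch of step 3 (hypothetical filenames; assumes `torchaudio` is installed), converting an arbitrary clip into a mono, floating-point, 22,050 Hz WAV:

```python
import torchaudio

wav, sr = torchaudio.load('raw_clip.mp3')             # any format torchaudio can decode
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 22050)  # resample to 22,050 Hz
torchaudio.save('voices/myvoice/1.wav', wav, 22050)   # float32 tensors save as float WAVs
```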

### Picking good reference clips

As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking
good clips:

1. Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
2. Avoid speeches. These generally have distortion caused by the amplification system.
3. Avoid clips from phone calls.
4. Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
5. Try to find clips that are spoken in the way you wish your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
6. The text being spoken in the clips does not matter, but diverse text does seem to perform better.

## Advanced Usage

### Generation settings

Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs
that can be turned that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using
various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've
set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with
these settings (and it's very likely that I missed something!)

These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See
```api.tts``` for a full list.
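
For illustration, a hedged sketch of overriding a few of those knobs directly (parameter names as they appear in `api.tts`; the values here are arbitrary starting points, not tuned recommendations):

```python
# Reuses `tts` and `reference_clips` from the API example above.
pcm_audio = tts.tts(
    "your text here",
    voice_samples=reference_clips,
    num_autoregressive_samples=256,  # size of the candidate pool ranked by CLVP
    temperature=0.8,                 # autoregressive sampling temperature
    top_p=0.8,                       # nucleus-sampling cutoff for the decoder
    cvvp_amount=0.0,                 # weight of CVVP vs. CLVP when ranking candidates
)
```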

### Prompt engineering

Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion
by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to
take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the
prompt "\[I am really sad,\] Please feed me." will only speak the words "Please feed me" (with a sad tonality).

### Playing with the voice latent

Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent,
then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents
are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.

This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output
what it thinks the "average" of those two voices sounds like.
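
A sketch of that averaging trick (hypothetical voice folders; `load_audio` as in the API example above):

```python
from glob import glob

from tortoise.utils.audio import load_audio

# Mixing clips from two voices steers the mean conditioning latent toward a blend.
clips = [load_audio(p, 22050)
         for p in glob('voices/voice_a/*.wav') + glob('voices/voice_b/*.wav')]
pcm_audio = tts.tts_with_preset("your text here", voice_samples=clips, preset='fast')
```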

#### Generating conditioning latents from voices

Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script
will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).

Alternatively, use `api.TextToSpeech.get_conditioning_latents()` to fetch the latents.
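
A minimal sketch of the API route (the returned tuple has the same layout as the file the script dumps):

```python
# reference_clips as in the API example above.
autoregressive_latent, diffusion_latent = tts.get_conditioning_latents(reference_clips)
```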

#### Using raw conditioning latents to generate speech

After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single
".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).

### Send me feedback!

Probabilistic models like Tortoise are best thought of as an "augmented search" - in this case, through the space of possible
utterances of a specific string of text. The impact of community involvement in perusing these spaces (such as is being done with
GPT-3 or CLIP) has really surprised me. If you find something neat that you can do with Tortoise that isn't documented here,
please report it to me! I would be glad to publish it to this page.

## Tortoise-detect

Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip
came from Tortoise.

This classifier can be run on any computer; usage is as follows:

```commandline
python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
```

This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier
as a "strong signal". Classifiers can be fooled, and it is likewise not impossible for this classifier to exhibit false
positives.
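
The same check can be done programmatically; a sketch using the API's `classify_audio_clip` helper (assumes the clip is loaded at 24 kHz, as `is_this_from_tortoise.py` does):

```python
from tortoise.api import classify_audio_clip
from tortoise.utils.audio import load_audio

clip = load_audio('suspicious.wav', 24000)
print(f'Probability this came from Tortoise: {classify_audio_clip(clip):.3f}')
```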

## Model architecture

Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate
models that work together. I've assembled a write-up of the system architecture here:
[https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/)

## Training

These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of
~50k hours of speech data, most of which was transcribed by [ocotillo](http://www.github.com/neonbjb/ocotillo). Training was done on my own
[DLAS](https://github.com/neonbjb/DL-Art-School) trainer.

I currently do not have plans to release the training configurations or methodology. See the next section.

## Ethical Considerations

Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began
wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system
could be misused are many. It doesn't take much creativity to think up how.

After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:

1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See `tortoise-detect` above.
5. If I, a tinkerer with a BS in computer science and a ~$15k computer, can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.

### Diversity

The diversity expressed by ML models is strongly tied to the datasets they were trained on.

Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to
balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities
or of people who speak with strong accents.

## Looking forward

Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when
training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training
of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with
exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.

I want to mention here
that I think Tortoise could be a **lot** better. The three major components of Tortoise are either vanilla Transformer Encoder stacks
or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason
to believe that the same is not true of TTS.

The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller than the original DALLE transformer.
Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve.

If you are an ethical organization with computational resources to spare, interested in seeing what this model could do
if properly scaled out, please reach out to me! I would love to collaborate on this.

## Acknowledgements

This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to
credit a few of the amazing folks in the community that have helped make this happen:

- Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
- [Ramesh et al](https://arxiv.org/pdf/2102.12092.pdf), who authored the DALLE paper, which is the inspiration behind Tortoise.
- [Nichol and Dhariwal](https://arxiv.org/pdf/2102.09672.pdf), who authored the (revision of the) code that drives the diffusion model.
- [Jang et al](https://arxiv.org/pdf/2106.07889.pdf), who developed and open-sourced univnet, the vocoder this repo uses.
- [Kim and Jung](https://github.com/mindslab-ai/univnet), who implemented the univnet pytorch model.
- [lucidrains](https://github.com/lucidrains), who writes awesome open source pytorch models, many of which are used here.
- [Patrick von Platen](https://huggingface.co/patrickvonplaten), whose guides on setting up wav2vec were invaluable to building my dataset.

## Notice

Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.

If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.
5
list_devices.py
Executable file
@@ -0,0 +1,5 @@
import torch

devices = [f"cuda:{i} => {torch.cuda.get_device_name(i)}" for i in range(torch.cuda.device_count())]

print(devices)
34
main.py
Executable file
@@ -0,0 +1,34 @@
import os
import webui as mrq

if 'TORTOISE_MODELS_DIR' not in os.environ:
    os.environ['TORTOISE_MODELS_DIR'] = os.path.realpath(os.path.join(os.getcwd(), './models/tortoise/'))

if 'TRANSFORMERS_CACHE' not in os.environ:
    os.environ['TRANSFORMERS_CACHE'] = os.path.realpath(os.path.join(os.getcwd(), './models/transformers/'))

if __name__ == "__main__":
    mrq.args = mrq.setup_args()

    if mrq.args.listen_path is not None and mrq.args.listen_path != "/":
        import uvicorn
        uvicorn.run("main:app", host=mrq.args.listen_host, port=mrq.args.listen_port if mrq.args.listen_port is not None else 8000)
    else:
        mrq.webui = mrq.setup_gradio()
        mrq.webui.launch(share=mrq.args.share, prevent_thread_lock=True, server_name=mrq.args.listen_host, server_port=mrq.args.listen_port)
        mrq.tts = mrq.setup_tortoise()

        mrq.webui.block_thread()
elif __name__ == "main":
    from fastapi import FastAPI
    import gradio as gr

    import sys
    sys.argv = [sys.argv[0]]

    app = FastAPI()
    mrq.args = mrq.setup_args()
    mrq.webui = mrq.setup_gradio()
    app = gr.mount_gradio_app(app, mrq.webui, path=mrq.args.listen_path)

    mrq.tts = mrq.setup_tortoise()
@@ -7,9 +7,9 @@ progressbar
einops
unidecode
scipy
librosa==0.8.1
librosa
torchaudio
threadpoolctl
appdirs
numpy<=1.23.5
numpy
numba
8
setup-cuda.bat
Executable file
@@ -0,0 +1,8 @@
python -m venv tortoise-venv
call .\tortoise-venv\Scripts\activate.bat
python -m pip install --upgrade pip
python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
python -m pip install -r ./requirements.txt
python -m pip install -r ./requirements_legacy.txt
deactivate
pause
8
setup-cuda.sh
Executable file
@@ -0,0 +1,8 @@
python -m venv tortoise-venv
source ./tortoise-venv/bin/activate
python -m pip install --upgrade pip
# CUDA
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
python -m pip install -r ./requirements.txt
python -m pip install -r ./requirements_legacy.txt
deactivate
8
setup-directml.bat
Executable file
@@ -0,0 +1,8 @@
python -m venv tortoise-venv
call .\tortoise-venv\Scripts\activate.bat
python -m pip install --upgrade pip
python -m pip install torch torchvision torchaudio torch-directml
python -m pip install -r ./requirements.txt
python -m pip install -r ./requirements_legacy.txt
deactivate
pause
8
setup-rocm.sh
Executable file
@@ -0,0 +1,8 @@
python -m venv tortoise-venv
source ./tortoise-venv/bin/activate
python -m pip install --upgrade pip
# ROCM
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1 # 5.2 does not work for me desu
python -m pip install -r ./requirements.txt
python -m pip install -r ./requirements_legacy.txt
deactivate
8
setup.py
Normal file → Executable file
@@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf-8") as fh:
setuptools.setup(
    name="TorToiSe",
    packages=setuptools.find_packages(),
    version="2.4.5",
    version="2.4.3",
    author="James Betker",
    author_email="james@adamant.ai",
    description="A high quality multi-voice text-to-speech library",
@@ -29,12 +29,6 @@ setuptools.setup(
        'librosa',
        'transformers',
        'tokenizers',
        'transformers==4.19',
        'torchaudio',
        'threadpoolctl',
        'appdirs',
        'numpy',
        'numba',
    ],
    classifiers=[
        "Programming Language :: Python :: 3",
4
start.bat
Executable file
@@ -0,0 +1,4 @@
call .\tortoise-venv\Scripts\activate.bat
python main.py
deactivate
pause
3
start.sh
Executable file
@@ -0,0 +1,3 @@
source ./tortoise-venv/bin/activate
python3 ./main.py
deactivate
512
tortoise/api.py
@@ -5,7 +5,6 @@ import gc

from time import time
from urllib import request
from urllib.request import ProxyHandler, build_opener, install_opener

import torch
import torch.nn.functional as F
@@ -22,14 +21,12 @@ from tortoise.models.clvp import CLVP
from tortoise.models.cvvp import CVVP
from tortoise.models.random_latent_generator import RandomLatentConverter
from tortoise.models.vocoder import UnivNetGenerator
from tortoise.models.bigvgan import BigVGAN

from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel
from tortoise.utils.diffusion import SpacedDiffusion, space_timesteps, get_named_beta_schedule
from tortoise.utils.tokenizer import VoiceBpeTokenizer
from tortoise.utils.wav2vec_alignment import Wav2VecAlignment

from tortoise.utils.device import get_device, get_device_name, get_device_batch_size, print_stats, do_gc
from tortoise.utils.device import get_device, get_device_name, get_device_batch_size

pbar = None
STOP_SIGNAL = False
@@ -43,46 +40,21 @@ MODELS = {
    'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth',
    'rlg_auto.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth',
    'rlg_diffuser.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth',

    'bigvgan_base_24khz_100band.pth': 'https://huggingface.co/ecker/tortoise-tts-models/resolve/main/models/bigvgan_base_24khz_100band.pth',
    'bigvgan_24khz_100band.pth': 'https://huggingface.co/ecker/tortoise-tts-models/resolve/main/models/bigvgan_24khz_100band.pth',

    'bigvgan_base_24khz_100band.json': 'https://huggingface.co/ecker/tortoise-tts-models/resolve/main/models/bigvgan_base_24khz_100band.json',
    'bigvgan_24khz_100band.json': 'https://huggingface.co/ecker/tortoise-tts-models/resolve/main/models/bigvgan_24khz_100band.json',
}

def hash_file(path, algo="md5", buffer_size=0):
    import hashlib

    hash = None
    if algo == "md5":
        hash = hashlib.md5()
    elif algo == "sha1":
        hash = hashlib.sha1()
    else:
        raise Exception(f'Unknown hash algorithm specified: {algo}')

    if not os.path.exists(path):
        raise Exception(f'Path not found: {path}')

    with open(path, 'rb') as f:
        if buffer_size > 0:
            while True:
                data = f.read(buffer_size)
                if not data:
                    break
                hash.update(data)
        else:
            hash.update(f.read())

    return "{0}".format(hash.hexdigest())

def check_for_kill_signal():
def tqdm_override(arr, verbose=False, progress=None, desc=None):
    global STOP_SIGNAL
    if STOP_SIGNAL:
        STOP_SIGNAL = False
        raise Exception("Kill signal detected")

    if verbose and desc is not None:
        print(desc)

    if progress is None:
        return tqdm(arr, disable=not verbose)
    return progress.tqdm(arr, desc=f'{progress.msg_prefix} {desc}' if hasattr(progress, 'msg_prefix') else desc, track_tqdm=True)

def download_models(specific_models=None):
    """
    Call to download all the models that Tortoise uses.
@@ -109,11 +81,6 @@ def download_models(specific_models=None):
        if os.path.exists(model_path):
            continue
        print(f'Downloading {model_name} from {url}...')

        proxy = ProxyHandler({})
        opener = build_opener(proxy)
        opener.addheaders = [('User-Agent','mrq/AI-Voice-Cloning')]
        install_opener(opener)
        request.urlretrieve(url, model_path, show_progress)
        print('Done.')
@@ -150,7 +117,7 @@ def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusi
    model_var_type='learned_range', loss_type='mse', betas=get_named_beta_schedule('linear', trained_diffusion_steps),
    conditioning_free=cond_free, conditioning_free_k=cond_free_k)

@torch.inference_mode()
def format_conditioning(clip, cond_length=132300, device='cuda', sampling_rate=22050):
    """
    Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models.
@@ -162,8 +129,8 @@ def format_conditioning(clip, cond_length=132300, device='cuda', sampling_rate=2
        rand_start = random.randint(0, gap)
        clip = clip[:, rand_start:rand_start + cond_length]
    mel_clip = TorchMelSpectrogram(sampling_rate=sampling_rate)(clip.unsqueeze(0)).squeeze(0)
    mel_clip = mel_clip.unsqueeze(0)
    return migrate_to_device(mel_clip, device)
    return mel_clip.unsqueeze(0).to(device)


def fix_autoregressive_output(codes, stop_token, complain=True):
    """
@@ -194,8 +161,8 @@ def fix_autoregressive_output(codes, stop_token, complain=True):

    return codes

@torch.inference_mode()
def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True, desc=None, sampler="P", input_sample_rate=22050, output_sample_rate=24000):

def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True, progress=None, desc=None, sampler="P", input_sample_rate=22050, output_sample_rate=24000):
    """
    Uses the specified diffusion model to convert discrete codes into a spectrogram.
    """
@@ -208,7 +175,8 @@ def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_la

    diffuser.sampler = sampler.lower()
    mel = diffuser.sample_loop(diffusion_model, output_shape, noise=noise,
        model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings}, desc=desc)
        model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
        verbose=verbose, progress=progress, desc=desc)

    mel = denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
    if get_device_name() == "dml":
@@ -230,37 +198,12 @@ def classify_audio_clip(clip):
    results = F.softmax(classifier(clip), dim=-1)
    return results[0][0]

def migrate_to_device( t, device ):
    if t is None:
        return t

    if not hasattr(t, 'device'):
        t.device = device
        t.manually_track_device = True
    elif t.device == device:
        return t

    if hasattr(t, 'manually_track_device') and t.manually_track_device:
        t.device = device

    t = t.to(device)

    do_gc()

    return t

class TextToSpeech:
    """
    Main entry point into Tortoise.
    """

    def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, device=None,
        minor_optimizations=True,
        unsqueeze_sample_batches=False,
        input_sample_rate=22050, output_sample_rate=24000,
        autoregressive_model_path=None, diffusion_model_path=None, vocoder_model=None, tokenizer_json=None,
        # ):
        use_deepspeed=False): # Add use_deepspeed parameter
    def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, device=None, minor_optimizations=True, input_sample_rate=22050, output_sample_rate=24000):
        """
        Constructor
        :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing
@@ -272,17 +215,13 @@ class TextToSpeech:
        Default is true.
        :param device: Device to use when running the model. If omitted, the device will be automatically chosen.
        """
        self.loading = True
        if device is None:
            device = get_device(verbose=True)

        self.version = [2,4,4] # to-do, autograb this from setup.py, or have setup.py autograb this
        self.input_sample_rate = input_sample_rate
        self.output_sample_rate = output_sample_rate
        self.minor_optimizations = minor_optimizations
        self.unsqueeze_sample_batches = unsqueeze_sample_batches
        self.use_deepspeed = use_deepspeed # Store use_deepspeed as an instance variable
        print(f'use_deepspeed api_debug {use_deepspeed}')

        # for clarity, it's simpler to split these up and just predicate them on requesting VRAM-consuming optimizations
        self.preloaded_tensors = minor_optimizations
        self.use_kv_cache = minor_optimizations
@@ -297,23 +236,24 @@ class TextToSpeech:
        if self.enable_redaction:
            self.aligner = Wav2VecAlignment(device='cpu' if get_device_name() == "dml" else self.device)

        self.load_tokenizer_json(tokenizer_json)
        self.tokenizer = VoiceBpeTokenizer()

        if os.path.exists(f'{models_dir}/autoregressive.ptt'):
            # Assume this is a traced directory.
            self.autoregressive = torch.jit.load(f'{models_dir}/autoregressive.ptt')
        else:
            if not autoregressive_model_path or not os.path.exists(autoregressive_model_path):
                autoregressive_model_path = get_model_path('autoregressive.pth', models_dir)

            self.load_autoregressive_model(autoregressive_model_path)

        if os.path.exists(f'{models_dir}/diffusion_decoder.ptt'):
            self.diffusion = torch.jit.load(f'{models_dir}/diffusion_decoder.ptt')
        else:
            if not diffusion_model_path or not os.path.exists(diffusion_model_path):
                diffusion_model_path = get_model_path('diffusion_decoder.pth', models_dir)
            self.autoregressive = UnifiedVoice(max_mel_tokens=604, max_text_tokens=402, max_conditioning_inputs=2, layers=30,
                model_dim=1024,
                heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False,
                train_solo_embeddings=False).cpu().eval()
            self.autoregressive.load_state_dict(torch.load(get_model_path('autoregressive.pth', models_dir)))
            self.autoregressive.post_init_gpt2_config(kv_cache=self.use_kv_cache)

            self.load_diffusion_model(diffusion_model_path)
            self.diffusion = DiffusionTts(model_channels=1024, num_layers=10, in_channels=100, out_channels=200,
                in_latent_channels=1024, in_tokens=8193, dropout=0, use_fp16=False, num_heads=16,
                layer_drop=0, unconditioned_percentage=0).cpu().eval()
            self.diffusion.load_state_dict(torch.load(get_model_path('diffusion_decoder.pth', models_dir)))

        self.clvp = CLVP(dim_text=768, dim_speech=768, dim_latent=768, num_text_tokens=256, text_enc_depth=20,
@@ -323,168 +263,19 @@ class TextToSpeech:
        self.clvp.load_state_dict(torch.load(get_model_path('clvp2.pth', models_dir)))
        self.cvvp = None # CVVP model is only loaded if used.

        self.vocoder_model = vocoder_model
        self.load_vocoder_model(self.vocoder_model)
        self.vocoder = UnivNetGenerator().cpu()
        self.vocoder.load_state_dict(torch.load(get_model_path('vocoder.pth', models_dir), map_location=torch.device('cpu'))['model_g'])
        self.vocoder.eval(inference=True)

        # Random latent generators (RLGs) are loaded lazily.
        self.rlg_auto = None
        self.rlg_diffusion = None

        if self.preloaded_tensors:
            self.autoregressive = migrate_to_device( self.autoregressive, self.device )
            self.diffusion = migrate_to_device( self.diffusion, self.device )
            self.clvp = migrate_to_device( self.clvp, self.device )
            self.vocoder = migrate_to_device( self.vocoder, self.device )

        self.loading = False

    def load_autoregressive_model(self, autoregressive_model_path, is_xtts=False):
        if hasattr(self,"autoregressive_model_path") and os.path.samefile(self.autoregressive_model_path, autoregressive_model_path):
            return

        self.autoregressive_model_path = autoregressive_model_path if autoregressive_model_path and os.path.exists(autoregressive_model_path) else get_model_path('autoregressive.pth', self.models_dir)
        new_hash = hash_file(self.autoregressive_model_path)

        if hasattr(self,"autoregressive_model_hash") and self.autoregressive_model_hash == new_hash:
            return

        self.autoregressive_model_hash = new_hash

        self.loading = True
        print(f"Loading autoregressive model: {self.autoregressive_model_path}")

        if hasattr(self, 'autoregressive'):
            del self.autoregressive

        # XTTS requires a different "dimensionality" for its autoregressive model
        if new_hash == "e4ce21eae0043f7691d6a6c8540b74b8" or is_xtts:
            dimensionality = {
                "max_mel_tokens": 605,
                "max_text_tokens": 402,
                "max_prompt_tokens": 70,
                "max_conditioning_inputs": 1,
                "layers": 30,
                "model_dim": 1024,
                "heads": 16,
                "number_text_tokens": 5023, # -1
                "start_text_token": 261,
                "stop_text_token": 0,
                "number_mel_codes": 8194,
                "start_mel_token": 8192,
                "stop_mel_token": 8193,
            }
        else:
            dimensionality = {
                "max_mel_tokens": 604,
                "max_text_tokens": 402,
                "max_conditioning_inputs": 2,
                "layers": 30,
                "model_dim": 1024,
                "heads": 16,
                "number_text_tokens": 255,
                "start_text_token": 255,
                "checkpointing": False,
                "train_solo_embeddings": False
            }

        self.autoregressive = UnifiedVoice(**dimensionality).cpu().eval()
        self.autoregressive.load_state_dict(torch.load(self.autoregressive_model_path))
        self.autoregressive.post_init_gpt2_config(use_deepspeed=self.use_deepspeed, kv_cache=self.use_kv_cache)
        if self.preloaded_tensors:
            self.autoregressive = migrate_to_device( self.autoregressive, self.device )

        self.loading = False
        print(f"Loaded autoregressive model")

    def load_diffusion_model(self, diffusion_model_path):
        if hasattr(self,"diffusion_model_path") and os.path.samefile(self.diffusion_model_path, diffusion_model_path):
            return

        self.loading = True

        self.diffusion_model_path = diffusion_model_path if diffusion_model_path and os.path.exists(diffusion_model_path) else get_model_path('diffusion_decoder.pth', self.models_dir)
        self.diffusion_model_hash = hash_file(self.diffusion_model_path)

        if hasattr(self, 'diffusion'):
            del self.diffusion

        # XTTS does not require a different "dimensionality" for its diffusion model
        dimensionality = {
            "model_channels": 1024,
            "num_layers": 10,
            "in_channels": 100,
            "out_channels": 200,
            "in_latent_channels": 1024,
            "in_tokens": 8193,
            "dropout": 0,
            "use_fp16": False,
            "num_heads": 16,
            "layer_drop": 0,
            "unconditioned_percentage": 0
        }
        self.diffusion = DiffusionTts(**dimensionality)
        self.diffusion.load_state_dict(torch.load(get_model_path('diffusion_decoder.pth', self.models_dir)))
        if self.preloaded_tensors:
            self.diffusion = migrate_to_device( self.diffusion, self.device )

        self.loading = False
        print(f"Loaded diffusion model")

    def load_vocoder_model(self, vocoder_model):
        if hasattr(self,"vocoder_model_path") and os.path.samefile(self.vocoder_model_path, vocoder_model):
            return

        self.loading = True

        if hasattr(self, 'vocoder'):
            del self.vocoder

        print("Loading vocoder model:", vocoder_model)
        if vocoder_model is None:
            vocoder_model = 'bigvgan_24khz_100band'

        if 'bigvgan' in vocoder_model:
            # credit to https://github.com/deviandice / https://git.ecker.tech/mrq/ai-voice-cloning/issues/52
            vocoder_key = 'generator'
            self.vocoder_model_path = 'bigvgan_24khz_100band.pth'
            if f'{vocoder_model}.pth' in MODELS:
                self.vocoder_model_path = f'{vocoder_model}.pth'
            vocoder_config = 'bigvgan_24khz_100band.json'
            if f'{vocoder_model}.json' in MODELS:
                vocoder_config = f'{vocoder_model}.json'
            vocoder_config = get_model_path(vocoder_config, self.models_dir)

            self.vocoder = BigVGAN(config=vocoder_config).cpu()
        #elif vocoder_model == "univnet":
        else:
            vocoder_key = 'model_g'
            self.vocoder_model_path = 'vocoder.pth'
            self.vocoder = UnivNetGenerator().cpu()

        print(f"Loading vocoder model: {self.vocoder_model_path}")
        self.vocoder.load_state_dict(torch.load(get_model_path(self.vocoder_model_path, self.models_dir), map_location=torch.device('cpu'))[vocoder_key])

        self.vocoder.eval(inference=True)
        if self.preloaded_tensors:
            self.vocoder = migrate_to_device( self.vocoder, self.device )
        self.loading = False
        print(f"Loaded vocoder model")

    def load_tokenizer_json(self, tokenizer_json):
        if hasattr(self,"tokenizer_json") and os.path.samefile(self.tokenizer_json, tokenizer_json):
            return

        self.loading = True
        self.tokenizer_json = tokenizer_json if tokenizer_json else os.path.join(os.path.dirname(os.path.realpath(__file__)), '../tortoise/data/tokenizer.json')
        print("Loading tokenizer JSON:", self.tokenizer_json)

        if hasattr(self, 'tokenizer'):
            del self.tokenizer

        self.tokenizer = VoiceBpeTokenizer(vocab_file=self.tokenizer_json)

        self.loading = False
        print(f"Loaded tokenizer")
        self.autoregressive = self.autoregressive.to(self.device)
        self.diffusion = self.diffusion.to(self.device)
        self.clvp = self.clvp.to(self.device)
        self.vocoder = self.vocoder.to(self.device)

    def load_cvvp(self):
        """Load CVVP model."""
@@ -493,17 +284,15 @@ class TextToSpeech:
        self.cvvp.load_state_dict(torch.load(get_model_path('cvvp.pth', self.models_dir)))

        if self.preloaded_tensors:
            self.cvvp = migrate_to_device( self.cvvp, self.device )
        self.cvvp = self.cvvp.to(self.device)

    @torch.inference_mode()
    def get_conditioning_latents(self, voice_samples, return_mels=False, verbose=False, slices=1, max_chunk_size=None, force_cpu=False, original_ar=False, original_diffusion=False):
    def get_conditioning_latents(self, voice_samples, return_mels=False, verbose=False, progress=None, slices=1, max_chunk_size=None, force_cpu=False):
        """
        Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
        These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
        properties.
        :param voice_samples: List of 2 or more ~10 second reference clips, which should be torch tensors containing 22.05kHz waveform data.
        """

        with torch.no_grad():
            # computing conditional latents requires being done on the CPU if using DML because M$ still hasn't implemented some core functions
            if get_device_name() == "dml":
@@ -513,75 +302,70 @@ class TextToSpeech:
            if not isinstance(voice_samples, list):
                voice_samples = [voice_samples]

            resampler_22K = torchaudio.transforms.Resample(
            voice_samples = [v.to(device) for v in voice_samples]

            resampler = torchaudio.transforms.Resample(
                self.input_sample_rate,
                22050,
                self.output_sample_rate,
                lowpass_filter_width=16,
                rolloff=0.85,
                resampling_method="kaiser_window",
                beta=8.555504641634386,
            ).to(device)

            resampler_24K = torchaudio.transforms.Resample(
                self.input_sample_rate,
                24000,
                lowpass_filter_width=16,
                rolloff=0.85,
                resampling_method="kaiser_window",
                beta=8.555504641634386,
            ).to(device)

            voice_samples = [migrate_to_device(v, device) for v in voice_samples]
            )

            samples = []
            auto_conds = []
            diffusion_conds = []
            for sample in voice_samples:
                auto_conds.append(format_conditioning(sample, device=device, sampling_rate=self.input_sample_rate))
                samples.append(resampler(sample.cpu()).to(device)) # icky no good, easier to do the resampling on CPU than figure out how to do it on GPU

            if original_ar:
                samples = [resampler_22K(sample) for sample in voice_samples]
                for sample in tqdm(samples, desc="Computing AR conditioning latents..."):
                    auto_conds.append(format_conditioning(sample, device=device, sampling_rate=self.input_sample_rate, cond_length=132300))
                auto_conds = torch.stack(auto_conds, dim=1)

                self.autoregressive = self.autoregressive.to(device)
                auto_latent = self.autoregressive.get_conditioning(auto_conds)
                if self.preloaded_tensors:
                    self.autoregressive = self.autoregressive.to(self.device)
            else:
                samples = [resampler_22K(sample) for sample in voice_samples]
                concat = torch.cat(samples, dim=-1)
                chunk_size = concat.shape[-1]
                self.autoregressive = self.autoregressive.cpu()

                if slices == 0:
                    slices = 1
                elif max_chunk_size is not None and chunk_size > max_chunk_size:

            diffusion_conds = []
            chunks = []

            concat = torch.cat(samples, dim=-1)
            chunk_size = concat.shape[-1]

            if slices == 0:
                slices = 1
            else:
                if max_chunk_size is not None and chunk_size > max_chunk_size:
                    slices = 1
                    while int(chunk_size / slices) > max_chunk_size:
                        slices = slices + 1

                chunks = torch.chunk(concat, slices, dim=1)
                chunk_size = chunks[0].shape[-1]

                for chunk in tqdm(chunks, desc="Computing AR conditioning latents..."):
                    auto_conds.append(format_conditioning(chunk, device=device, sampling_rate=self.input_sample_rate, cond_length=chunk_size))

            if original_diffusion:
                samples = [resampler_24K(sample) for sample in voice_samples]
                for sample in tqdm(samples, desc="Computing diffusion conditioning latents..."):
                    sample = pad_or_truncate(sample, 102400)
                    cond_mel = wav_to_univnet_mel(migrate_to_device(sample, device), do_normalization=False, device=self.device)
                    diffusion_conds.append(cond_mel)
            else:
                samples = [resampler_24K(sample) for sample in voice_samples]
                for chunk in tqdm(chunks, desc="Computing diffusion conditioning latents..."):
                    check_for_kill_signal()
                    chunk = pad_or_truncate(chunk, chunk_size)
                    cond_mel = wav_to_univnet_mel(migrate_to_device( chunk, device ), do_normalization=False, device=device)
                    diffusion_conds.append(cond_mel)

            auto_conds = torch.stack(auto_conds, dim=1)
            self.autoregressive = migrate_to_device( self.autoregressive, device )
            auto_latent = self.autoregressive.get_conditioning(auto_conds)
            self.autoregressive = migrate_to_device( self.autoregressive, self.device if self.preloaded_tensors else 'cpu' )
            chunks = torch.chunk(concat, slices, dim=1)
            chunk_size = chunks[0].shape[-1]

            # expand / truncate samples to match the common size
            # required, as tensors need to be of the same length
            for chunk in tqdm_override(chunks, verbose=verbose, progress=progress, desc="Computing conditioning latents..."):
                chunk = pad_or_truncate(chunk, chunk_size)
                cond_mel = wav_to_univnet_mel(chunk.to(device), do_normalization=False, device=device)
                diffusion_conds.append(cond_mel)

            diffusion_conds = torch.stack(diffusion_conds, dim=1)
            self.diffusion = migrate_to_device( self.diffusion, device )

            self.diffusion = self.diffusion.to(device)

            diffusion_latent = self.diffusion.get_conditioning(diffusion_conds)
            self.diffusion = migrate_to_device( self.diffusion, self.device if self.preloaded_tensors else 'cpu' )

            if self.preloaded_tensors:
                self.diffusion = self.diffusion.to(self.device)
            else:
                self.diffusion = self.diffusion.cpu()

        if return_mels:
            return auto_latent, diffusion_latent, auto_conds, diffusion_conds
@ -621,15 +405,11 @@ class TextToSpeech:
|
||||
settings.update(kwargs) # allow overriding of preset settings with kwargs
|
||||
return self.tts(text, **settings)
|
||||
|
||||
@torch.inference_mode()
|
||||
def tts(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True, use_deterministic_seed=None,
|
||||
return_deterministic_state=False,
|
||||
# autoregressive generation parameters follow
|
||||
num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500,
|
||||
sample_batch_size=None,
|
||||
autoregressive_model=None,
|
||||
diffusion_model=None,
|
||||
tokenizer_json=None,
|
||||
# CVVP parameters follow
|
||||
cvvp_amount=.0,
|
||||
# diffusion generation parameters follow
|
||||
@ -637,6 +417,7 @@ class TextToSpeech:
|
||||
diffusion_sampler="P",
|
||||
breathing_room=8,
|
||||
half_p=False,
|
||||
progress=None,
|
||||
**hf_generate_kwargs):
|
||||
"""
|
||||
Produces an audio clip of the given text being spoken with the given reference voice.
|
||||
@ -691,24 +472,7 @@ class TextToSpeech:
|
||||
self.diffusion.enable_fp16 = half_p
|
||||
deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)
|
||||
|
||||
if autoregressive_model is None:
|
||||
autoregressive_model = self.autoregressive_model_path
|
||||
elif autoregressive_model != self.autoregressive_model_path:
|
||||
self.load_autoregressive_model(autoregressive_model)
|
||||
|
||||
if diffusion_model is None:
|
||||
diffusion_model = self.diffusion_model_path
|
||||
elif diffusion_model != self.diffusion_model_path:
|
||||
self.load_diffusion_model(diffusion_model)
|
||||
|
||||
if tokenizer_json is None:
|
||||
tokenizer_json = self.tokenizer_json
|
||||
elif tokenizer_json != self.tokenizer_json:
|
||||
self.load_tokenizer_json(tokenizer_json)
|
||||
|
||||
text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0)
|
||||
text_tokens = migrate_to_device( text_tokens, self.device )
|
||||
|
||||
text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device)
|
||||
text_tokens = F.pad(text_tokens, (0, 1)) # This may not be necessary.
|
||||
assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
|
||||
|
||||
@ -736,13 +500,12 @@ class TextToSpeech:
|
||||
stop_mel_token = self.autoregressive.stop_mel_token
|
||||
calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
|
||||
|
||||
self.autoregressive = migrate_to_device( self.autoregressive, self.device )
|
||||
auto_conditioning = migrate_to_device( auto_conditioning, self.device )
|
||||
text_tokens = migrate_to_device( text_tokens, self.device )
|
||||
self.autoregressive = self.autoregressive.to(self.device)
|
||||
auto_conditioning = auto_conditioning.to(self.device)
|
||||
text_tokens = text_tokens.to(self.device)
|
||||
|
||||
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=half_p):
|
||||
for b in tqdm(range(num_batches), desc="Generating autoregressive samples"):
|
||||
check_for_kill_signal()
|
||||
for b in tqdm_override(range(num_batches), verbose=verbose, progress=progress, desc="Generating autoregressive samples"):
|
||||
codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
|
||||
do_sample=True,
|
||||
top_p=top_p,
|
||||
@ -757,30 +520,24 @@ class TextToSpeech:
|
||||
samples.append(codes)
|
||||
|
||||
if not self.preloaded_tensors:
|
||||
self.autoregressive = migrate_to_device( self.autoregressive, 'cpu' )
|
||||
|
||||
if self.unsqueeze_sample_batches:
|
||||
new_samples = []
|
||||
for batch in samples:
|
||||
for i in range(batch.shape[0]):
|
||||
new_samples.append(batch[i].unsqueeze(0))
|
||||
samples = new_samples
|
||||
self.autoregressive = self.autoregressive.cpu()
|
||||
auto_conditioning = auto_conditioning.cpu()
|
||||
|
||||
clip_results = []
|
||||
|
||||
if auto_conds is not None:
|
||||
auto_conditioning = migrate_to_device( auto_conditioning, self.device )
|
||||
auto_conds = auto_conds.to(self.device)
|
||||
|
||||
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=half_p):
|
||||
if not self.preloaded_tensors:
|
||||
self.autoregressive = migrate_to_device( self.autoregressive, 'cpu' )
|
||||
self.clvp = migrate_to_device( self.clvp, self.device )
|
||||
if not self.minor_optimizations:
|
||||
self.autoregressive = self.autoregressive.cpu()
|
||||
self.clvp = self.clvp.to(self.device)
|
||||
|
||||
if cvvp_amount > 0:
|
||||
if self.cvvp is None:
|
||||
self.load_cvvp()
|
||||
|
||||
if not self.preloaded_tensors:
|
||||
self.cvvp = migrate_to_device( self.cvvp, self.device )
|
||||
if not self.minor_optimizations:
|
||||
self.cvvp = self.cvvp.to(self.device)
|
||||
|
||||
desc="Computing best candidates"
|
||||
if verbose:
|
||||
@ -789,9 +546,7 @@ class TextToSpeech:
|
||||
else:
|
||||
desc = f"Computing best candidates using CLVP {((1-cvvp_amount) * 100):2.0f}% and CVVP {(cvvp_amount * 100):2.0f}%"
|
||||
|
||||
|
||||
for batch in tqdm(samples, desc=desc):
|
||||
check_for_kill_signal()
|
||||
for batch in tqdm_override(samples, verbose=verbose, progress=progress, desc=desc):
|
||||
for i in range(batch.shape[0]):
|
||||
batch[i] = fix_autoregressive_output(batch[i], stop_mel_token)
|
||||
|
||||
@ -811,31 +566,30 @@ class TextToSpeech:
|
||||
clip_results.append(clvp)
|
||||
|
||||
if not self.preloaded_tensors and auto_conds is not None:
|
||||
auto_conds = migrate_to_device( auto_conds, 'cpu' )
|
||||
auto_conds = auto_conds.cpu()
|
||||
|
||||
clip_results = torch.cat(clip_results, dim=0)
|
||||
samples = torch.cat(samples, dim=0)
|
||||
if k < num_autoregressive_samples:
|
||||
best_results = samples[torch.topk(clip_results, k=k).indices]
|
||||
else:
|
||||
best_results = samples
best_results = samples[torch.topk(clip_results, k=k).indices]

if not self.preloaded_tensors:
    self.clvp = migrate_to_device( self.clvp, 'cpu' )
    self.cvvp = migrate_to_device( self.cvvp, 'cpu' )

if get_device_name() == "dml":
    text_tokens = migrate_to_device( text_tokens, 'cpu' )
    best_results = migrate_to_device( best_results, 'cpu' )
    auto_conditioning = migrate_to_device( auto_conditioning, 'cpu' )
    self.autoregressive = migrate_to_device( self.autoregressive, 'cpu' )
else:
    auto_conditioning = auto_conditioning.to(self.device)
    self.autoregressive = self.autoregressive.to(self.device)
self.clvp = self.clvp.cpu()
if self.cvvp is not None:
    self.cvvp = self.cvvp.cpu()

del samples

if get_device_name() == "dml":
    text_tokens = text_tokens.cpu()
    best_results = best_results.cpu()
    auto_conditioning = auto_conditioning.cpu()
    self.autoregressive = self.autoregressive.cpu()
else:
    #text_tokens = text_tokens.to(self.device)
    #best_results = best_results.to(self.device)
    auto_conditioning = auto_conditioning.to(self.device)
    self.autoregressive = self.autoregressive.to(self.device)

# The diffusion model actually wants the last hidden layer from the autoregressive model as conditioning
# inputs. Re-produce those for the top results. This could be made more efficient by storing all of these
# results, but will increase memory usage.
@ -844,19 +598,21 @@ class TextToSpeech:
    torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
    return_latent=True, clip_inputs=False)

diffusion_conditioning = migrate_to_device( diffusion_conditioning, self.device )
diffusion_conditioning = diffusion_conditioning.to(self.device)

if get_device_name() == "dml":
    self.autoregressive = migrate_to_device( self.autoregressive, self.device )
    best_results = migrate_to_device( best_results, self.device )
    best_latents = migrate_to_device( best_latents, self.device )
    self.vocoder = migrate_to_device( self.vocoder, 'cpu' )
self.autoregressive = self.autoregressive.to(self.device)
best_results = best_results.to(self.device)
best_latents = best_latents.to(self.device)

self.vocoder = self.vocoder.cpu()
else:
    if not self.preloaded_tensors:
        self.autoregressive = migrate_to_device( self.autoregressive, 'cpu' )
self.autoregressive = self.autoregressive.cpu()

self.diffusion = self.diffusion.to(self.device)
self.vocoder = self.vocoder.to(self.device)

self.diffusion = migrate_to_device( self.diffusion, self.device )
self.vocoder = migrate_to_device( self.vocoder, self.device )

del text_tokens
del auto_conditioning
@ -878,21 +634,19 @@ class TextToSpeech:
        break

    mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, diffusion_conditioning,
        temperature=diffusion_temperature, desc="Transforming autoregressive outputs into audio..", sampler=diffusion_sampler,
        temperature=diffusion_temperature, verbose=verbose, progress=progress, desc="Transforming autoregressive outputs into audio..", sampler=diffusion_sampler,
        input_sample_rate=self.input_sample_rate, output_sample_rate=self.output_sample_rate)

    wav = self.vocoder.inference(mel)
    wav_candidates.append(wav)

if not self.preloaded_tensors:
    self.diffusion = migrate_to_device( self.diffusion, 'cpu' )
    self.vocoder = migrate_to_device( self.vocoder, 'cpu' )
self.diffusion = self.diffusion.cpu()
self.vocoder = self.vocoder.cpu()

def potentially_redact(clip, text):
    if self.enable_redaction:
        t = clip.squeeze(1)
        t = migrate_to_device( t, 'cpu' if get_device_name() == "dml" else self.device)
        return self.aligner.redact(t, text, self.output_sample_rate).unsqueeze(1)
        return self.aligner.redact(clip.squeeze(1).to('cpu' if get_device_name() == "dml" else self.device), text, self.output_sample_rate).unsqueeze(1)
    return clip
wav_candidates = [potentially_redact(wav_candidate, text) for wav_candidate in wav_candidates]
@ -901,7 +655,7 @@ class TextToSpeech:
else:
    res = wav_candidates[0]

do_gc()
gc.collect()

if return_deterministic_state:
    return res, (deterministic_seed, text, voice_samples, conditioning_latents)
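Read in isolation, the first hunk above boils down to re-ranking the autoregressive candidates by their CLVP scores. A minimal sketch of that `topk` selection, with hypothetical stand-in tensors rather than the real ones:

```python
import torch

samples = torch.randn(16, 500)   # hypothetical: 16 candidate mel-token sequences
clip_results = torch.randn(16)   # hypothetical: one CLVP score per candidate
k = 4

# Keep the k highest-scoring candidates, exactly as in the hunk above.
best_results = samples[torch.topk(clip_results, k=k).indices]
print(best_results.shape)        # torch.Size([4, 500])
```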
@ -14,7 +14,6 @@ if __name__ == '__main__':
    parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
                        'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random')
    parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
    parser.add_argument('--use_deepspeed', type=bool, help='Use deepspeed for speed bump.', default=True)
    parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/')
    parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this'
                        'should only be specified if you have custom checkpoints.', default=MODELS_DIR)
@ -38,8 +37,8 @@ if __name__ == '__main__':

    os.makedirs(args.output_path, exist_ok=True)
    #print(f'use_deepspeed do_tts_debug {use_deepspeed}')
    tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed)
    tts = TextToSpeech(models_dir=args.model_dir)

    selected_voices = args.voice.split(',')
    for k, selected_voice in enumerate(selected_voices):
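One caveat worth knowing about the `--use_deepspeed` flag above: argparse's `type=bool` just calls `bool()` on the raw string, and any non-empty string is truthy, so `--use_deepspeed False` still yields `True`. A sketch of the usual workaround (not what the script itself does):

```python
import argparse

parser = argparse.ArgumentParser()
# bool('False') == True, so type=bool cannot be switched off from the CLI;
# a store_true action gives a real boolean flag instead.
parser.add_argument('--use_deepspeed', action='store_true')
print(parser.parse_args([]).use_deepspeed)                   # False
print(parser.parse_args(['--use_deepspeed']).use_deepspeed)  # True
```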
@ -1,120 +0,0 @@
# Implementation adapted from https://github.com/EdwardDixon/snake under the MIT license.
# LICENSE is in incl_licenses directory.

import torch
from torch import nn, sin, pow
from torch.nn import Parameter


class Snake(nn.Module):
    '''
    Implementation of a sine-based periodic activation function
    Shape:
        - Input: (B, C, T)
        - Output: (B, C, T), same shape as the input
    Parameters:
        - alpha - trainable parameter
    References:
        - This activation function is from this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
          https://arxiv.org/abs/2006.08195
    Examples:
        >>> a1 = snake(256)
        >>> x = torch.randn(256)
        >>> x = a1(x)
    '''
    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
        '''
        Initialization.
        INPUT:
            - in_features: shape of the input
            - alpha: trainable parameter
            alpha is initialized to 1 by default, higher values = higher-frequency.
            alpha will be trained along with the rest of your model.
        '''
        super(Snake, self).__init__()
        self.in_features = in_features

        # initialize alpha
        self.alpha_logscale = alpha_logscale
        if self.alpha_logscale:  # log scale alphas initialized to zeros
            self.alpha = Parameter(torch.zeros(in_features) * alpha)
        else:  # linear scale alphas initialized to ones
            self.alpha = Parameter(torch.ones(in_features) * alpha)

        self.alpha.requires_grad = alpha_trainable

        self.no_div_by_zero = 0.000000001

    def forward(self, x):
        '''
        Forward pass of the function.
        Applies the function to the input elementwise.
        Snake := x + 1/a * sin^2 (xa)
        '''
        alpha = self.alpha.unsqueeze(0).unsqueeze(-1)  # line up with x to [B, C, T]
        if self.alpha_logscale:
            alpha = torch.exp(alpha)
        x = x + (1.0 / (alpha + self.no_div_by_zero)) * pow(sin(x * alpha), 2)

        return x


class SnakeBeta(nn.Module):
    '''
    A modified Snake function which uses separate parameters for the magnitude of the periodic components
    Shape:
        - Input: (B, C, T)
        - Output: (B, C, T), same shape as the input
    Parameters:
        - alpha - trainable parameter that controls frequency
        - beta - trainable parameter that controls magnitude
    References:
        - This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
          https://arxiv.org/abs/2006.08195
    Examples:
        >>> a1 = snakebeta(256)
        >>> x = torch.randn(256)
        >>> x = a1(x)
    '''
    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
        '''
        Initialization.
        INPUT:
            - in_features: shape of the input
            - alpha - trainable parameter that controls frequency
            - beta - trainable parameter that controls magnitude
            alpha is initialized to 1 by default, higher values = higher-frequency.
            beta is initialized to 1 by default, higher values = higher-magnitude.
            alpha will be trained along with the rest of your model.
        '''
        super(SnakeBeta, self).__init__()
        self.in_features = in_features

        # initialize alpha
        self.alpha_logscale = alpha_logscale
        if self.alpha_logscale:  # log scale alphas initialized to zeros
            self.alpha = Parameter(torch.zeros(in_features) * alpha)
            self.beta = Parameter(torch.zeros(in_features) * alpha)
        else:  # linear scale alphas initialized to ones
            self.alpha = Parameter(torch.ones(in_features) * alpha)
            self.beta = Parameter(torch.ones(in_features) * alpha)

        self.alpha.requires_grad = alpha_trainable
        self.beta.requires_grad = alpha_trainable

        self.no_div_by_zero = 0.000000001

    def forward(self, x):
        '''
        Forward pass of the function.
        Applies the function to the input elementwise.
        SnakeBeta := x + 1/b * sin^2 (xa)
        '''
        alpha = self.alpha.unsqueeze(0).unsqueeze(-1)  # line up with x to [B, C, T]
        beta = self.beta.unsqueeze(0).unsqueeze(-1)
        if self.alpha_logscale:
            alpha = torch.exp(alpha)
            beta = torch.exp(beta)
        x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)

        return x
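For reference, a usage sketch of the two activations deleted above, assuming they are still importable as `tortoise.models.activations` (the path bigvgan.py below imports them from). Note that `forward` lines `alpha` up as `[1, C, 1]`, so the intended input shape is `[B, C, T]` rather than the 1-D tensor shown in the docstring examples:

```python
import torch
from tortoise.models.activations import Snake, SnakeBeta

x = torch.randn(4, 80, 100)                         # [B, C, T]
print(Snake(80)(x).shape)                           # torch.Size([4, 80, 100])
print(SnakeBeta(80, alpha_logscale=True)(x).shape)  # torch.Size([4, 80, 100])
```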
@ -1,6 +0,0 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

from .filter import *
from .resample import *
from .act import *
@ -1,28 +0,0 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import torch.nn as nn
from .resample import UpSample1d, DownSample1d


class Activation1d(nn.Module):
    def __init__(self,
                 activation,
                 up_ratio: int = 2,
                 down_ratio: int = 2,
                 up_kernel_size: int = 12,
                 down_kernel_size: int = 12):
        super().__init__()
        self.up_ratio = up_ratio
        self.down_ratio = down_ratio
        self.act = activation
        self.upsample = UpSample1d(up_ratio, up_kernel_size)
        self.downsample = DownSample1d(down_ratio, down_kernel_size)

    # x: [B,C,T]
    def forward(self, x):
        x = self.upsample(x)
        x = self.act(x)
        x = self.downsample(x)

        return x
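A sketch of how this wrapper is meant to be used (and how bigvgan.py below uses it): upsample 2x, apply the periodic nonlinearity, then filter and downsample 2x, so the anti-aliased output has the same length as the input. This assumes the deleted `activations` and `alias_free_torch` modules are both still importable:

```python
import torch
from tortoise.models.activations import Snake
from tortoise.models.alias_free_torch import Activation1d

act = Activation1d(activation=Snake(64))
x = torch.randn(2, 64, 128)          # [B, C, T]
print(act(x).shape)                  # torch.Size([2, 64, 128]); T is preserved
```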
@ -1,95 +0,0 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

if 'sinc' in dir(torch):
    sinc = torch.sinc
else:
    # This code is adopted from adefossez's julius.core.sinc under the MIT License
    # https://adefossez.github.io/julius/julius/core.html
    # LICENSE is in incl_licenses directory.
    def sinc(x: torch.Tensor):
        """
        Implementation of sinc, i.e. sin(pi * x) / (pi * x)
        __Warning__: Different to julius.sinc, the input is multiplied by `pi`!
        """
        return torch.where(x == 0,
                           torch.tensor(1., device=x.device, dtype=x.dtype),
                           torch.sin(math.pi * x) / math.pi / x)


# This code is adopted from adefossez's julius.lowpass.LowPassFilters under the MIT License
# https://adefossez.github.io/julius/julius/lowpass.html
# LICENSE is in incl_licenses directory.
def kaiser_sinc_filter1d(cutoff, half_width, kernel_size):  # return filter [1,1,kernel_size]
    even = (kernel_size % 2 == 0)
    half_size = kernel_size // 2

    # For kaiser window
    delta_f = 4 * half_width
    A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
    if A > 50.:
        beta = 0.1102 * (A - 8.7)
    elif A >= 21.:
        beta = 0.5842 * (A - 21)**0.4 + 0.07886 * (A - 21.)
    else:
        beta = 0.
    window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)

    # ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
    if even:
        time = (torch.arange(-half_size, half_size) + 0.5)
    else:
        time = torch.arange(kernel_size) - half_size
    if cutoff == 0:
        filter_ = torch.zeros_like(time)
    else:
        filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
        # Normalize filter to have sum = 1, otherwise we will have a small leakage
        # of the constant component in the input signal.
        filter_ /= filter_.sum()
    filter = filter_.view(1, 1, kernel_size)

    return filter


class LowPassFilter1d(nn.Module):
    def __init__(self,
                 cutoff=0.5,
                 half_width=0.6,
                 stride: int = 1,
                 padding: bool = True,
                 padding_mode: str = 'replicate',
                 kernel_size: int = 12):
        # kernel_size should be even number for stylegan3 setup,
        # in this implementation, odd number is also possible.
        super().__init__()
        if cutoff < -0.:
            raise ValueError("Minimum cutoff must be larger than zero.")
        if cutoff > 0.5:
            raise ValueError("A cutoff above 0.5 does not make sense.")
        self.kernel_size = kernel_size
        self.even = (kernel_size % 2 == 0)
        self.pad_left = kernel_size // 2 - int(self.even)
        self.pad_right = kernel_size // 2
        self.stride = stride
        self.padding = padding
        self.padding_mode = padding_mode
        filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
        self.register_buffer("filter", filter)

    # input [B, C, T]
    def forward(self, x):
        _, C, _ = x.shape

        if self.padding:
            x = F.pad(x, (self.pad_left, self.pad_right),
                      mode=self.padding_mode)
        out = F.conv1d(x, self.filter.expand(C, -1, -1),
                       stride=self.stride, groups=C)

        return out
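A shape-level sketch of the filter above: `cutoff` is in normalized frequency (0.5 is Nyquist), and `stride` folds decimation into the same grouped convolution. Again this assumes the deleted module remains importable:

```python
import torch
from tortoise.models.alias_free_torch import LowPassFilter1d

lp = LowPassFilter1d(cutoff=0.25, half_width=0.3, stride=2, kernel_size=12)
x = torch.randn(1, 4, 64)            # [B, C, T]
print(lp(x).shape)                   # torch.Size([1, 4, 32]); stride=2 halves T
```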
@ -1,49 +0,0 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import torch.nn as nn
from torch.nn import functional as F
from .filter import LowPassFilter1d
from .filter import kaiser_sinc_filter1d


class UpSample1d(nn.Module):
    def __init__(self, ratio=2, kernel_size=None):
        super().__init__()
        self.ratio = ratio
        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
        self.stride = ratio
        self.pad = self.kernel_size // ratio - 1
        self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
        self.pad_right = self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
        filter = kaiser_sinc_filter1d(cutoff=0.5 / ratio,
                                      half_width=0.6 / ratio,
                                      kernel_size=self.kernel_size)
        self.register_buffer("filter", filter)

    # x: [B, C, T]
    def forward(self, x):
        _, C, _ = x.shape

        x = F.pad(x, (self.pad, self.pad), mode='replicate')
        x = self.ratio * F.conv_transpose1d(
            x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)
        x = x[..., self.pad_left:-self.pad_right]

        return x


class DownSample1d(nn.Module):
    def __init__(self, ratio=2, kernel_size=None):
        super().__init__()
        self.ratio = ratio
        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
        self.lowpass = LowPassFilter1d(cutoff=0.5 / ratio,
                                       half_width=0.6 / ratio,
                                       stride=ratio,
                                       kernel_size=self.kernel_size)

    def forward(self, x):
        xx = self.lowpass(x)

        return xx
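Putting the two resamplers together: a 2x up/down round trip returns the original length, which is what lets `Activation1d` sandwich a nonlinearity between them without changing T. A sketch under the same importability assumption:

```python
import torch
from tortoise.models.alias_free_torch import UpSample1d, DownSample1d

x = torch.randn(1, 8, 100)
y = UpSample1d(ratio=2)(x)           # torch.Size([1, 8, 200])
z = DownSample1d(ratio=2)(y)         # torch.Size([1, 8, 100])
print(y.shape, z.shape)
```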
@ -11,7 +11,6 @@ from tortoise.utils.typical_sampling import TypicalLogitsWarper

from tortoise.utils.device import get_device_count

import tortoise.utils.torch_intermediary as ml

def null_position_embeddings(range, dim):
    return torch.zeros((range.shape[0], range.shape[1], dim), device=range.device)
@ -222,8 +221,7 @@ class ConditioningEncoder(nn.Module):
class LearnedPositionEmbeddings(nn.Module):
    def __init__(self, seq_len, model_dim, init=.02):
        super().__init__()
        # ml.Embedding
        self.emb = ml.Embedding(seq_len, model_dim)
        self.emb = nn.Embedding(seq_len, model_dim)
        # Initializing this way is standard for GPT-2
        self.emb.weight.data.normal_(mean=0.0, std=init)

@ -283,9 +281,9 @@ class MelEncoder(nn.Module):


class UnifiedVoice(nn.Module):
    def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=120, max_prompt_tokens=2, max_mel_tokens=250, max_conditioning_inputs=1,
    def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=120, max_mel_tokens=250, max_conditioning_inputs=1,
                 mel_length_compression=1024, number_text_tokens=256,
                 start_text_token=None, stop_text_token=0, number_mel_codes=8194, start_mel_token=8192,
                 start_text_token=None, number_mel_codes=8194, start_mel_token=8192,
                 stop_mel_token=8193, train_solo_embeddings=False, use_mel_codes_as_input=True,
                 checkpointing=True, types=1):
        """
@ -295,7 +293,6 @@ class UnifiedVoice(nn.Module):
        heads: Number of transformer heads. Must be divisible by model_dim. Recommend model_dim//64
        max_text_tokens: Maximum number of text tokens that will be encountered by model.
        max_mel_tokens: Maximum number of MEL tokens that will be encountered by model.
        max_prompt_tokens: compat set to 2, 70 for XTTS
        max_conditioning_inputs: Maximum number of conditioning inputs provided to the model. If (1), conditioning input can be of format (b,80,s), otherwise (b,n,80,s).
        mel_length_compression: The factor between <number_input_samples> and <mel_tokens>. Used to compute MEL code padding given wav input length.
        number_text_tokens:
@ -312,7 +309,7 @@ class UnifiedVoice(nn.Module):

        self.number_text_tokens = number_text_tokens
        self.start_text_token = number_text_tokens * types if start_text_token is None else start_text_token
        self.stop_text_token = stop_text_token
        self.stop_text_token = 0
        self.number_mel_codes = number_mel_codes
        self.start_mel_token = start_mel_token
        self.stop_mel_token = stop_mel_token
@ -320,16 +317,13 @@ class UnifiedVoice(nn.Module):
        self.heads = heads
        self.max_mel_tokens = max_mel_tokens
        self.max_text_tokens = max_text_tokens
        self.max_prompt_tokens = max_prompt_tokens
        self.model_dim = model_dim
        self.max_conditioning_inputs = max_conditioning_inputs
        self.mel_length_compression = mel_length_compression
        self.conditioning_encoder = ConditioningEncoder(80, model_dim, num_attn_heads=heads)
        # ml.Embedding
        self.text_embedding = ml.Embedding(self.number_text_tokens*types+1, model_dim)
        self.text_embedding = nn.Embedding(self.number_text_tokens*types+1, model_dim)
        if use_mel_codes_as_input:
            # ml.Embedding
            self.mel_embedding = ml.Embedding(self.number_mel_codes, model_dim)
            self.mel_embedding = nn.Embedding(self.number_mel_codes, model_dim)
        else:
            self.mel_embedding = MelEncoder(model_dim, resblocks_per_reduction=1)
        self.gpt, self.mel_pos_embedding, self.text_pos_embedding, self.mel_layer_pos_embedding, self.text_layer_pos_embedding = \
@ -342,10 +336,8 @@ class UnifiedVoice(nn.Module):
            self.text_solo_embedding = 0

        self.final_norm = nn.LayerNorm(model_dim)
        # nn.Linear
        self.text_head = ml.Linear(model_dim, self.number_text_tokens*types+1)
        # nn.Linear
        self.mel_head = ml.Linear(model_dim, self.number_mel_codes)
        self.text_head = nn.Linear(model_dim, self.number_text_tokens*types+1)
        self.mel_head = nn.Linear(model_dim, self.number_mel_codes)

        # Initialize the embeddings per the GPT-2 scheme
        embeddings = [self.text_embedding]
@ -354,8 +346,8 @@ class UnifiedVoice(nn.Module):
        for module in embeddings:
            module.weight.data.normal_(mean=0.0, std=.02)

    def post_init_gpt2_config(self, use_deepspeed=False, kv_cache=False):
        seq_length = self.max_mel_tokens + self.max_text_tokens + self.max_prompt_tokens
    def post_init_gpt2_config(self, kv_cache=False):
        seq_length = self.max_mel_tokens + self.max_text_tokens + 2
        gpt_config = GPT2Config(vocab_size=self.max_mel_tokens,
                                n_positions=seq_length,
                                n_ctx=seq_length,
@ -365,17 +357,6 @@ class UnifiedVoice(nn.Module):
                                gradient_checkpointing=False,
                                use_cache=True)
        self.inference_model = GPT2InferenceModel(gpt_config, self.gpt, self.mel_pos_embedding, self.mel_embedding, self.final_norm, self.mel_head, kv_cache=kv_cache)
        #print(f'use_deepspeed autoregressive_debug {use_deepspeed}')
        if use_deepspeed and torch.cuda.is_available():
            import deepspeed
            self.ds_engine = deepspeed.init_inference(model=self.inference_model,
                                                      mp_size=1,
                                                      replace_with_kernel_inject=True,
                                                      dtype=torch.float32)
            self.inference_model = self.ds_engine.module.eval()
        else:
            self.inference_model = self.inference_model.eval()

        self.gpt.wte = self.mel_embedding

    def build_aligned_inputs_and_targets(self, input, start_token, stop_token):
@ -496,9 +477,9 @@ class UnifiedVoice(nn.Module):

    def inference_speech(self, speech_conditioning_latent, text_inputs, input_tokens=None, num_return_sequences=1,
                         max_generate_length=None, typical_sampling=False, typical_mass=.9, **hf_generate_kwargs):
        seq_length = self.max_mel_tokens + self.max_text_tokens + self.max_prompt_tokens
        seq_length = self.max_mel_tokens + self.max_text_tokens + 2
        if not hasattr(self, 'inference_model'):
            self.post_init_gpt2_config(kv_cache=self.kv_cache)

        text_inputs = F.pad(text_inputs, (0, 1), value=self.stop_text_token)
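Note that the two `seq_length` variants in this diff are numerically identical under the defaults above, since `max_prompt_tokens` defaults to 2; they only diverge for XTTS-style checkpoints that raise it to 70. A one-line check using the defaults from the `__init__` signature:

```python
# Defaults from UnifiedVoice.__init__ above.
max_mel_tokens, max_text_tokens, max_prompt_tokens = 250, 120, 2
assert max_mel_tokens + max_text_tokens + max_prompt_tokens == \
       max_mel_tokens + max_text_tokens + 2 == 372
```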
@ -1,485 +0,0 @@
# Copyright (c) 2022 NVIDIA CORPORATION.
# Licensed under the MIT license.

# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
# LICENSE is in incl_licenses directory.

import json
import os
import torch, torch.utils.data
import tortoise.models.activations as activations
from torch.nn import Conv1d, ConvTranspose1d, Conv2d
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
from tortoise.models.alias_free_torch import *
from librosa.filters import mel as librosa_mel_fn

LRELU_SLOPE = 0.1


class AMPBlock1(torch.nn.Module):
    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5), activation=None):
        super(AMPBlock1, self).__init__()
        self.h = h

        self.convs1 = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
                               padding=get_padding(kernel_size, dilation[0]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
                               padding=get_padding(kernel_size, dilation[1]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
                               padding=get_padding(kernel_size, dilation[2])))
        ])
        self.convs1.apply(init_weights)

        self.convs2 = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
                               padding=get_padding(kernel_size, 1))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
                               padding=get_padding(kernel_size, 1))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
                               padding=get_padding(kernel_size, 1)))
        ])
        self.convs2.apply(init_weights)

        self.num_layers = len(self.convs1) + len(self.convs2)  # total number of conv layers

        if activation == 'snake':  # periodic nonlinearity with snake function and anti-aliasing
            self.activations = nn.ModuleList([
                Activation1d(
                    activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
                for _ in range(self.num_layers)
            ])
        elif activation == 'snakebeta':  # periodic nonlinearity with snakebeta function and anti-aliasing
            self.activations = nn.ModuleList([
                Activation1d(
                    activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
                for _ in range(self.num_layers)
            ])
        else:
            raise NotImplementedError(
                "activation incorrectly specified. check the config file and look for 'activation'.")

    def forward(self, x):
        acts1, acts2 = self.activations[::2], self.activations[1::2]
        for c1, c2, a1, a2 in zip(self.convs1, self.convs2, acts1, acts2):
            xt = a1(x)
            xt = c1(xt)
            xt = a2(xt)
            xt = c2(xt)
            x = xt + x

        return x

    def remove_weight_norm(self):
        for l in self.convs1:
            remove_weight_norm(l)
        for l in self.convs2:
            remove_weight_norm(l)


class AMPBlock2(torch.nn.Module):
    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3), activation=None):
        super(AMPBlock2, self).__init__()
        self.h = h

        self.convs = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
                               padding=get_padding(kernel_size, dilation[0]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
                               padding=get_padding(kernel_size, dilation[1])))
        ])
        self.convs.apply(init_weights)

        self.num_layers = len(self.convs)  # total number of conv layers

        if activation == 'snake':  # periodic nonlinearity with snake function and anti-aliasing
            self.activations = nn.ModuleList([
                Activation1d(
                    activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
                for _ in range(self.num_layers)
            ])
        elif activation == 'snakebeta':  # periodic nonlinearity with snakebeta function and anti-aliasing
            self.activations = nn.ModuleList([
                Activation1d(
                    activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
                for _ in range(self.num_layers)
            ])
        else:
            raise NotImplementedError(
                "activation incorrectly specified. check the config file and look for 'activation'.")

    def forward(self, x):
        for c, a in zip(self.convs, self.activations):
            xt = a(x)
            xt = c(xt)
            x = xt + x

        return x

    def remove_weight_norm(self):
        for l in self.convs:
            remove_weight_norm(l)


class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

class BigVGAN(nn.Module):
    # this is our main BigVGAN model. Applies anti-aliased periodic activation for resblocks.
    def __init__(self, config=None, data=None):
        super(BigVGAN, self).__init__()

        """
        with open(os.path.join(os.path.dirname(__file__), 'config.json'), 'r') as f:
            data = f.read()
        """
        if config and data is None:
            with open(config, 'r') as f:
                data = f.read()
            jsonConfig = json.loads(data)
        elif data is not None:
            if isinstance(data, str):
                jsonConfig = json.loads(data)
            else:
                jsonConfig = data
        else:
            raise Exception("no config specified")

        global h
        h = AttrDict(jsonConfig)

        self.mel_channel = h.num_mels
        self.noise_dim = h.n_fft
        self.hop_length = h.hop_size
        self.num_kernels = len(h.resblock_kernel_sizes)
        self.num_upsamples = len(h.upsample_rates)

        # pre conv
        self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3))

        # define which AMPBlock to use. BigVGAN uses AMPBlock1 as default
        resblock = AMPBlock1 if h.resblock == '1' else AMPBlock2

        # transposed conv-based upsamplers. does not apply anti-aliasing
        self.ups = nn.ModuleList()
        for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
            self.ups.append(nn.ModuleList([
                weight_norm(ConvTranspose1d(h.upsample_initial_channel // (2 ** i),
                                            h.upsample_initial_channel // (2 ** (i + 1)),
                                            k, u, padding=(k - u) // 2))
            ]))

        # residual blocks using anti-aliased multi-periodicity composition modules (AMP)
        self.resblocks = nn.ModuleList()
        for i in range(len(self.ups)):
            ch = h.upsample_initial_channel // (2 ** (i + 1))
            for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
                self.resblocks.append(resblock(h, ch, k, d, activation=h.activation))

        # post conv
        if h.activation == "snake":  # periodic nonlinearity with snake function and anti-aliasing
            activation_post = activations.Snake(ch, alpha_logscale=h.snake_logscale)
            self.activation_post = Activation1d(activation=activation_post)
        elif h.activation == "snakebeta":  # periodic nonlinearity with snakebeta function and anti-aliasing
            activation_post = activations.SnakeBeta(ch, alpha_logscale=h.snake_logscale)
            self.activation_post = Activation1d(activation=activation_post)
        else:
            raise NotImplementedError(
                "activation incorrectly specified. check the config file and look for 'activation'.")

        self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))

        # weight initialization
        for i in range(len(self.ups)):
            self.ups[i].apply(init_weights)
        self.conv_post.apply(init_weights)

    def forward(self, x, c):
        # pre conv
        x = self.conv_pre(x)

        for i in range(self.num_upsamples):
            # upsampling
            for i_up in range(len(self.ups[i])):
                x = self.ups[i][i_up](x)
            # AMP blocks
            xs = None
            for j in range(self.num_kernels):
                if xs is None:
                    xs = self.resblocks[i * self.num_kernels + j](x)
                else:
                    xs += self.resblocks[i * self.num_kernels + j](x)
            x = xs / self.num_kernels

        # post conv
        x = self.activation_post(x)
        x = self.conv_post(x)
        x = torch.tanh(x)

        return x

    def remove_weight_norm(self):
        print('Removing weight norm...')
        for l in self.ups:
            for l_i in l:
                remove_weight_norm(l_i)
        for l in self.resblocks:
            l.remove_weight_norm()
        remove_weight_norm(self.conv_pre)
        remove_weight_norm(self.conv_post)

    def inference(self, c, z=None):
        # pad input mel with zeros to cut artifact
        # see https://github.com/seungwonpark/melgan/issues/8
        zero = torch.full((c.shape[0], h.num_mels, 10), -11.5129).to(c.device)
        mel = torch.cat((c, zero), dim=2)

        if z is None:
            z = torch.randn(c.shape[0], self.noise_dim, mel.size(2)).to(mel.device)

        audio = self.forward(mel, z)
        audio = audio[:, :, :-(self.hop_length * 10)]
        audio = audio.clamp(min=-1, max=1)
        return audio

    def eval(self, inference=False):
        super(BigVGAN, self).eval()
        # don't remove weight norm while validation in training loop
        if inference:
            self.remove_weight_norm()


class DiscriminatorP(nn.Module):
    def __init__(self, h, period, kernel_size=5, stride=3, use_spectral_norm=False):
        super(DiscriminatorP, self).__init__()
        self.period = period
        self.d_mult = h.discriminator_channel_mult
        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
        self.convs = nn.ModuleList([
            norm_f(Conv2d(1, int(32 * self.d_mult), (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
            norm_f(Conv2d(int(32 * self.d_mult), int(128 * self.d_mult), (kernel_size, 1), (stride, 1),
                          padding=(get_padding(5, 1), 0))),
            norm_f(Conv2d(int(128 * self.d_mult), int(512 * self.d_mult), (kernel_size, 1), (stride, 1),
                          padding=(get_padding(5, 1), 0))),
            norm_f(Conv2d(int(512 * self.d_mult), int(1024 * self.d_mult), (kernel_size, 1), (stride, 1),
                          padding=(get_padding(5, 1), 0))),
            norm_f(Conv2d(int(1024 * self.d_mult), int(1024 * self.d_mult), (kernel_size, 1), 1, padding=(2, 0))),
        ])
        self.conv_post = norm_f(Conv2d(int(1024 * self.d_mult), 1, (3, 1), 1, padding=(1, 0)))

    def forward(self, x):
        fmap = []

        # 1d to 2d
        b, c, t = x.shape
        if t % self.period != 0:  # pad first
            n_pad = self.period - (t % self.period)
            x = F.pad(x, (0, n_pad), "reflect")
            t = t + n_pad
        x = x.view(b, c, t // self.period, self.period)

        for l in self.convs:
            x = l(x)
            x = F.leaky_relu(x, LRELU_SLOPE)
            fmap.append(x)
        x = self.conv_post(x)
        fmap.append(x)
        x = torch.flatten(x, 1, -1)

        return x, fmap


class MultiPeriodDiscriminator(nn.Module):
    def __init__(self, h):
        super(MultiPeriodDiscriminator, self).__init__()
        self.mpd_reshapes = h.mpd_reshapes
        print("mpd_reshapes: {}".format(self.mpd_reshapes))
        discriminators = [DiscriminatorP(h, rs, use_spectral_norm=h.use_spectral_norm) for rs in self.mpd_reshapes]
        self.discriminators = nn.ModuleList(discriminators)

    def forward(self, y, y_hat):
        y_d_rs = []
        y_d_gs = []
        fmap_rs = []
        fmap_gs = []
        for i, d in enumerate(self.discriminators):
            y_d_r, fmap_r = d(y)
            y_d_g, fmap_g = d(y_hat)
            y_d_rs.append(y_d_r)
            fmap_rs.append(fmap_r)
            y_d_gs.append(y_d_g)
            fmap_gs.append(fmap_g)

        return y_d_rs, y_d_gs, fmap_rs, fmap_gs


class DiscriminatorR(nn.Module):
    def __init__(self, cfg, resolution):
        super().__init__()

        self.resolution = resolution
        assert len(self.resolution) == 3, \
            "MRD layer requires list with len=3, got {}".format(self.resolution)
        self.lrelu_slope = LRELU_SLOPE

        norm_f = weight_norm if cfg.use_spectral_norm == False else spectral_norm
        if hasattr(cfg, "mrd_use_spectral_norm"):
            print("INFO: overriding MRD use_spectral_norm as {}".format(cfg.mrd_use_spectral_norm))
            norm_f = weight_norm if cfg.mrd_use_spectral_norm == False else spectral_norm
        self.d_mult = cfg.discriminator_channel_mult
        if hasattr(cfg, "mrd_channel_mult"):
            print("INFO: overriding mrd channel multiplier as {}".format(cfg.mrd_channel_mult))
            self.d_mult = cfg.mrd_channel_mult

        self.convs = nn.ModuleList([
            norm_f(nn.Conv2d(1, int(32 * self.d_mult), (3, 9), padding=(1, 4))),
            norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
            norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
            norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
            norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 3), padding=(1, 1))),
        ])
        self.conv_post = norm_f(nn.Conv2d(int(32 * self.d_mult), 1, (3, 3), padding=(1, 1)))

    def forward(self, x):
        fmap = []

        x = self.spectrogram(x)
        x = x.unsqueeze(1)
        for l in self.convs:
            x = l(x)
            x = F.leaky_relu(x, self.lrelu_slope)
            fmap.append(x)
        x = self.conv_post(x)
        fmap.append(x)
        x = torch.flatten(x, 1, -1)

        return x, fmap

    def spectrogram(self, x):
        n_fft, hop_length, win_length = self.resolution
        x = F.pad(x, (int((n_fft - hop_length) / 2), int((n_fft - hop_length) / 2)), mode='reflect')
        x = x.squeeze(1)
        x = torch.stft(x, n_fft=n_fft, hop_length=hop_length, win_length=win_length, center=False, return_complex=True)
        x = torch.view_as_real(x)  # [B, F, TT, 2]
        mag = torch.norm(x, p=2, dim=-1)  # [B, F, TT]

        return mag


class MultiResolutionDiscriminator(nn.Module):
    def __init__(self, cfg, debug=False):
        super().__init__()
        self.resolutions = cfg.resolutions
        assert len(self.resolutions) == 3, \
            "MRD requires list of list with len=3, each element having a list with len=3. got {}". \
            format(self.resolutions)
        self.discriminators = nn.ModuleList(
            [DiscriminatorR(cfg, resolution) for resolution in self.resolutions]
        )

    def forward(self, y, y_hat):
        y_d_rs = []
        y_d_gs = []
        fmap_rs = []
        fmap_gs = []

        for i, d in enumerate(self.discriminators):
            y_d_r, fmap_r = d(x=y)
            y_d_g, fmap_g = d(x=y_hat)
            y_d_rs.append(y_d_r)
            fmap_rs.append(fmap_r)
            y_d_gs.append(y_d_g)
            fmap_gs.append(fmap_g)

        return y_d_rs, y_d_gs, fmap_rs, fmap_gs

def get_mel(x):
    return mel_spectrogram(x, h.n_fft, h.num_mels, h.sampling_rate, h.hop_size, h.win_size, h.fmin, h.fmax)

def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
    if torch.min(y) < -1.:
        print('min value is ', torch.min(y))
    if torch.max(y) > 1.:
        print('max value is ', torch.max(y))

    global mel_basis, hann_window
    if fmax not in mel_basis:
        mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
        mel_basis[str(fmax)+'_'+str(y.device)] = torch.from_numpy(mel).float().to(y.device)
        hann_window[str(y.device)] = torch.hann_window(win_size).to(y.device)

    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
    y = y.squeeze(1)

    # complex tensor as default, then use view_as_real for future pytorch compatibility
    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
                      center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=True)
    spec = torch.view_as_real(spec)
    spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))

    spec = torch.matmul(mel_basis[str(fmax)+'_'+str(y.device)], spec)
    spec = torch.nn.utils.spectral_normalize_torch(spec)

    return spec

def feature_loss(fmap_r, fmap_g):
    loss = 0
    for dr, dg in zip(fmap_r, fmap_g):
        for rl, gl in zip(dr, dg):
            loss += torch.mean(torch.abs(rl - gl))

    return loss * 2


def init_weights(m, mean=0.0, std=0.01):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        m.weight.data.normal_(mean, std)


def get_padding(kernel_size, dilation=1):
    return int((kernel_size * dilation - dilation) / 2)


def discriminator_loss(disc_real_outputs, disc_generated_outputs):
    loss = 0
    r_losses = []
    g_losses = []
    for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
        r_loss = torch.mean((1 - dr) ** 2)
        g_loss = torch.mean(dg ** 2)
        loss += (r_loss + g_loss)
        r_losses.append(r_loss.item())
        g_losses.append(g_loss.item())

    return loss, r_losses, g_losses


def generator_loss(disc_outputs):
    loss = 0
    gen_losses = []
    for dg in disc_outputs:
        l = torch.mean((1 - dg) ** 2)
        gen_losses.append(l)
        loss += l

    return loss, gen_losses


if __name__ == '__main__':
    model = BigVGAN()

    c = torch.randn(3, 100, 10)
    z = torch.randn(3, 64, 10)
    print(c.shape)

    y = model(c, z)
    print(y.shape)
    assert y.shape == torch.Size([3, 1, 2560])

    pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(pytorch_total_params)
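The `-11.5129` padding constant in `inference()` above appears to be `log(1e-5)`, i.e. near-silence in the log-mel domain: ten frames of it are appended to the mel, and the matching `hop_length * 10` samples are trimmed from the generated waveform. A quick check of that reading:

```python
import math

# log(1e-5) ~= -11.512925, the "silent" log-mel value padded onto the input.
print(math.log(1e-5))
```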
@ -3,7 +3,6 @@ import torch.nn as nn

from tortoise.models.arch_util import Upsample, Downsample, normalization, zero_module, AttentionBlock

import tortoise.utils.torch_intermediary as ml

class ResBlock(nn.Module):
    def __init__(
@ -125,8 +124,7 @@ class AudioMiniEncoderWithClassifierHead(nn.Module):
    def __init__(self, classes, distribute_zero_label=True, **kwargs):
        super().__init__()
        self.enc = AudioMiniEncoder(**kwargs)
        # nn.Linear
        self.head = ml.Linear(self.enc.dim, classes)
        self.head = nn.Linear(self.enc.dim, classes)
        self.num_classes = classes
        self.distribute_zero_label = distribute_zero_label

@ -7,9 +7,6 @@ from tortoise.models.arch_util import CheckpointedXTransformerEncoder
from tortoise.models.transformer import Transformer
from tortoise.models.xtransformers import Encoder

import tortoise.utils.torch_intermediary as ml

from tortoise.utils.device import print_stats, do_gc

def exists(val):
    return val is not None
@ -47,15 +44,11 @@ class CLVP(nn.Module):
            use_xformers=False,
    ):
        super().__init__()
        # nn.Embedding
        self.text_emb = ml.Embedding(num_text_tokens, dim_text)
        # nn.Linear
        self.to_text_latent = ml.Linear(dim_text, dim_latent, bias=False)
        self.text_emb = nn.Embedding(num_text_tokens, dim_text)
        self.to_text_latent = nn.Linear(dim_text, dim_latent, bias=False)

        # nn.Embedding
        self.speech_emb = ml.Embedding(num_speech_tokens, dim_speech)
        # nn.Linear
        self.to_speech_latent = ml.Linear(dim_speech, dim_latent, bias=False)
        self.speech_emb = nn.Embedding(num_speech_tokens, dim_speech)
        self.to_speech_latent = nn.Linear(dim_speech, dim_latent, bias=False)

        if use_xformers:
            self.text_transformer = CheckpointedXTransformerEncoder(
@ -100,10 +93,8 @@ class CLVP(nn.Module):
        self.wav_token_compression = wav_token_compression
        self.xformers = use_xformers
        if not use_xformers:
            # nn.Embedding
            self.text_pos_emb = ml.Embedding(text_seq_len, dim_text)
            # nn.Embedding
            self.speech_pos_emb = ml.Embedding(num_speech_tokens, dim_speech)
            self.text_pos_emb = nn.Embedding(text_seq_len, dim_text)
            self.speech_pos_emb = nn.Embedding(num_speech_tokens, dim_speech)

    def forward(
            self,
@ -126,13 +117,14 @@ class CLVP(nn.Module):
            text_emb += self.text_pos_emb(torch.arange(text.shape[1], device=device))
            speech_emb += self.speech_pos_emb(torch.arange(speech_emb.shape[1], device=device))


        text_latents = self.to_text_latent(masked_mean(self.text_transformer(text_emb, mask=text_mask), text_mask, dim=1))
        enc_text = self.text_transformer(text_emb, mask=text_mask)
        enc_speech = self.speech_transformer(speech_emb, mask=voice_mask)

        # on ROCm at least, allocated VRAM spikes here
        do_gc()
        speech_latents = self.to_speech_latent(masked_mean(self.speech_transformer(speech_emb, mask=voice_mask), voice_mask, dim=1))
        do_gc()
        text_latents = masked_mean(enc_text, text_mask, dim=1)
        speech_latents = masked_mean(enc_speech, voice_mask, dim=1)

        text_latents = self.to_text_latent(text_latents)
        speech_latents = self.to_speech_latent(speech_latents)

        text_latents, speech_latents = map(lambda t: F.normalize(t, p=2, dim=-1), (text_latents, speech_latents))

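Since both latents end the forward pass L2-normalized via `F.normalize(..., p=2, dim=-1)`, matching text against speech reduces to a dot product, i.e. cosine similarity. A toy sketch with random stand-in latents:

```python
import torch
import torch.nn.functional as F

text_latents = F.normalize(torch.randn(4, 512), p=2, dim=-1)
speech_latents = F.normalize(torch.randn(4, 512), p=2, dim=-1)
sim = text_latents @ speech_latents.t()  # [4, 4] pairwise cosine similarities
print(sim.shape)
```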
@ -6,7 +6,6 @@ from torch import einsum
from tortoise.models.arch_util import AttentionBlock
from tortoise.models.xtransformers import ContinuousTransformerWrapper, Encoder

import tortoise.utils.torch_intermediary as ml

def exists(val):
    return val is not None
@ -55,8 +54,7 @@ class CollapsingTransformer(nn.Module):
class ConvFormatEmbedding(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()
        # nn.Embedding
        self.emb = ml.Embedding(*args, **kwargs)
        self.emb = nn.Embedding(*args, **kwargs)

    def forward(self, x):
        y = self.emb(x)
@ -85,8 +83,7 @@ class CVVP(nn.Module):
            nn.Conv1d(model_dim//2, model_dim, kernel_size=3, stride=2, padding=1))
        self.conditioning_transformer = CollapsingTransformer(
            model_dim, model_dim, transformer_heads, dropout, conditioning_enc_depth, cond_mask_percentage)
        # nn.Linear
        self.to_conditioning_latent = ml.Linear(
        self.to_conditioning_latent = nn.Linear(
            latent_dim, latent_dim, bias=False)

        if mel_codes is None:
@ -96,8 +93,7 @@ class CVVP(nn.Module):
            self.speech_emb = ConvFormatEmbedding(mel_codes, model_dim)
        self.speech_transformer = CollapsingTransformer(
            model_dim, latent_dim, transformer_heads, dropout, speech_enc_depth, speech_mask_percentage)
        # nn.Linear
        self.to_speech_latent = ml.Linear(
        self.to_speech_latent = nn.Linear(
            latent_dim, latent_dim, bias=False)

    def get_grad_norm_parameter_groups(self):

@ -10,8 +10,6 @@ from torch import autocast
from tortoise.models.arch_util import normalization, AttentionBlock
from tortoise.utils.device import get_device_name

import tortoise.utils.torch_intermediary as ml

def is_latent(t):
    return t.dtype == torch.float

@ -89,8 +87,7 @@ class ResBlock(TimestepBlock):

        self.emb_layers = nn.Sequential(
            nn.SiLU(),
            # nn.Linear
            ml.Linear(
            nn.Linear(
                emb_channels,
                2 * self.out_channels if use_scale_shift_norm else self.out_channels,
            ),
@ -163,19 +160,16 @@ class DiffusionTts(nn.Module):

        self.inp_block = nn.Conv1d(in_channels, model_channels, 3, 1, 1)
        self.time_embed = nn.Sequential(
            # nn.Linear
            ml.Linear(model_channels, model_channels),
            nn.Linear(model_channels, model_channels),
            nn.SiLU(),
            # nn.Linear
            ml.Linear(model_channels, model_channels),
            nn.Linear(model_channels, model_channels),
        )

        # Either code_converter or latent_converter is used, depending on what type of conditioning data is fed.
        # This model is meant to be able to be trained on both for efficiency purposes - it is far less computationally
        # complex to generate tokens, while generating latents will normally mean propagating through a deep autoregressive
        # transformer network.
        # nn.Embedding
        self.code_embedding = ml.Embedding(in_tokens, model_channels)
        self.code_embedding = nn.Embedding(in_tokens, model_channels)
        self.code_converter = nn.Sequential(
            AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
            AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),

@ -4,7 +4,6 @@ import torch
import torch.nn as nn
import torch.nn.functional as F

import tortoise.utils.torch_intermediary as ml

def fused_leaky_relu(input, bias=None, negative_slope=0.2, scale=2 ** 0.5):
    if bias is not None:
@ -42,8 +41,7 @@ class RandomLatentConverter(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.layers = nn.Sequential(*[EqualLinear(channels, channels, lr_mul=.1) for _ in range(5)],
                                    # nn.Linear
                                    ml.Linear(channels, channels))
                                    nn.Linear(channels, channels))
        self.channels = channels

    def forward(self, ref):

@ -6,7 +6,6 @@ from einops import rearrange
from rotary_embedding_torch import RotaryEmbedding, broadcat
from torch import nn

import tortoise.utils.torch_intermediary as ml

# helpers

@ -121,12 +120,10 @@ class FeedForward(nn.Module):
    def __init__(self, dim, dropout = 0., mult = 4.):
        super().__init__()
        self.net = nn.Sequential(
            # nn.Linear
            ml.Linear(dim, dim * mult * 2),
            nn.Linear(dim, dim * mult * 2),
            GEGLU(),
            nn.Dropout(dropout),
            # nn.Linear
            ml.Linear(dim * mult, dim)
            nn.Linear(dim * mult, dim)
        )

    def forward(self, x):
@ -145,11 +142,9 @@ class Attention(nn.Module):

        self.causal = causal

        # nn.Linear
        self.to_qkv = ml.Linear(dim, inner_dim * 3, bias = False)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Sequential(
            # nn.Linear
            ml.Linear(inner_dim, dim),
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        )

|
||||
from einops import rearrange, repeat
|
||||
from torch import nn, einsum
|
||||
|
||||
import tortoise.utils.torch_intermediary as ml
|
||||
|
||||
DEFAULT_DIM_HEAD = 64
|
||||
|
||||
Intermediates = namedtuple('Intermediates', [
|
||||
@ -123,8 +121,7 @@ class AbsolutePositionalEmbedding(nn.Module):
|
||||
def __init__(self, dim, max_seq_len):
|
||||
super().__init__()
|
||||
self.scale = dim ** -0.5
|
||||
# nn.Embedding
|
||||
self.emb = ml.Embedding(max_seq_len, dim)
|
||||
self.emb = nn.Embedding(max_seq_len, dim)
|
||||
|
||||
def forward(self, x):
|
||||
n = torch.arange(x.shape[1], device=x.device)
|
||||
@ -153,8 +150,7 @@ class RelativePositionBias(nn.Module):
|
||||
self.causal = causal
|
||||
self.num_buckets = num_buckets
|
||||
self.max_distance = max_distance
|
||||
# nn.Embedding
|
||||
self.relative_attention_bias = ml.Embedding(num_buckets, heads)
|
||||
self.relative_attention_bias = nn.Embedding(num_buckets, heads)
|
||||
|
||||
@staticmethod
|
||||
def _relative_position_bucket(relative_position, causal=True, num_buckets=32, max_distance=128):
|
||||
@ -354,8 +350,7 @@ class RMSScaleShiftNorm(nn.Module):
|
||||
self.scale = dim ** -0.5
|
||||
self.eps = eps
|
||||
self.g = nn.Parameter(torch.ones(dim))
|
||||
# nn.Linear
|
||||
self.scale_shift_process = ml.Linear(dim * 2, dim * 2)
|
||||
self.scale_shift_process = nn.Linear(dim * 2, dim * 2)
|
||||
|
||||
def forward(self, x, norm_scale_shift_inp):
|
||||
norm = torch.norm(x, dim=-1, keepdim=True) * self.scale
|
||||
@ -435,8 +430,7 @@ class GLU(nn.Module):
|
||||
def __init__(self, dim_in, dim_out, activation):
|
||||
super().__init__()
|
||||
self.act = activation
|
||||
# nn.Linear
|
||||
self.proj = ml.Linear(dim_in, dim_out * 2)
|
||||
self.proj = nn.Linear(dim_in, dim_out * 2)
|
||||
|
||||
def forward(self, x):
|
||||
x, gate = self.proj(x).chunk(2, dim=-1)
|
||||
@ -461,8 +455,7 @@ class FeedForward(nn.Module):
|
||||
activation = ReluSquared() if relu_squared else nn.GELU()
|
||||
|
||||
project_in = nn.Sequential(
|
||||
# nn.Linear
|
||||
ml.Linear(dim, inner_dim),
|
||||
nn.Linear(dim, inner_dim),
|
||||
activation
|
||||
) if not glu else GLU(dim, inner_dim, activation)
|
||||
|
||||
@ -470,8 +463,7 @@ class FeedForward(nn.Module):
|
||||
project_in,
|
||||
nn.LayerNorm(inner_dim) if post_act_ln else nn.Identity(),
|
||||
nn.Dropout(dropout),
|
||||
# nn.Linear
|
||||
ml.Linear(inner_dim, dim_out)
|
||||
nn.Linear(inner_dim, dim_out)
|
||||
)
|
||||
|
||||
# init last linear layer to 0
|
||||
@ -524,20 +516,16 @@ class Attention(nn.Module):
|
||||
qk_dim = int(collab_compression * qk_dim)
|
||||
self.collab_mixing = nn.Parameter(torch.randn(heads, qk_dim))
|
||||
|
||||
# nn.Linear
|
||||
self.to_q = ml.Linear(dim, qk_dim, bias=False)
|
||||
# nn.Linear
|
||||
self.to_k = ml.Linear(dim, qk_dim, bias=False)
|
||||
# nn.Linear
|
||||
self.to_v = ml.Linear(dim, v_dim, bias=False)
|
||||
self.to_q = nn.Linear(dim, qk_dim, bias=False)
|
||||
self.to_k = nn.Linear(dim, qk_dim, bias=False)
|
||||
self.to_v = nn.Linear(dim, v_dim, bias=False)
|
||||
|
||||
self.dropout = nn.Dropout(dropout)
|
||||
|
||||
# add GLU gating for aggregated values, from alphafold2
|
||||
self.to_v_gate = None
|
||||
if gate_values:
|
||||
# nn.Linear
|
||||
self.to_v_gate = ml.Linear(dim, v_dim)
|
||||
self.to_v_gate = nn.Linear(dim, v_dim)
|
||||
nn.init.constant_(self.to_v_gate.weight, 0)
|
||||
nn.init.constant_(self.to_v_gate.bias, 1)
|
||||
|
||||
@ -573,8 +561,7 @@ class Attention(nn.Module):
|
||||
|
||||
# attention on attention
|
||||
self.attn_on_attn = on_attn
|
||||
# nn.Linear
|
||||
self.to_out = nn.Sequential(ml.Linear(v_dim, dim * 2), nn.GLU()) if on_attn else ml.Linear(v_dim, dim)
|
||||
self.to_out = nn.Sequential(nn.Linear(v_dim, dim * 2), nn.GLU()) if on_attn else nn.Linear(v_dim, dim)
|
||||
|
||||
self.rel_pos_bias = rel_pos_bias
|
||||
if rel_pos_bias:
|
||||
@ -1064,8 +1051,7 @@ class ViTransformerWrapper(nn.Module):
|
||||
self.patch_size = patch_size
|
||||
|
||||
self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
|
||||
# nn.Linear
|
||||
self.patch_to_embedding = ml.Linear(patch_dim, dim)
|
||||
self.patch_to_embedding = nn.Linear(patch_dim, dim)
|
||||
self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
|
||||
self.dropout = nn.Dropout(emb_dropout)
|
||||
|
||||
@ -1123,21 +1109,18 @@ class TransformerWrapper(nn.Module):
|
||||
self.max_mem_len = max_mem_len
|
||||
self.shift_mem_down = shift_mem_down
|
||||
|
||||
# nn.Embedding
|
||||
self.token_emb = ml.Embedding(num_tokens, emb_dim)
|
||||
self.token_emb = nn.Embedding(num_tokens, emb_dim)
|
||||
self.pos_emb = AbsolutePositionalEmbedding(emb_dim, max_seq_len) if (
|
||||
use_pos_emb and not attn_layers.has_pos_emb) else always(0)
|
||||
self.emb_dropout = nn.Dropout(emb_dropout)
|
||||
|
||||
# nn.Linear
|
||||
self.project_emb = ml.Linear(emb_dim, dim) if emb_dim != dim else nn.Identity()
|
||||
self.project_emb = nn.Linear(emb_dim, dim) if emb_dim != dim else nn.Identity()
|
||||
self.attn_layers = attn_layers
|
||||
self.norm = nn.LayerNorm(dim)
|
||||
|
||||
self.init_()
|
||||
|
||||
# nn.Linear
|
||||
self.to_logits = ml.Linear(dim, num_tokens) if not tie_embedding else lambda t: t @ self.token_emb.weight.t()
|
||||
self.to_logits = nn.Linear(dim, num_tokens) if not tie_embedding else lambda t: t @ self.token_emb.weight.t()
|
||||
|
||||
# memory tokens (like [cls]) from Memory Transformers paper
|
||||
num_memory_tokens = default(num_memory_tokens, 0)
|
||||
@ -1224,14 +1207,12 @@ class ContinuousTransformerWrapper(nn.Module):
|
||||
use_pos_emb and not attn_layers.has_pos_emb) else always(0)
|
||||
self.emb_dropout = nn.Dropout(emb_dropout)
|
||||
|
||||
# nn.Linear
|
||||
self.project_in = ml.Linear(dim_in, dim) if exists(dim_in) else nn.Identity()
|
||||
self.project_in = nn.Linear(dim_in, dim) if exists(dim_in) else nn.Identity()
|
||||
|
||||
self.attn_layers = attn_layers
|
||||
self.norm = nn.LayerNorm(dim)
|
||||
|
||||
# nn.Linear
|
||||
self.project_out = ml.Linear(dim, dim_out) if exists(dim_out) else nn.Identity()
|
||||
self.project_out = nn.Linear(dim, dim_out) if exists(dim_out) else nn.Identity()
|
||||
|
||||
def forward(
|
||||
self,
|
||||
|
||||
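The `tie_embedding` branch above reuses the input embedding matrix as the output projection (weight tying); in isolation it is just a transposed matmul. A toy sketch with hypothetical sizes:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 16)               # vocab of 100, width 16
to_logits = lambda t: t @ emb.weight.t()  # same trick as TransformerWrapper above
h = torch.randn(2, 5, 16)                 # [batch, seq, dim] hidden states
print(to_logits(h).shape)                 # torch.Size([2, 5, 100])
```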
@ -17,7 +17,6 @@ if __name__ == '__main__':
                        'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='pat')
    parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/')
    parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
    parser.add_argument('--use_deepspeed', type=bool, help='Use deepspeed for speed bump.', default=True)
    parser.add_argument('--regenerate', type=str, help='Comma-separated list of clip numbers to re-generate, or nothing.', default=None)
    parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice. Only the first candidate is actually used in the final product, the others can be used manually.', default=1)
    parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this'
@ -26,7 +25,7 @@ if __name__ == '__main__':
    parser.add_argument('--produce_debug_state', type=bool, help='Whether or not to produce debug_state.pth, which can aid in reproducing problems. Defaults to true.', default=True)

    args = parser.parse_args()
    tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed)
    tts = TextToSpeech(models_dir=args.model_dir)

    outpath = args.output_path
    selected_voices = args.voice.split(',')
@ -2,7 +2,6 @@ import os
|
||||
from glob import glob
|
||||
|
||||
import librosa
|
||||
import soundfile as sf
|
||||
import torch
|
||||
import torchaudio
|
||||
import numpy as np
|
||||
@ -25,9 +24,6 @@ def load_audio(audiopath, sampling_rate):
|
||||
elif audiopath[-4:] == '.mp3':
|
||||
audio, lsr = librosa.load(audiopath, sr=sampling_rate)
|
||||
audio = torch.FloatTensor(audio)
|
||||
elif audiopath[-5:] == '.flac':
|
||||
audio, lsr = sf.read(audiopath)
|
||||
audio = torch.FloatTensor(audio)
|
||||
else:
|
||||
assert False, f"Unsupported audio format provided: {audiopath[-4:]}"
|
||||
|
||||
@@ -89,94 +85,31 @@ def get_voices(extra_voice_dirs=[], load_latents=True):
    for sub in subs:
        subj = os.path.join(d, sub)
        if os.path.isdir(subj):
            voices[sub] = list(glob(f'{subj}/*.wav')) + list(glob(f'{subj}/*.mp3')) + list(glob(f'{subj}/*.flac'))
            voices[sub] = list(glob(f'{subj}/*.wav')) + list(glob(f'{subj}/*.mp3'))
            if load_latents:
                voices[sub] = voices[sub] + list(glob(f'{subj}/*.pth'))
    return voices

def get_voice( name, dir=get_voice_dir(), load_latents=True, extensions=["wav", "mp3", "flac"] ):
    subj = f'{dir}/{name}/'
    if not os.path.isdir(subj):
        return
    files = os.listdir(subj)

    if load_latents:
        extensions.append("pth")

    voice = []
    for file in files:
        ext = os.path.splitext(file)[-1][1:]
        if ext not in extensions:
            continue

        voice.append(f'{subj}/{file}')

    return sorted( voice )

def get_voice_list(dir=get_voice_dir(), append_defaults=False, load_latents=True, extensions=["wav", "mp3", "flac"]):
    defaults = [ "random", "microphone" ]
    os.makedirs(dir, exist_ok=True)
    #res = sorted([d for d in os.listdir(dir) if d not in defaults and os.path.isdir(os.path.join(dir, d)) and len(os.listdir(os.path.join(dir, d))) > 0 ])

    res = []
    for name in os.listdir(dir):
        if name in defaults:
            continue
        if not os.path.isdir(f'{dir}/{name}'):
            continue
        if len(os.listdir(os.path.join(dir, name))) == 0:
            continue
        files = get_voice( name, dir=dir, extensions=extensions, load_latents=load_latents )

        if len(files) > 0:
            res.append(name)
        else:
            for subdir in os.listdir(f'{dir}/{name}'):
                if not os.path.isdir(f'{dir}/{name}/{subdir}'):
                    continue
                files = get_voice( f'{name}/{subdir}', dir=dir, extensions=extensions, load_latents=load_latents )
                if len(files) == 0:
                    continue
                res.append(f'{name}/{subdir}')

    res = sorted(res)

    if append_defaults:
        res = res + defaults

    return res
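
(A usage sketch for the two helpers above; the voice names are illustrative:)

voices = get_voice_list(append_defaults=True)   # e.g. ['pat', 'train/dotrice', 'random', 'microphone']
files = get_voice('pat')                        # sorted wav/mp3/flac paths, plus .pth latents, under voices/pat/
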
def _get_voices( dirs=[get_voice_dir()], load_latents=True ):
    voices = {}
    for dir in dirs:
        voice_list = get_voice_list(dir=dir)
        voices |= { name: get_voice(name=name, dir=dir, load_latents=load_latents) for name in voice_list }

    return voices

def load_voice(voice, extra_voice_dirs=[], load_latents=True, sample_rate=22050, device='cpu', model_hash=None):
def load_voice(voice, extra_voice_dirs=[], load_latents=True, sample_rate=22050, device='cpu'):
    if voice == 'random':
        return None, None

    voices = _get_voices(dirs=[get_voice_dir()] + extra_voice_dirs, load_latents=load_latents)
    voices = get_voices(extra_voice_dirs=extra_voice_dirs, load_latents=load_latents)
    paths = voices[voice]
    mtime = 0

    latent = None
    voices = []

    for path in paths:
        filename = os.path.basename(path)
        if filename[-4:] == ".pth" and filename[:12] == "cond_latents":
            if not model_hash and filename == "cond_latents.pth":
                latent = path
            elif model_hash and filename == f"cond_latents_{model_hash[:8]}.pth":
                latent = path
    mtime = 0
    voices = []
    latent = None
    for file in paths:
        if file[-16:] == "cond_latents.pth":
            latent = file
        elif file[-4:] == ".pth":
            {}
            # noop
        else:
            voices.append(path)
            mtime = max(mtime, os.path.getmtime(path))
            voices.append(file)
            mtime = max(mtime, os.path.getmtime(file))

    if load_latents and latent is not None:
        if os.path.getmtime(latent) > mtime:
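
(The model_hash branch above encodes a small naming convention for conditioning latents. A self-contained sketch of the same lookup, with illustrative filenames and hash:)

import os

def pick_cond_latent(paths, model_hash=None):
    # prefer a latent saved for the current model's hash, else the generic cond_latents.pth
    latent = None
    for path in paths:
        filename = os.path.basename(path)
        if filename[-4:] != ".pth" or filename[:12] != "cond_latents":
            continue
        if model_hash and filename == f"cond_latents_{model_hash[:8]}.pth":
            return path
        if not model_hash and filename == "cond_latents.pth":
            latent = path
    return latent

# pick_cond_latent(['voices/pat/cond_latents_0123abcd.pth'], model_hash='0123abcdef') returns that path
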
@@ -1,130 +1,97 @@
import torch
import psutil
import importlib

DEVICE_OVERRIDE = None
DEVICE_BATCH_SIZE_MAP = [(14, 16), (10, 8), (7, 4)]

from inspect import currentframe, getframeinfo
import gc

def do_gc():
    gc.collect()
    try:
        torch.cuda.empty_cache()
    except Exception as e:
        pass

def print_stats(collect=False):
    cf = currentframe().f_back
    msg = f'{getframeinfo(cf).filename}:{cf.f_lineno}'

    if collect:
        do_gc()

    tot = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
    res = torch.cuda.memory_reserved(0) / (1024 ** 3)
    alloc = torch.cuda.memory_allocated(0) / (1024 ** 3)
    print("[{}] Total: {:.3f} | Reserved: {:.3f} | Allocated: {:.3f} | Free: {:.3f}".format( msg, tot, res, alloc, tot-res ))


def has_dml():
    loader = importlib.find_loader('torch_directml')
    if loader is None:
        return False

    import torch_directml
    return torch_directml.is_available()

def set_device_name(name):
    global DEVICE_OVERRIDE
    DEVICE_OVERRIDE = name

def get_device_name(attempt_gc=True):
    global DEVICE_OVERRIDE
    if DEVICE_OVERRIDE is not None and DEVICE_OVERRIDE != "":
        return DEVICE_OVERRIDE

    name = 'cpu'

    if torch.cuda.is_available():
        name = 'cuda'
        if attempt_gc:
            torch.cuda.empty_cache() # may have performance implications
    elif has_dml():
        name = 'dml'

    return name

def get_device(verbose=False):
    name = get_device_name()

    if verbose:
        if name == 'cpu':
            print("No hardware acceleration is available, falling back to CPU...")
        else:
            print(f"Hardware acceleration found: {name}")

    if name == "dml":
        import torch_directml
        return torch_directml.device()

    return torch.device(name)

def get_device_vram( name=get_device_name() ):
    available = 1

    if name == "cuda":
        _, available = torch.cuda.mem_get_info()
    elif name == "cpu":
        available = psutil.virtual_memory()[4]

    return available / (1024 ** 3)

def get_device_batch_size(name=get_device_name()):
    vram = get_device_vram(name)

    if vram > 14:
        return 16
    elif vram > 10:
        return 8
    elif vram > 7:
        return 4
    """
    for k, v in DEVICE_BATCH_SIZE_MAP:
        if vram > k:
            return v
    """
    return 1

def get_device_count(name=get_device_name()):
    if name == "cuda":
        return torch.cuda.device_count()
    if name == "dml":
        import torch_directml
        return torch_directml.device_count()

    return 1


# if you're getting errors make sure you've updated your torch-directml, and if you're still getting errors then you can uncomment the below block
"""
if has_dml():
    _cumsum = torch.cumsum
    _repeat_interleave = torch.repeat_interleave
    _multinomial = torch.multinomial

    _Tensor_new = torch.Tensor.new
    _Tensor_cumsum = torch.Tensor.cumsum
    _Tensor_repeat_interleave = torch.Tensor.repeat_interleave
    _Tensor_multinomial = torch.Tensor.multinomial

    torch.cumsum = lambda input, *args, **kwargs: ( _cumsum(input.to("cpu"), *args, **kwargs).to(input.device) )
    torch.repeat_interleave = lambda input, *args, **kwargs: ( _repeat_interleave(input.to("cpu"), *args, **kwargs).to(input.device) )
    torch.multinomial = lambda input, *args, **kwargs: ( _multinomial(input.to("cpu"), *args, **kwargs).to(input.device) )

    torch.Tensor.new = lambda self, *args, **kwargs: ( _Tensor_new(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.cumsum = lambda self, *args, **kwargs: ( _Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.repeat_interleave = lambda self, *args, **kwargs: ( _Tensor_repeat_interleave(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.multinomial = lambda self, *args, **kwargs: ( _Tensor_multinomial(self.to("cpu"), *args, **kwargs).to(self.device) )
"""
import torch
import psutil
import importlib

DEVICE_OVERRIDE = None

def has_dml():
    loader = importlib.find_loader('torch_directml')
    if loader is None:
        return False

    import torch_directml
    return torch_directml.is_available()

def set_device_name(name):
    global DEVICE_OVERRIDE
    DEVICE_OVERRIDE = name

def get_device_name():
    global DEVICE_OVERRIDE
    if DEVICE_OVERRIDE is not None and DEVICE_OVERRIDE != "":
        return DEVICE_OVERRIDE

    name = 'cpu'

    if torch.cuda.is_available():
        name = 'cuda'
    elif has_dml():
        name = 'dml'

    return name

def get_device(verbose=False):
    name = get_device_name()

    if verbose:
        if name == 'cpu':
            print("No hardware acceleration is available, falling back to CPU...")
        else:
            print(f"Hardware acceleration found: {name}")

    if name == "dml":
        import torch_directml
        return torch_directml.device()

    return torch.device(name)

def get_device_batch_size():
    available = 1
    name = get_device_name()

    if name == "dml":
        # there's nothing publicly accessible in the DML API that exposes this
        # there's a method to get currently used RAM statistics... as tiles
        available = 1
    elif name == "cuda":
        _, available = torch.cuda.mem_get_info()
    elif name == "cpu":
        available = psutil.virtual_memory()[4]

    availableGb = available / (1024 ** 3)
    if availableGb > 14:
        return 16
    elif availableGb > 10:
        return 8
    elif availableGb > 7:
        return 4
    return 1
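
(The VRAM thresholds above are effectively the lookup table the removed version kept in DEVICE_BATCH_SIZE_MAP; a minimal equivalent for reference:)

def batch_size_for(vram_gb):
    # thresholds copied from get_device_batch_size above
    for threshold_gb, batch_size in [(14, 16), (10, 8), (7, 4)]:
        if vram_gb > threshold_gb:
            return batch_size
    return 1

assert batch_size_for(24.0) == 16 and batch_size_for(8.0) == 4 and batch_size_for(4.0) == 1
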
def get_device_count(name=get_device_name()):
    if name == "cuda":
        return torch.cuda.device_count()
    if name == "dml":
        import torch_directml
        return torch_directml.device_count()

    return 1


if has_dml():
    _cumsum = torch.cumsum
    _repeat_interleave = torch.repeat_interleave
    _multinomial = torch.multinomial

    _Tensor_new = torch.Tensor.new
    _Tensor_cumsum = torch.Tensor.cumsum
    _Tensor_repeat_interleave = torch.Tensor.repeat_interleave
    _Tensor_multinomial = torch.Tensor.multinomial

    torch.cumsum = lambda input, *args, **kwargs: ( _cumsum(input.to("cpu"), *args, **kwargs).to(input.device) )
    torch.repeat_interleave = lambda input, *args, **kwargs: ( _repeat_interleave(input.to("cpu"), *args, **kwargs).to(input.device) )
    torch.multinomial = lambda input, *args, **kwargs: ( _multinomial(input.to("cpu"), *args, **kwargs).to(input.device) )

    torch.Tensor.new = lambda self, *args, **kwargs: ( _Tensor_new(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.cumsum = lambda self, *args, **kwargs: ( _Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.repeat_interleave = lambda self, *args, **kwargs: ( _Tensor_repeat_interleave(self.to("cpu"), *args, **kwargs).to(self.device) )
    torch.Tensor.multinomial = lambda self, *args, **kwargs: ( _Tensor_multinomial(self.to("cpu"), *args, **kwargs).to(self.device) )
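
(Every override in the block above follows one pattern: the tensor is moved to the CPU for the op, apparently to work around torch-directml issues, and the result is moved back to the original device. A generic helper expressing the same idea, illustrative and not part of the diff:)

def cpu_fallback(fn):
    # run fn on a CPU copy of its first argument, then return the result to the original device
    def wrapped(input, *args, **kwargs):
        return fn(input.to("cpu"), *args, **kwargs).to(input.device)
    return wrapped

# e.g. torch.cumsum = cpu_fallback(torch.cumsum) is equivalent to the first override above
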
@@ -13,7 +13,15 @@ import math
import numpy as np
import torch
import torch as th
from tqdm.auto import tqdm
from tqdm import tqdm

def tqdm_override(arr, verbose=False, progress=None, desc=None):
    if verbose and desc is not None:
        print(desc)

    if progress is None:
        return tqdm(arr, disable=not verbose)
    return progress.tqdm(arr, desc=f'{progress.msg_prefix} {desc}' if hasattr(progress, 'msg_prefix') else desc, track_tqdm=True)
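
(tqdm_override lets callers route progress either to a console tqdm bar or to a UI progress object exposing its own .tqdm, which is what the progress.tqdm call above anticipates. A usage sketch with an illustrative description:)

for i in tqdm_override(range(200), verbose=True, desc="Sampling"):
    pass  # one unit of work per iteration
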
def normal_kl(mean1, logvar1, mean2, logvar2):
    """
@@ -548,6 +556,7 @@ class GaussianDiffusion:
        model_kwargs=None,
        device=None,
        verbose=False,
        progress=None,
        desc=None
    ):
        """
@@ -580,6 +589,7 @@ class GaussianDiffusion:
            model_kwargs=model_kwargs,
            device=device,
            verbose=verbose,
            progress=progress,
            desc=desc
        ):
            final = sample
@@ -596,6 +606,7 @@ class GaussianDiffusion:
        model_kwargs=None,
        device=None,
        verbose=False,
        progress=None,
        desc=None
    ):
        """
@@ -615,7 +626,7 @@ class GaussianDiffusion:
        img = th.randn(*shape, device=device)
        indices = list(range(self.num_timesteps))[::-1]

        for i in tqdm(indices, desc=desc):
        for i in tqdm_override(indices, verbose=verbose, desc=desc, progress=progress):
            t = th.tensor([i] * shape[0], device=device)
            with th.no_grad():
                out = self.p_sample(
@@ -730,6 +741,7 @@ class GaussianDiffusion:
        device=None,
        verbose=False,
        eta=0.0,
        progress=None,
        desc=None,
    ):
        """
@@ -749,6 +761,7 @@ class GaussianDiffusion:
            device=device,
            verbose=verbose,
            eta=eta,
            progress=progress,
            desc=desc
        ):
            final = sample
@@ -766,6 +779,7 @@ class GaussianDiffusion:
        device=None,
        verbose=False,
        eta=0.0,
        progress=None,
        desc=None,
    ):
        """
@@ -784,7 +798,10 @@ class GaussianDiffusion:
        indices = list(range(self.num_timesteps))[::-1]

        if verbose:
            indices = tqdm(indices, desc=desc)
            # Lazy import so that we don't depend on tqdm.
            from tqdm.auto import tqdm

        indices = tqdm_override(indices, verbose=verbose, desc=desc, progress=progress)

        for i in indices:
            t = th.tensor([i] * shape[0], device=device)
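
(The hunks above all make the same change: verbose, progress and desc are threaded from each sampling entry point down to the loop that actually walks the timesteps. In miniature, with shortened illustrative names:)

def sample_loop(num_timesteps, verbose=False, progress=None, desc=None):
    for t in tqdm_override(list(range(num_timesteps))[::-1], verbose=verbose, desc=desc, progress=progress):
        pass  # one denoising step at timestep t
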
@@ -1,6 +1,5 @@
import os
import re
import json

import inflect
import torch
@@ -171,39 +170,16 @@ DEFAULT_VOCAB_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), '


class VoiceBpeTokenizer:
    def __init__(self, vocab_file=DEFAULT_VOCAB_FILE, preprocess=None):
        with open(vocab_file, 'r', encoding='utf-8') as f:
            vocab = json.load(f)

        self.language = vocab['model']['language'] if 'language' in vocab['model'] else None

        if preprocess is None:
            self.preprocess = 'pre_tokenizer' in vocab and vocab['pre_tokenizer']
        else:
            self.preprocess = preprocess
    def __init__(self, vocab_file=DEFAULT_VOCAB_FILE):
        if vocab_file is not None:
            self.tokenizer = Tokenizer.from_file(vocab_file)

    def preprocess_text(self, txt):
        if self.language == 'ja':
            import pykakasi

            kks = pykakasi.kakasi()
            results = kks.convert(txt)
            words = []

            for result in results:
                words.append(result['kana'])

            txt = " ".join(words)
            txt = basic_cleaners(txt)
        else:
            txt = english_cleaners(txt)
        txt = english_cleaners(txt)
        return txt

    def encode(self, txt):
        if self.preprocess:
            txt = self.preprocess_text(txt)
        txt = self.preprocess_text(txt)
        txt = txt.replace(' ', '[SPACE]')
        return self.tokenizer.encode(txt).ids
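
(A sketch of the encode path above; the resulting ids depend entirely on the vocab file:)

tok = VoiceBpeTokenizer()          # loads DEFAULT_VOCAB_FILE
ids = tok.encode("Hello there")    # text is cleaned, ' ' becomes '[SPACE]', then BPE yields a list of ids
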
@@ -1,65 +0,0 @@
"""
from bitsandbytes.nn import Linear8bitLt as Linear
from bitsandbytes.nn import StableEmbedding as Embedding
from bitsandbytes.optim.adam import Adam8bit as Adam
from bitsandbytes.optim.adamw import AdamW8bit as AdamW
"""
"""
from torch.nn import Linear
from torch.nn import Embedding
from torch.optim.adam import Adam
from torch.optim.adamw import AdamW
"""

"""
OVERRIDE_LINEAR = False
OVERRIDE_EMBEDDING = False
OVERRIDE_ADAM = False # True
OVERRIDE_ADAMW = False # True
"""

import os

USE_STABLE_EMBEDDING = False
try:
    OVERRIDE_LINEAR = False
    OVERRIDE_EMBEDDING = False
    OVERRIDE_ADAM = False
    OVERRIDE_ADAMW = False

    USE_STABLE_EMBEDDING = os.environ.get('BITSANDBYTES_USE_STABLE_EMBEDDING', '1' if USE_STABLE_EMBEDDING else '0') == '1'
    OVERRIDE_LINEAR = os.environ.get('BITSANDBYTES_OVERRIDE_LINEAR', '1' if OVERRIDE_LINEAR else '0') == '1'
    OVERRIDE_EMBEDDING = os.environ.get('BITSANDBYTES_OVERRIDE_EMBEDDING', '1' if OVERRIDE_EMBEDDING else '0') == '1'
    OVERRIDE_ADAM = os.environ.get('BITSANDBYTES_OVERRIDE_ADAM', '1' if OVERRIDE_ADAM else '0') == '1'
    OVERRIDE_ADAMW = os.environ.get('BITSANDBYTES_OVERRIDE_ADAMW', '1' if OVERRIDE_ADAMW else '0') == '1'

    if OVERRIDE_LINEAR or OVERRIDE_EMBEDDING or OVERRIDE_ADAM or OVERRIDE_ADAMW:
        import bitsandbytes as bnb
except Exception as e:
    OVERRIDE_LINEAR = False
    OVERRIDE_EMBEDDING = False
    OVERRIDE_ADAM = False
    OVERRIDE_ADAMW = False

if OVERRIDE_LINEAR:
    from bitsandbytes.nn import Linear8bitLt as Linear
else:
    from torch.nn import Linear

if OVERRIDE_EMBEDDING:
    if USE_STABLE_EMBEDDING:
        from bitsandbytes.nn import StableEmbedding as Embedding
    else:
        from bitsandbytes.nn.modules import Embedding as Embedding
else:
    from torch.nn import Embedding

if OVERRIDE_ADAM:
    from bitsandbytes.optim.adam import Adam8bit as Adam
else:
    from torch.optim.adam import Adam

if OVERRIDE_ADAMW:
    from bitsandbytes.optim.adamw import AdamW8bit as AdamW
else:
    from torch.optim.adamw import AdamW
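
(The deleted module above selects 8-bit bitsandbytes implementations purely through environment variables, so they could be toggled without code changes; the variables just had to be set before the module was first imported. For example:)

import os
os.environ['BITSANDBYTES_OVERRIDE_ADAMW'] = '1'
import tortoise.utils.torch_intermediary as ml   # ml.AdamW is now AdamW8bit, provided bitsandbytes imports cleanly
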
@@ -7,8 +7,6 @@ from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor, Wav2Vec2CTCTo
from tortoise.utils.audio import load_audio
from tortoise.utils.device import get_device

import tortoise.utils.torch_intermediary as ml

def max_alignment(s1, s2, skip_character='~', record=None):
    """
    A clever function that aligns s1 to s2 as best it can. Wherever a character from s1 is not found in s2, a '~' is
@@ -144,7 +142,7 @@ class Wav2VecAlignment:
        non_redacted_intervals = []
        last_point = 0
        for i in range(len(fully_split)):
            if i % 2 == 0 and fully_split[i] != "": # Check for empty string fixes index error
            if i % 2 == 0:
                end_interval = max(0, last_point + len(fully_split[i]) - 1)
                non_redacted_intervals.append((last_point, end_interval))
            last_point += len(fully_split[i])
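
(Why the empty-string guard above matters: when the text begins with a redaction, the first non-redacted chunk is empty and would otherwise contribute a spurious interval. A minimal demonstration:)

fully_split = ['', 'redacted part', ' spoken part']   # alternating non-redacted / redacted chunks
intervals, last_point = [], 0
for i in range(len(fully_split)):
    if i % 2 == 0 and fully_split[i] != "":
        intervals.append((last_point, max(0, last_point + len(fully_split[i]) - 1)))
    last_point += len(fully_split[i])
# intervals == [(13, 24)]; without the guard it would also contain the degenerate (0, 0)
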
tortoise_tts.ipynb (new executable file, 137 lines)
@@ -0,0 +1,137 @@
{
  "nbformat":4,
  "nbformat_minor":0,
  "metadata":{
    "colab":{
      "private_outputs":true,
      "provenance":[

      ]
    },
    "kernelspec":{
      "name":"python3",
      "display_name":"Python 3"
    },
    "language_info":{
      "name":"python"
    },
    "accelerator":"GPU",
    "gpuClass":"standard"
  },
  "cells":[
    {
      "cell_type":"markdown",
      "source":[
        "## Initialization"
      ],
      "metadata":{
        "id":"ni41hmE03DL6"
      }
    },
    {
      "cell_type":"code",
      "execution_count":null,
      "metadata":{
        "id":"FtsMKKfH18iM"
      },
      "outputs":[

      ],
      "source":[
        "!git clone https://git.ecker.tech/mrq/ai-voice-cloning/\n",
        "%cd ai-voice-cloning\n",
        "!python -m pip install --upgrade pip\n",
        "!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116\n",
        "!python -m pip install -r ./requirements.txt"
      ]
    },
    {
      "cell_type":"code",
      "source":[
        "# colab requires the runtime to restart before use\n",
        "exit()"
      ],
      "metadata":{
        "id":"FVUOtSASCSJ8"
      },
      "execution_count":null,
      "outputs":[

      ]
    },
    {
      "cell_type":"markdown",
      "source":[
        "## Running"
      ],
      "metadata":{
        "id":"o1gkfw3B3JSk"
      }
    },
    {
      "cell_type":"code",
      "source":[
        "%cd /content/ai-voice-cloning\n",
        "\n",
        "import os\n",
        "import sys\n",
        "\n",
        "sys.argv = [\"\"]\n",
        "sys.path.append('./src/')\n",
        "\n",
        "if 'TORTOISE_MODELS_DIR' not in os.environ:\n",
        "\tos.environ['TORTOISE_MODELS_DIR'] = os.path.realpath(os.path.join(os.getcwd(), './models/tortoise/'))\n",
        "\n",
        "if 'TRANSFORMERS_CACHE' not in os.environ:\n",
        "\tos.environ['TRANSFORMERS_CACHE'] = os.path.realpath(os.path.join(os.getcwd(), './models/transformers/'))\n",
        "\n",
        "from utils import *\n",
        "from webui import *\n",
        "\n",
        "args = setup_args()\n",
        "\n",
        "webui = setup_gradio()\n",
        "tts = setup_tortoise()\n",
        "webui.launch(share=True, prevent_thread_lock=True, height=1000)\n",
        "webui.block_thread()"
      ],
      "metadata":{
        "id":"c_EQZLTA19c7"
      },
      "execution_count":null,
      "outputs":[

      ]
    },
    {
      "cell_type":"markdown",
      "source":[
        "## Exporting"
      ],
      "metadata":{
        "id":"2AnVQxEJx47p"
      }
    },
    {
      "cell_type":"code",
      "source":[
        "%cd /content/ai-voice-cloning\n",
        "!apt install -y p7zip-full\n",
        "from datetime import datetime\n",
        "timestamp = datetime.now().strftime('%m-%d-%Y_%H:%M:%S')\n",
        "!mkdir -p \"../{timestamp}\"\n",
        "!mv ./results/* \"../{timestamp}/.\"\n",
        "!7z a -t7z -m0=lzma2 -mx=9 -mfb=64 -md=32m -ms=on \"../{timestamp}.7z\" \"../{timestamp}/\"\n",
        "!ls ~/\n",
        "!echo \"Finished zipping, archive is available at {timestamp}.7z\""
      ],
      "metadata":{
        "id":"YOACiDCXx72G"
      },
      "execution_count":null,
      "outputs":[

      ]
    }
  ]
}
update-force.bat (new executable file, 3 lines)
@@ -0,0 +1,3 @@
git fetch --all
git reset --hard origin/main
call .\update.bat
update-force.sh (new executable file, 3 lines)
@@ -0,0 +1,3 @@
git fetch --all
git reset --hard origin/main
./update.sh
update.bat (new executable file, 7 lines)
@@ -0,0 +1,7 @@
git pull
python -m venv tortoise-venv
call .\tortoise-venv\Scripts\activate.bat
python -m pip install --upgrade pip
python -m pip install -r ./requirements.txt
deactivate
pause
update.sh (new executable file, 6 lines)
@@ -0,0 +1,6 @@
git pull
python -m venv tortoise-venv
source ./tortoise-venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r ./requirements.txt
deactivate
voices/.gitkeep (new executable file, 0 lines)