Compare commits

...

No commits in common. "main" and "main" have entirely different histories.
main ... main

42 changed files with 1951 additions and 1609 deletions


@ -1,7 +1,5 @@
# (QoL improvements for) TorToiSe
This repo is for my modifications to [neonbjb/tortoise-tts](https://github.com/neonbjb/tortoise-tts). If you need the original README, refer to the original repo.
This repo is for my modifications to [neonbjb/tortoise-tts](https://github.com/neonbjb/tortoise-tts).
\> w-where'd everything go?
Please migrate to [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning), as that repo is the more cohesive package for voice cloning.
For the original repo, please go to [mrq/ai-voice-cloning](https://git.ecker.tech/mrq/ai-voice-cloning).

283
README_OLD.md Executable file

@ -0,0 +1,283 @@
# TorToiSe
Tortoise is a text-to-speech program built with the following priorities:
1. Strong multi-voice capabilities.
2. Highly realistic prosody and intonation.
This repo contains all the code needed to run Tortoise TTS in inference mode.
A (*very*) rough draft of the Tortoise paper is now available in doc format. I would definitely appreciate any comments, suggestions or reviews:
https://docs.google.com/document/d/13O_eyY65i6AkNrN_LdPhpUjGhyTNKYHvDrIvHnHe1GA
### Version history
#### v2.4; 2022/5/17
- Removed CVVP model. Found that it does not, in fact, make an appreciable difference in the output.
- Add better debugging support; existing tools now spit out debug files which can be used to reproduce bad runs.
#### v2.3; 2022/5/12
- New CLVP-large model for further improved decoding guidance.
- Improvements to read.py and do_tts.py (new options)
#### v2.2; 2022/5/5
- Added several new voices from the training set.
- Automated redaction. Wrap the text you want to use to prompt the model but not be spoken in brackets.
- Bug fixes
#### v2.1; 2022/5/2
- Added ability to produce totally random voices.
- Added ability to download voice conditioning latent via a script, and then use a user-provided conditioning latent.
- Added ability to use your own pretrained models.
- Refactored directory structures.
- Performance improvements & bug fixes.
## What's in a name?
I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model
is insanely slow. It leverages both an autoregressive decoder **and** a diffusion decoder; both known for their low
sampling rates. On a K80, expect to generate a medium sized sentence every 2 minutes.
## Demos
See [this page](http://nonint.com/static/tortoise_v2_examples.html) for a large list of example outputs.
Cool application of Tortoise+GPT-3 (not by me): https://twitter.com/lexman_ai
## Usage guide
### Colab
Colab is the easiest way to try this out. I've put together a notebook you can use here:
https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing
### Local Installation
If you want to use this on your own computer, you must have an NVIDIA GPU.
First, install pytorch using these instructions: [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/).
On Windows, I **highly** recommend using the Conda installation path. I have been told that if you do not do this, you
will spend a lot of time chasing dependency problems.
Next, install TorToiSe and its dependencies:
```shell
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python -m pip install -r ./requirements.txt
python setup.py install
```
If you are on Windows, you will also need to install pysoundfile: `conda install -c conda-forge pysoundfile`
### do_tts.py
This script allows you to speak a single phrase with one or more voices.
```shell
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
```
### read.py
This script provides tools for reading large amounts of text.
```shell
python tortoise/read.py --textfile <your text to be read> --voice random
```
This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series
of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and
output that as well.
Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running `read.py` with the --regenerate
argument.
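The split-then-combine flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `read.py`: `split_sentences`, `read_text` and `fake_synthesize` are hypothetical names, the regex splitter is a naive stand-in for real sentence segmentation, and the "clips" are plain lists of samples rather than audio tensors.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def read_text(text, synthesize):
    """Synthesize each sentence separately, then concatenate the clips."""
    clips = [synthesize(s) for s in split_sentences(text)]
    combined = [sample for clip in clips for sample in clip]
    return clips, combined

# Hypothetical stand-in for the TTS call: one fake "sample" per character.
def fake_synthesize(sentence):
    return [0.0] * len(sentence)

clips, combined = read_text("Hello there. How are you? Fine!", fake_synthesize)
```

Keeping the per-sentence clips around, as well as the combined result, is what would make it possible to regenerate a single bad clip without redoing the rest.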
### API
Tortoise can be used programmatically, like so:
```python
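from tortoise import api
from tortoise.utils.audio import load_audio
# clips_paths: paths to the reference WAV clips for the target voice
reference_clips = [load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')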
```
## Voice customization guide
Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.
These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.
### Random voice
I've included a feature which randomly generates a voice. These voices don't actually exist and will be random every time you run
it. The results are quite fascinating and I recommend you play around with it!
You can use the random voice by passing in 'random' as the voice name. Tortoise will take care of the rest.
For those in the ML space: this is created by projecting a random vector onto the voice conditioning latent space.
### Provided voices
This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform
far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see
what Tortoise can do for zero-shot mimicking, take a look at the others.
### Adding a new voice
To add new voices to Tortoise, you will need to do the following:
1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
3. Save the clips as a WAV file with floating point format and a 22,050 sample rate.
4. Create a subdirectory in voices/
5. Put your clips in that subdirectory.
6. Run tortoise utilities with --voice=<your_subdirectory_name>.
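Step 2 above (cutting a recording into ~10 second segments) comes down to simple index arithmetic at the 22,050 Hz sample rate from step 3. The sketch below only computes the segment boundaries; `segment_bounds` is a hypothetical helper, and loading the audio and writing the floating point WAV files is left to whichever audio library you use.

```python
SAMPLE_RATE = 22050  # sample rate named in step 3

def segment_bounds(total_samples, seconds=10.0, sample_rate=SAMPLE_RATE):
    """Return (start, end) sample indices for consecutive full-length segments.

    A trailing remainder shorter than one segment is dropped.
    """
    seg_len = int(seconds * sample_rate)
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    return [(start, start + seg_len)
            for start in range(0, total_samples - seg_len + 1, seg_len)]

# A 35-second recording yields three 10-second segments.
bounds = segment_bounds(35 * SAMPLE_RATE)
```

Dropping the short remainder is a design choice made for this sketch; you could equally keep it if the leftover audio is clean and long enough to be useful.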
### Picking good reference clips
As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking
good clips:
1. Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
2. Avoid speeches. These generally have distortion caused by the amplification system.
3. Avoid clips from phone calls.
4. Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
5. Try to find clips that are spoken the way you want your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
6. The text being spoken in the clips does not matter, but diverse text does seem to perform better.
## Advanced Usage
### Generation settings
Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs
that can be turned that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using
various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've
set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with
these settings (and it's very likely that I missed something!)
These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See
```api.tts``` for a full list.
### Prompt engineering
Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion
by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to
take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the
prompt "\[I am really sad,\] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
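The text side of this bracket convention can be illustrated with a short sketch. It is an assumption-laden illustration, not Tortoise's redaction code: `split_redacted` is a hypothetical helper that only separates the full prompt from the part that should remain audible, and removing the bracketed portion from the generated audio is not shown.

```python
import re

def split_redacted(prompt):
    """Return (conditioning_text, spoken_text) for a bracketed prompt."""
    # Text fed to the model: the full prompt with the brackets removed.
    conditioning = re.sub(r'[\[\]]', '', prompt)
    # Text expected in the output audio: bracketed spans dropped entirely.
    spoken = re.sub(r'\[[^\]]*\]', '', prompt)
    # Collapse whitespace left behind by the removals.
    return ' '.join(conditioning.split()), ' '.join(spoken.split())

conditioning, spoken = split_redacted("[I am really sad,] Please feed me.")
```

Here `conditioning` keeps the emotional cue so it can influence delivery, while `spoken` holds only the words that should be heard.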
### Playing with the voice latent
Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent,
then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents
are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.
This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output
what it thinks the "average" of those two voices sounds like.
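The averaging step can be pictured with a toy example. This is a sketch under simplifying assumptions: plain Python lists stand in for the latent tensors, the three-element vectors are made up, and `mean_latent` is a hypothetical helper rather than part of the Tortoise API.

```python
def mean_latent(latents):
    """Element-wise mean of equally sized latent vectors."""
    if not latents:
        raise ValueError("need at least one latent")
    dim = len(latents[0])
    if any(len(v) != dim for v in latents):
        raise ValueError("all latents must have the same length")
    return [sum(values) / len(latents) for values in zip(*latents)]

# Made-up point latents for two voices.
voice_a = [0.0, 2.0, 4.0]
voice_b = [2.0, 0.0, 4.0]
blended = mean_latent([voice_a, voice_b])
```

Because the result is a plain mean, every clip contributes equally; a voice with more reference clips would dominate a blend unless each voice is averaged on its own first.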
#### Generating conditioning latents from voices
Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script
will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).
Alternatively, use the api.TextToSpeech.get_conditioning_latents() to fetch the latents.
#### Using raw conditioning latents to generate speech
After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single
".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).
### Send me feedback!
Probabilistic models like Tortoise are best thought of as an "augmented search" - in this case, through the space of possible
utterances of a specific string of text. The impact of community involvement in perusing these spaces (such as is being done with
GPT-3 or CLIP) has really surprised me. If you find something neat that you can do with Tortoise that isn't documented here,
please report it to me! I would be glad to publish it to this page.
## Tortoise-detect
Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip
came from Tortoise.
This classifier can be run on any computer. Usage is as follows:
```commandline
python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
```
This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier
as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false
positives.
## Model architecture
Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate
models that work together. I've assembled a write-up of the system architecture here:
[https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/)
## Training
These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of
~50k hours of speech data, most of which was transcribed by [ocotillo](http://www.github.com/neonbjb/ocotillo). Training was done on my own
[DLAS](https://github.com/neonbjb/DL-Art-School) trainer.
I currently do not have plans to release the training configurations or methodology. See the next section.
## Ethical Considerations
Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began
wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system
could be misused are many. It doesn't take much creativity to think up how.
After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:
1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See `tortoise-detect` above.
5. If I, a tinkerer with a BS in computer science with a ~$15k computer can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.
### Diversity
The diversity expressed by ML models is strongly tied to the datasets they were trained on.
Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to
balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities
or of people who speak with strong accents.
## Looking forward
Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when
training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training
of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with
exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.
I want to mention here
that I think Tortoise could be a **lot** better. The three major components of Tortoise are either vanilla Transformer Encoder stacks
or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason
to believe that the same is not true of TTS.
The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller than the original DALLE transformer.
Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve.
If you are an ethical organization with computational resources to spare interested in seeing what this model could do
if properly scaled out, please reach out to me! I would love to collaborate on this.
## Acknowledgements
This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to
credit a few of the amazing folks in the community that have helped make this happen:
- Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
- [Ramesh et al](https://arxiv.org/pdf/2102.12092.pdf) who authored the DALLE paper, which is the inspiration behind Tortoise.
- [Nichol and Dhariwal](https://arxiv.org/pdf/2102.09672.pdf) who authored (a revision of) the code that drives the diffusion model.
- [Jang et al](https://arxiv.org/pdf/2106.07889.pdf) who developed and open-sourced univnet, the vocoder this repo uses.
- [Kim and Jung](https://github.com/mindslab-ai/univnet) who implemented the univnet pytorch model.
- [lucidrains](https://github.com/lucidrains) who writes awesome open source pytorch models, many of which are used here.
- [Patrick von Platen](https://huggingface.co/patrickvonplaten) whose guides on setting up wav2vec were invaluable to building my dataset.
## Notice
Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.
If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.

5
list_devices.py Executable file

@ -0,0 +1,5 @@
import torch
devices = [f"cuda:{i} => {torch.cuda.get_device_name(i)}" for i in range(torch.cuda.device_count())]
print(devices)

34
main.py Executable file

@ -0,0 +1,34 @@
import os
import webui as mrq

# Default the model and cache directories to ./models/ unless already set.
if 'TORTOISE_MODELS_DIR' not in os.environ:
    os.environ['TORTOISE_MODELS_DIR'] = os.path.realpath(os.path.join(os.getcwd(), './models/tortoise/'))
if 'TRANSFORMERS_CACHE' not in os.environ:
    os.environ['TRANSFORMERS_CACHE'] = os.path.realpath(os.path.join(os.getcwd(), './models/transformers/'))

if __name__ == "__main__":
    mrq.args = mrq.setup_args()
    if mrq.args.listen_path is not None and mrq.args.listen_path != "/":
        import uvicorn
        # Fall back to port 8000 only when no port was supplied.
        uvicorn.run("main:app", host=mrq.args.listen_host, port=mrq.args.listen_port if mrq.args.listen_port is not None else 8000)
    else:
        mrq.webui = mrq.setup_gradio()
        mrq.webui.launch(share=mrq.args.share, prevent_thread_lock=True, server_name=mrq.args.listen_host, server_port=mrq.args.listen_port)
        mrq.tts = mrq.setup_tortoise()
        mrq.webui.block_thread()
elif __name__ == "main":
    # Imported by uvicorn as "main:app": mount the Gradio UI on a FastAPI app.
    from fastapi import FastAPI
    import gradio as gr
    import sys
    sys.argv = [sys.argv[0]]
    app = FastAPI()
    mrq.args = mrq.setup_args()
    mrq.webui = mrq.setup_gradio()
    app = gr.mount_gradio_app(app, mrq.webui, path=mrq.args.listen_path)
    mrq.tts = mrq.setup_tortoise()


@ -7,9 +7,9 @@ progressbar
einops
unidecode
scipy
librosa==0.8.1
librosa
torchaudio
threadpoolctl
appdirs
numpy<=1.23.5
numpy
numba

8
setup-cuda.bat Executable file

@ -0,0 +1,8 @@
python -m venv tortoise-venv
call .\tortoise-venv\Scripts\activate.bat
python -m pip install --upgrade pip
python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
python -m pip install -r ./requirements.txt
python -m pip install -r ./requirements_legacy.txt
deactivate
pause

8
setup-cuda.sh Executable file

@ -0,0 +1,8 @@
python -m venv tortoise-venv
source ./tortoise-venv/bin/activate
python -m pip install --upgrade pip
# CUDA
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
python -m pip install -r ./requirements.txt
python -m pip install -r ./requirements_legacy.txt
deactivate

8
setup-directml.bat Executable file

@ -0,0 +1,8 @@
python -m venv tortoise-venv
call .\tortoise-venv\Scripts\activate.bat
python -m pip install --upgrade pip
python -m pip install torch torchvision torchaudio torch-directml
python -m pip install -r ./requirements.txt
python -m pip install -r ./requirements_legacy.txt
deactivate
pause

8
setup-rocm.sh Executable file

@ -0,0 +1,8 @@
python -m venv tortoise-venv
source ./tortoise-venv/bin/activate
python -m pip install --upgrade pip
# ROCM
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1 # 5.2 does not work for me desu
python -m pip install -r ./requirements.txt
python -m pip install -r ./requirements_legacy.txt
deactivate

8
setup.py Normal file → Executable file

@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf-8") as fh:
setuptools.setup(
name="TorToiSe",
packages=setuptools.find_packages(),
version="2.4.5",
version="2.4.3",
author="James Betker",
author_email="james@adamant.ai",
description="A high quality multi-voice text-to-speech library",
@ -29,12 +29,6 @@ setuptools.setup(
'librosa',
'transformers',
'tokenizers',
'transformers==4.19',
'torchaudio',
'threadpoolctl',
'appdirs',
'numpy',
'numba',
],
classifiers=[
"Programming Language :: Python :: 3",

4
start.bat Executable file

@ -0,0 +1,4 @@
call .\tortoise-venv\Scripts\activate.bat
python main.py
deactivate
pause

3
start.sh Executable file

@ -0,0 +1,3 @@
source ./tortoise-venv/bin/activate
python3 ./main.py
deactivate


@ -5,7 +5,6 @@ import gc
from time import time
from urllib import request
from urllib.request import ProxyHandler, build_opener, install_opener
import torch
import torch.nn.functional as F
@ -22,14 +21,12 @@ from tortoise.models.clvp import CLVP
from tortoise.models.cvvp import CVVP
from tortoise.models.random_latent_generator import RandomLatentConverter
from tortoise.models.vocoder import UnivNetGenerator
from tortoise.models.bigvgan import BigVGAN
from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel
from tortoise.utils.diffusion import SpacedDiffusion, space_timesteps, get_named_beta_schedule
from tortoise.utils.tokenizer import VoiceBpeTokenizer
from tortoise.utils.wav2vec_alignment import Wav2VecAlignment
from tortoise.utils.device import get_device, get_device_name, get_device_batch_size, print_stats, do_gc
from tortoise.utils.device import get_device, get_device_name, get_device_batch_size
pbar = None
STOP_SIGNAL = False
@ -43,46 +40,21 @@ MODELS = {
'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth',
'rlg_auto.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth',
'rlg_diffuser.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth',
'bigvgan_base_24khz_100band.pth': 'https://huggingface.co/ecker/tortoise-tts-models/resolve/main/models/bigvgan_base_24khz_100band.pth',
'bigvgan_24khz_100band.pth': 'https://huggingface.co/ecker/tortoise-tts-models/resolve/main/models/bigvgan_24khz_100band.pth',
'bigvgan_base_24khz_100band.json': 'https://huggingface.co/ecker/tortoise-tts-models/resolve/main/models/bigvgan_base_24khz_100band.json',
'bigvgan_24khz_100band.json': 'https://huggingface.co/ecker/tortoise-tts-models/resolve/main/models/bigvgan_24khz_100band.json',
}
def hash_file(path, algo="md5", buffer_size=0):
import hashlib
hash = None
if algo == "md5":
hash = hashlib.md5()
elif algo == "sha1":
hash = hashlib.sha1()
else:
raise Exception(f'Unknown hash algorithm specified: {algo}')
if not os.path.exists(path):
raise Exception(f'Path not found: {path}')
with open(path, 'rb') as f:
if buffer_size > 0:
while True:
data = f.read(buffer_size)
if not data:
break
hash.update(data)
else:
hash.update(f.read())
return "{0}".format(hash.hexdigest())
def check_for_kill_signal():
def tqdm_override(arr, verbose=False, progress=None, desc=None):
global STOP_SIGNAL
if STOP_SIGNAL:
STOP_SIGNAL = False
raise Exception("Kill signal detected")
if verbose and desc is not None:
print(desc)
if progress is None:
return tqdm(arr, disable=not verbose)
return progress.tqdm(arr, desc=f'{progress.msg_prefix} {desc}' if hasattr(progress, 'msg_prefix') else desc, track_tqdm=True)
def download_models(specific_models=None):
"""
Call to download all the models that Tortoise uses.
@ -109,11 +81,6 @@ def download_models(specific_models=None):
if os.path.exists(model_path):
continue
print(f'Downloading {model_name} from {url}...')
proxy = ProxyHandler({})
opener = build_opener(proxy)
opener.addheaders = [('User-Agent','mrq/AI-Voice-Cloning')]
install_opener(opener)
request.urlretrieve(url, model_path, show_progress)
print('Done.')
@ -150,7 +117,7 @@ def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusi
model_var_type='learned_range', loss_type='mse', betas=get_named_beta_schedule('linear', trained_diffusion_steps),
conditioning_free=cond_free, conditioning_free_k=cond_free_k)
@torch.inference_mode()
def format_conditioning(clip, cond_length=132300, device='cuda', sampling_rate=22050):
"""
Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models.
@ -162,8 +129,8 @@ def format_conditioning(clip, cond_length=132300, device='cuda', sampling_rate=2
rand_start = random.randint(0, gap)
clip = clip[:, rand_start:rand_start + cond_length]
mel_clip = TorchMelSpectrogram(sampling_rate=sampling_rate)(clip.unsqueeze(0)).squeeze(0)
mel_clip = mel_clip.unsqueeze(0)
return migrate_to_device(mel_clip, device)
return mel_clip.unsqueeze(0).to(device)
def fix_autoregressive_output(codes, stop_token, complain=True):
"""
@ -194,8 +161,8 @@ def fix_autoregressive_output(codes, stop_token, complain=True):
return codes
@torch.inference_mode()
def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True, desc=None, sampler="P", input_sample_rate=22050, output_sample_rate=24000):
def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True, progress=None, desc=None, sampler="P", input_sample_rate=22050, output_sample_rate=24000):
"""
Uses the specified diffusion model to convert discrete codes into a spectrogram.
"""
@ -208,7 +175,8 @@ def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_la
diffuser.sampler = sampler.lower()
mel = diffuser.sample_loop(diffusion_model, output_shape, noise=noise,
model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings}, desc=desc)
model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
verbose=verbose, progress=progress, desc=desc)
mel = denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
if get_device_name() == "dml":
@ -230,37 +198,12 @@ def classify_audio_clip(clip):
results = F.softmax(classifier(clip), dim=-1)
return results[0][0]
def migrate_to_device( t, device ):
if t is None:
return t
if not hasattr(t, 'device'):
t.device = device
t.manually_track_device = True
elif t.device == device:
return t
if hasattr(t, 'manually_track_device') and t.manually_track_device:
t.device = device
t = t.to(device)
do_gc()
return t
class TextToSpeech:
"""
Main entry point into Tortoise.
"""
def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, device=None,
minor_optimizations=True,
unsqueeze_sample_batches=False,
input_sample_rate=22050, output_sample_rate=24000,
autoregressive_model_path=None, diffusion_model_path=None, vocoder_model=None, tokenizer_json=None,
# ):
use_deepspeed=False): # Add use_deepspeed parameter
def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, device=None, minor_optimizations=True, input_sample_rate=22050, output_sample_rate=24000):
"""
Constructor
:param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing
@ -272,17 +215,13 @@ class TextToSpeech:
Default is true.
:param device: Device to use when running the model. If omitted, the device will be automatically chosen.
"""
self.loading = True
if device is None:
device = get_device(verbose=True)
self.version = [2,4,4] # to-do, autograb this from setup.py, or have setup.py autograb this
self.input_sample_rate = input_sample_rate
self.output_sample_rate = output_sample_rate
self.minor_optimizations = minor_optimizations
self.unsqueeze_sample_batches = unsqueeze_sample_batches
self.use_deepspeed = use_deepspeed # Store use_deepspeed as an instance variable
print(f'use_deepspeed api_debug {use_deepspeed}')
# for clarity, it's simpler to split these up and just predicate them on requesting VRAM-consuming optimizations
self.preloaded_tensors = minor_optimizations
self.use_kv_cache = minor_optimizations
@ -297,23 +236,24 @@ class TextToSpeech:
if self.enable_redaction:
self.aligner = Wav2VecAlignment(device='cpu' if get_device_name() == "dml" else self.device)
self.load_tokenizer_json(tokenizer_json)
self.tokenizer = VoiceBpeTokenizer()
if os.path.exists(f'{models_dir}/autoregressive.ptt'):
# Assume this is a traced directory.
self.autoregressive = torch.jit.load(f'{models_dir}/autoregressive.ptt')
else:
if not autoregressive_model_path or not os.path.exists(autoregressive_model_path):
autoregressive_model_path = get_model_path('autoregressive.pth', models_dir)
self.load_autoregressive_model(autoregressive_model_path)
if os.path.exists(f'{models_dir}/diffusion_decoder.ptt'):
self.diffusion = torch.jit.load(f'{models_dir}/diffusion_decoder.ptt')
else:
if not diffusion_model_path or not os.path.exists(diffusion_model_path):
diffusion_model_path = get_model_path('diffusion_decoder.pth', models_dir)
self.autoregressive = UnifiedVoice(max_mel_tokens=604, max_text_tokens=402, max_conditioning_inputs=2, layers=30,
model_dim=1024,
heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False,
train_solo_embeddings=False).cpu().eval()
self.autoregressive.load_state_dict(torch.load(get_model_path('autoregressive.pth', models_dir)))
self.autoregressive.post_init_gpt2_config(kv_cache=self.use_kv_cache)
self.load_diffusion_model(diffusion_model_path)
self.diffusion = DiffusionTts(model_channels=1024, num_layers=10, in_channels=100, out_channels=200,
in_latent_channels=1024, in_tokens=8193, dropout=0, use_fp16=False, num_heads=16,
layer_drop=0, unconditioned_percentage=0).cpu().eval()
self.diffusion.load_state_dict(torch.load(get_model_path('diffusion_decoder.pth', models_dir)))
self.clvp = CLVP(dim_text=768, dim_speech=768, dim_latent=768, num_text_tokens=256, text_enc_depth=20,
@ -323,168 +263,19 @@ class TextToSpeech:
self.clvp.load_state_dict(torch.load(get_model_path('clvp2.pth', models_dir)))
self.cvvp = None # CVVP model is only loaded if used.
self.vocoder_model = vocoder_model
self.load_vocoder_model(self.vocoder_model)
self.vocoder = UnivNetGenerator().cpu()
self.vocoder.load_state_dict(torch.load(get_model_path('vocoder.pth', models_dir), map_location=torch.device('cpu'))['model_g'])
self.vocoder.eval(inference=True)
# Random latent generators (RLGs) are loaded lazily.
self.rlg_auto = None
self.rlg_diffusion = None
if self.preloaded_tensors:
self.autoregressive = migrate_to_device( self.autoregressive, self.device )
self.diffusion = migrate_to_device( self.diffusion, self.device )
self.clvp = migrate_to_device( self.clvp, self.device )
self.vocoder = migrate_to_device( self.vocoder, self.device )
self.loading = False
def load_autoregressive_model(self, autoregressive_model_path, is_xtts=False):
if hasattr(self,"autoregressive_model_path") and os.path.samefile(self.autoregressive_model_path, autoregressive_model_path):
return
self.autoregressive_model_path = autoregressive_model_path if autoregressive_model_path and os.path.exists(autoregressive_model_path) else get_model_path('autoregressive.pth', self.models_dir)
new_hash = hash_file(self.autoregressive_model_path)
if hasattr(self,"autoregressive_model_hash") and self.autoregressive_model_hash == new_hash:
return
self.autoregressive_model_hash = new_hash
self.loading = True
print(f"Loading autoregressive model: {self.autoregressive_model_path}")
if hasattr(self, 'autoregressive'):
del self.autoregressive
# XTTS requires a different "dimensionality" for its autoregressive model
if new_hash == "e4ce21eae0043f7691d6a6c8540b74b8" or is_xtts:
dimensionality = {
"max_mel_tokens": 605,
"max_text_tokens": 402,
"max_prompt_tokens": 70,
"max_conditioning_inputs": 1,
"layers": 30,
"model_dim": 1024,
"heads": 16,
"number_text_tokens": 5023, # -1
"start_text_token": 261,
"stop_text_token": 0,
"number_mel_codes": 8194,
"start_mel_token": 8192,
"stop_mel_token": 8193,
}
else:
dimensionality = {
"max_mel_tokens": 604,
"max_text_tokens": 402,
"max_conditioning_inputs": 2,
"layers": 30,
"model_dim": 1024,
"heads": 16,
"number_text_tokens": 255,
"start_text_token": 255,
"checkpointing": False,
"train_solo_embeddings": False
}
self.autoregressive = UnifiedVoice(**dimensionality).cpu().eval()
self.autoregressive.load_state_dict(torch.load(self.autoregressive_model_path))
self.autoregressive.post_init_gpt2_config(use_deepspeed=self.use_deepspeed, kv_cache=self.use_kv_cache)
if self.preloaded_tensors:
self.autoregressive = migrate_to_device( self.autoregressive, self.device )
self.loading = False
print(f"Loaded autoregressive model")
def load_diffusion_model(self, diffusion_model_path):
if hasattr(self,"diffusion_model_path") and os.path.samefile(self.diffusion_model_path, diffusion_model_path):
return
self.loading = True
self.diffusion_model_path = diffusion_model_path if diffusion_model_path and os.path.exists(diffusion_model_path) else get_model_path('diffusion_decoder.pth', self.models_dir)
self.diffusion_model_hash = hash_file(self.diffusion_model_path)
if hasattr(self, 'diffusion'):
del self.diffusion
# XTTS does not require a different "dimensionality" for its diffusion model
dimensionality = {
"model_channels": 1024,
"num_layers": 10,
"in_channels": 100,
"out_channels": 200,
"in_latent_channels": 1024,
"in_tokens": 8193,
"dropout": 0,
"use_fp16": False,
"num_heads": 16,
"layer_drop": 0,
"unconditioned_percentage": 0
}
self.diffusion = DiffusionTts(**dimensionality)
self.diffusion.load_state_dict(torch.load(self.diffusion_model_path))
if self.preloaded_tensors:
self.diffusion = migrate_to_device( self.diffusion, self.device )
self.loading = False
print(f"Loaded diffusion model")
def load_vocoder_model(self, vocoder_model):
if hasattr(self,"vocoder_model_path") and os.path.samefile(self.vocoder_model_path, vocoder_model):
return
self.loading = True
if hasattr(self, 'vocoder'):
del self.vocoder
print("Loading vocoder model:", vocoder_model)
if vocoder_model is None:
vocoder_model = 'bigvgan_24khz_100band'
if 'bigvgan' in vocoder_model:
# credit to https://github.com/deviandice / https://git.ecker.tech/mrq/ai-voice-cloning/issues/52
vocoder_key = 'generator'
self.vocoder_model_path = 'bigvgan_24khz_100band.pth'
if f'{vocoder_model}.pth' in MODELS:
self.vocoder_model_path = f'{vocoder_model}.pth'
vocoder_config = 'bigvgan_24khz_100band.json'
if f'{vocoder_model}.json' in MODELS:
vocoder_config = f'{vocoder_model}.json'
vocoder_config = get_model_path(vocoder_config, self.models_dir)
self.vocoder = BigVGAN(config=vocoder_config).cpu()
#elif vocoder_model == "univnet":
else:
vocoder_key = 'model_g'
self.vocoder_model_path = 'vocoder.pth'
self.vocoder = UnivNetGenerator().cpu()
print(f"Loading vocoder model: {self.vocoder_model_path}")
self.vocoder.load_state_dict(torch.load(get_model_path(self.vocoder_model_path, self.models_dir), map_location=torch.device('cpu'))[vocoder_key])
self.vocoder.eval(inference=True)
if self.preloaded_tensors:
self.vocoder = migrate_to_device( self.vocoder, self.device )
self.loading = False
print(f"Loaded vocoder model")
def load_tokenizer_json(self, tokenizer_json):
if hasattr(self,"tokenizer_json") and os.path.samefile(self.tokenizer_json, tokenizer_json):
return
self.loading = True
self.tokenizer_json = tokenizer_json if tokenizer_json else os.path.join(os.path.dirname(os.path.realpath(__file__)), '../tortoise/data/tokenizer.json')
print("Loading tokenizer JSON:", self.tokenizer_json)
if hasattr(self, 'tokenizer'):
del self.tokenizer
self.tokenizer = VoiceBpeTokenizer(vocab_file=self.tokenizer_json)
self.loading = False
print(f"Loaded tokenizer")
self.autoregressive = self.autoregressive.to(self.device)
self.diffusion = self.diffusion.to(self.device)
self.clvp = self.clvp.to(self.device)
self.vocoder = self.vocoder.to(self.device)
def load_cvvp(self):
"""Load CVVP model."""
@ -493,17 +284,15 @@ class TextToSpeech:
self.cvvp.load_state_dict(torch.load(get_model_path('cvvp.pth', self.models_dir)))
if self.preloaded_tensors:
self.cvvp = migrate_to_device( self.cvvp, self.device )
self.cvvp = self.cvvp.to(self.device)
@torch.inference_mode()
def get_conditioning_latents(self, voice_samples, return_mels=False, verbose=False, slices=1, max_chunk_size=None, force_cpu=False, original_ar=False, original_diffusion=False):
def get_conditioning_latents(self, voice_samples, return_mels=False, verbose=False, progress=None, slices=1, max_chunk_size=None, force_cpu=False):
"""
Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
properties.
:param voice_samples: List of 2 or more ~10 second reference clips, which should be torch tensors containing 22.05kHz waveform data.
"""
with torch.no_grad():
# computing conditional latents requires being done on the CPU if using DML because M$ still hasn't implemented some core functions
if get_device_name() == "dml":
@ -513,75 +302,70 @@ class TextToSpeech:
if not isinstance(voice_samples, list):
voice_samples = [voice_samples]
resampler_22K = torchaudio.transforms.Resample(
voice_samples = [v.to(device) for v in voice_samples]
resampler = torchaudio.transforms.Resample(
self.input_sample_rate,
22050,
self.output_sample_rate,
lowpass_filter_width=16,
rolloff=0.85,
resampling_method="kaiser_window",
beta=8.555504641634386,
).to(device)
resampler_24K = torchaudio.transforms.Resample(
self.input_sample_rate,
24000,
lowpass_filter_width=16,
rolloff=0.85,
resampling_method="kaiser_window",
beta=8.555504641634386,
).to(device)
voice_samples = [migrate_to_device(v, device) for v in voice_samples]
)
samples = []
auto_conds = []
diffusion_conds = []
for sample in voice_samples:
auto_conds.append(format_conditioning(sample, device=device, sampling_rate=self.input_sample_rate))
samples.append(resampler(sample.cpu()).to(device)) # icky no good, easier to do the resampling on CPU than figure out how to do it on GPU
if original_ar:
samples = [resampler_22K(sample) for sample in voice_samples]
for sample in tqdm(samples, desc="Computing AR conditioning latents..."):
auto_conds.append(format_conditioning(sample, device=device, sampling_rate=self.input_sample_rate, cond_length=132300))
auto_conds = torch.stack(auto_conds, dim=1)
self.autoregressive = self.autoregressive.to(device)
auto_latent = self.autoregressive.get_conditioning(auto_conds)
if self.preloaded_tensors:
self.autoregressive = self.autoregressive.to(self.device)
else:
samples = [resampler_22K(sample) for sample in voice_samples]
concat = torch.cat(samples, dim=-1)
chunk_size = concat.shape[-1]
self.autoregressive = self.autoregressive.cpu()
if slices == 0:
slices = 1
elif max_chunk_size is not None and chunk_size > max_chunk_size:
diffusion_conds = []
chunks = []
concat = torch.cat(samples, dim=-1)
chunk_size = concat.shape[-1]
if slices == 0:
slices = 1
else:
if max_chunk_size is not None and chunk_size > max_chunk_size:
slices = 1
while int(chunk_size / slices) > max_chunk_size:
slices = slices + 1
chunks = torch.chunk(concat, slices, dim=1)
chunk_size = chunks[0].shape[-1]
for chunk in tqdm(chunks, desc="Computing AR conditioning latents..."):
auto_conds.append(format_conditioning(chunk, device=device, sampling_rate=self.input_sample_rate, cond_length=chunk_size))
if original_diffusion:
samples = [resampler_24K(sample) for sample in voice_samples]
for sample in tqdm(samples, desc="Computing diffusion conditioning latents..."):
sample = pad_or_truncate(sample, 102400)
cond_mel = wav_to_univnet_mel(migrate_to_device(sample, device), do_normalization=False, device=self.device)
diffusion_conds.append(cond_mel)
else:
samples = [resampler_24K(sample) for sample in voice_samples]
for chunk in tqdm(chunks, desc="Computing diffusion conditioning latents..."):
check_for_kill_signal()
chunk = pad_or_truncate(chunk, chunk_size)
cond_mel = wav_to_univnet_mel(migrate_to_device( chunk, device ), do_normalization=False, device=device)
diffusion_conds.append(cond_mel)
auto_conds = torch.stack(auto_conds, dim=1)
self.autoregressive = migrate_to_device( self.autoregressive, device )
auto_latent = self.autoregressive.get_conditioning(auto_conds)
self.autoregressive = migrate_to_device( self.autoregressive, self.device if self.preloaded_tensors else 'cpu' )
chunks = torch.chunk(concat, slices, dim=1)
chunk_size = chunks[0].shape[-1]
# expand / truncate samples to match the common size
# required, as tensors need to be of the same length
for chunk in tqdm_override(chunks, verbose=verbose, progress=progress, desc="Computing conditioning latents..."):
chunk = pad_or_truncate(chunk, chunk_size)
cond_mel = wav_to_univnet_mel(chunk.to(device), do_normalization=False, device=device)
diffusion_conds.append(cond_mel)
diffusion_conds = torch.stack(diffusion_conds, dim=1)
self.diffusion = migrate_to_device( self.diffusion, device )
self.diffusion = self.diffusion.to(device)
diffusion_latent = self.diffusion.get_conditioning(diffusion_conds)
self.diffusion = migrate_to_device( self.diffusion, self.device if self.preloaded_tensors else 'cpu' )
if self.preloaded_tensors:
self.diffusion = self.diffusion.to(self.device)
else:
self.diffusion = self.diffusion.cpu()
if return_mels:
return auto_latent, diffusion_latent, auto_conds, diffusion_conds
@ -621,15 +405,11 @@ class TextToSpeech:
settings.update(kwargs) # allow overriding of preset settings with kwargs
return self.tts(text, **settings)
@torch.inference_mode()
def tts(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True, use_deterministic_seed=None,
return_deterministic_state=False,
# autoregressive generation parameters follow
num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500,
sample_batch_size=None,
autoregressive_model=None,
diffusion_model=None,
tokenizer_json=None,
# CVVP parameters follow
cvvp_amount=.0,
# diffusion generation parameters follow
@ -637,6 +417,7 @@ class TextToSpeech:
diffusion_sampler="P",
breathing_room=8,
half_p=False,
progress=None,
**hf_generate_kwargs):
"""
Produces an audio clip of the given text being spoken with the given reference voice.
@ -691,24 +472,7 @@ class TextToSpeech:
self.diffusion.enable_fp16 = half_p
deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)
if autoregressive_model is None:
autoregressive_model = self.autoregressive_model_path
elif autoregressive_model != self.autoregressive_model_path:
self.load_autoregressive_model(autoregressive_model)
if diffusion_model is None:
diffusion_model = self.diffusion_model_path
elif diffusion_model != self.diffusion_model_path:
self.load_diffusion_model(diffusion_model)
if tokenizer_json is None:
tokenizer_json = self.tokenizer_json
elif tokenizer_json != self.tokenizer_json:
self.load_tokenizer_json(tokenizer_json)
text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0)
text_tokens = migrate_to_device( text_tokens, self.device )
text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device)
text_tokens = F.pad(text_tokens, (0, 1)) # This may not be necessary.
assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
@ -736,13 +500,12 @@ class TextToSpeech:
stop_mel_token = self.autoregressive.stop_mel_token
calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
self.autoregressive = migrate_to_device( self.autoregressive, self.device )
auto_conditioning = migrate_to_device( auto_conditioning, self.device )
text_tokens = migrate_to_device( text_tokens, self.device )
self.autoregressive = self.autoregressive.to(self.device)
auto_conditioning = auto_conditioning.to(self.device)
text_tokens = text_tokens.to(self.device)
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=half_p):
for b in tqdm(range(num_batches), desc="Generating autoregressive samples"):
check_for_kill_signal()
for b in tqdm_override(range(num_batches), verbose=verbose, progress=progress, desc="Generating autoregressive samples"):
codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
do_sample=True,
top_p=top_p,
@ -757,30 +520,24 @@ class TextToSpeech:
samples.append(codes)
if not self.preloaded_tensors:
self.autoregressive = migrate_to_device( self.autoregressive, 'cpu' )
if self.unsqueeze_sample_batches:
new_samples = []
for batch in samples:
for i in range(batch.shape[0]):
new_samples.append(batch[i].unsqueeze(0))
samples = new_samples
self.autoregressive = self.autoregressive.cpu()
auto_conditioning = auto_conditioning.cpu()
clip_results = []
if auto_conds is not None:
auto_conditioning = migrate_to_device( auto_conditioning, self.device )
auto_conds = auto_conds.to(self.device)
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=half_p):
if not self.preloaded_tensors:
self.autoregressive = migrate_to_device( self.autoregressive, 'cpu' )
self.clvp = migrate_to_device( self.clvp, self.device )
if not self.minor_optimizations:
self.autoregressive = self.autoregressive.cpu()
self.clvp = self.clvp.to(self.device)
if cvvp_amount > 0:
if self.cvvp is None:
self.load_cvvp()
if not self.preloaded_tensors:
self.cvvp = migrate_to_device( self.cvvp, self.device )
if not self.minor_optimizations:
self.cvvp = self.cvvp.to(self.device)
desc="Computing best candidates"
if verbose:
@ -789,9 +546,7 @@ class TextToSpeech:
else:
desc = f"Computing best candidates using CLVP {((1-cvvp_amount) * 100):2.0f}% and CVVP {(cvvp_amount * 100):2.0f}%"
for batch in tqdm(samples, desc=desc):
check_for_kill_signal()
for batch in tqdm_override(samples, verbose=verbose, progress=progress, desc=desc):
for i in range(batch.shape[0]):
batch[i] = fix_autoregressive_output(batch[i], stop_mel_token)
@ -811,31 +566,30 @@ class TextToSpeech:
clip_results.append(clvp)
if not self.preloaded_tensors and auto_conds is not None:
auto_conds = migrate_to_device( auto_conds, 'cpu' )
auto_conds = auto_conds.cpu()
clip_results = torch.cat(clip_results, dim=0)
samples = torch.cat(samples, dim=0)
if k < num_autoregressive_samples:
best_results = samples[torch.topk(clip_results, k=k).indices]
else:
best_results = samples
best_results = samples[torch.topk(clip_results, k=k).indices]
if not self.preloaded_tensors:
self.clvp = migrate_to_device( self.clvp, 'cpu' )
self.cvvp = migrate_to_device( self.cvvp, 'cpu' )
if get_device_name() == "dml":
text_tokens = migrate_to_device( text_tokens, 'cpu' )
best_results = migrate_to_device( best_results, 'cpu' )
auto_conditioning = migrate_to_device( auto_conditioning, 'cpu' )
self.autoregressive = migrate_to_device( self.autoregressive, 'cpu' )
else:
auto_conditioning = auto_conditioning.to(self.device)
self.autoregressive = self.autoregressive.to(self.device)
self.clvp = self.clvp.cpu()
if self.cvvp is not None:
self.cvvp = self.cvvp.cpu()
del samples
if get_device_name() == "dml":
text_tokens = text_tokens.cpu()
best_results = best_results.cpu()
auto_conditioning = auto_conditioning.cpu()
self.autoregressive = self.autoregressive.cpu()
else:
#text_tokens = text_tokens.to(self.device)
#best_results = best_results.to(self.device)
auto_conditioning = auto_conditioning.to(self.device)
self.autoregressive = self.autoregressive.to(self.device)
# The diffusion model actually wants the last hidden layer from the autoregressive model as conditioning
# inputs. Re-produce those for the top results. This could be made more efficient by storing all of these
# results, but will increase memory usage.
@ -844,19 +598,21 @@ class TextToSpeech:
torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
return_latent=True, clip_inputs=False)
diffusion_conditioning = migrate_to_device( diffusion_conditioning, self.device )
diffusion_conditioning = diffusion_conditioning.to(self.device)
if get_device_name() == "dml":
self.autoregressive = migrate_to_device( self.autoregressive, self.device )
best_results = migrate_to_device( best_results, self.device )
best_latents = migrate_to_device( best_latents, self.device )
self.vocoder = migrate_to_device( self.vocoder, 'cpu' )
self.autoregressive = self.autoregressive.to(self.device)
best_results = best_results.to(self.device)
best_latents = best_latents.to(self.device)
self.vocoder = self.vocoder.cpu()
else:
if not self.preloaded_tensors:
self.autoregressive = migrate_to_device( self.autoregressive, 'cpu' )
self.autoregressive = self.autoregressive.cpu()
self.diffusion = self.diffusion.to(self.device)
self.vocoder = self.vocoder.to(self.device)
self.diffusion = migrate_to_device( self.diffusion, self.device )
self.vocoder = migrate_to_device( self.vocoder, self.device )
del text_tokens
del auto_conditioning
@ -878,21 +634,19 @@ class TextToSpeech:
break
mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, diffusion_conditioning,
temperature=diffusion_temperature, desc="Transforming autoregressive outputs into audio..", sampler=diffusion_sampler,
temperature=diffusion_temperature, verbose=verbose, progress=progress, desc="Transforming autoregressive outputs into audio..", sampler=diffusion_sampler,
input_sample_rate=self.input_sample_rate, output_sample_rate=self.output_sample_rate)
wav = self.vocoder.inference(mel)
wav_candidates.append(wav)
if not self.preloaded_tensors:
self.diffusion = migrate_to_device( self.diffusion, 'cpu' )
self.vocoder = migrate_to_device( self.vocoder, 'cpu' )
self.diffusion = self.diffusion.cpu()
self.vocoder = self.vocoder.cpu()
def potentially_redact(clip, text):
if self.enable_redaction:
t = clip.squeeze(1)
t = migrate_to_device( t, 'cpu' if get_device_name() == "dml" else self.device)
return self.aligner.redact(t, text, self.output_sample_rate).unsqueeze(1)
return self.aligner.redact(clip.squeeze(1).to('cpu' if get_device_name() == "dml" else self.device), text, self.output_sample_rate).unsqueeze(1)
return clip
wav_candidates = [potentially_redact(wav_candidate, text) for wav_candidate in wav_candidates]
@ -901,7 +655,7 @@ class TextToSpeech:
else:
res = wav_candidates[0]
do_gc()
gc.collect()
if return_deterministic_state:
return res, (deterministic_seed, text, voice_samples, conditioning_latents)
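
The chunking step in `get_conditioning_latents` above picks the smallest slice count that keeps each chunk within `max_chunk_size`. A minimal pure-Python sketch of just that arithmetic follows; `compute_slices` is a hypothetical helper name used for illustration and does not exist in the repo, and the tensor handling (`torch.chunk`, `pad_or_truncate`) is left out.

```python
def compute_slices(total_length, slices=1, max_chunk_size=None):
    """Return how many chunks a waveform of `total_length` samples is split into.

    Mirrors the branch structure in the diff: a slice count of 0 is treated as 1,
    and an oversized input overrides the requested count.
    """
    if slices == 0:
        return 1
    if max_chunk_size is not None and total_length > max_chunk_size:
        # Start from one slice and grow until each chunk fits.
        slices = 1
        while int(total_length / slices) > max_chunk_size:
            slices += 1
    return slices


# Hypothetical sample counts, chosen only to exercise each branch.
oversized = compute_slices(100, max_chunk_size=30)
zero_requested = compute_slices(100, slices=0)
fits_already = compute_slices(10, slices=3, max_chunk_size=30)
```

Note that a caller-supplied `slices` value is only honoured when the concatenated audio already fits within `max_chunk_size`.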

View File

@ -14,7 +14,6 @@ if __name__ == '__main__':
parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random')
parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
parser.add_argument('--use_deepspeed', type=bool, help='Use deepspeed for speed bump.', default=True)
parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/')
parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this '
'should only be specified if you have custom checkpoints.', default=MODELS_DIR)
@ -38,8 +37,8 @@ if __name__ == '__main__':
os.makedirs(args.output_path, exist_ok=True)
#print(f'use_deepspeed do_tts_debug {use_deepspeed}')
tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed)
tts = TextToSpeech(models_dir=args.model_dir)
selected_voices = args.voice.split(',')
for k, selected_voice in enumerate(selected_voices):
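
The `--use_deepspeed` option in this hunk is declared with `type=bool`. With `argparse`, that calls `bool()` on the raw string, so any non-empty value (including `False`) parses as `True`. The sketch below demonstrates the behaviour and one possible alternative; `str_to_bool` is a hypothetical helper, not something either side of this diff defines.

```python
import argparse


def str_to_bool(value):
    """Parse common true/false spellings (hypothetical helper, not in the repo)."""
    lowered = value.strip().lower()
    if lowered in ("1", "true", "yes", "y", "on"):
        return True
    if lowered in ("0", "false", "no", "n", "off"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# Same declaration style as the diff: bool("False") is True.
naive = argparse.ArgumentParser()
naive.add_argument("--use_deepspeed", type=bool, default=True)

# Alternative declaration that actually inspects the string.
strict = argparse.ArgumentParser()
strict.add_argument("--use_deepspeed", type=str_to_bool, default=True)

naive_args = naive.parse_args(["--use_deepspeed", "False"])
strict_args = strict.parse_args(["--use_deepspeed", "False"])
```

A flag built with `action="store_true"` (or `argparse.BooleanOptionalAction` on Python 3.9+) would be another way to avoid the problem.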

View File

@ -1,120 +0,0 @@
# Implementation adapted from https://github.com/EdwardDixon/snake under the MIT license.
# LICENSE is in incl_licenses directory.
import torch
from torch import nn, sin, pow
from torch.nn import Parameter
class Snake(nn.Module):
'''
Implementation of a sine-based periodic activation function
Shape:
- Input: (B, C, T)
- Output: (B, C, T), same shape as the input
Parameters:
- alpha - trainable parameter
References:
- This activation function is from this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
https://arxiv.org/abs/2006.08195
Examples:
>>> a1 = Snake(256)
>>> x = torch.randn(1, 256, 100)  # (B, C, T)
>>> x = a1(x)
'''
def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
'''
Initialization.
INPUT:
- in_features: number of input channels (C)
- alpha: trainable parameter
alpha is initialized to 1 by default, higher values = higher-frequency.
alpha will be trained along with the rest of your model.
'''
super(Snake, self).__init__()
self.in_features = in_features
# initialize alpha
self.alpha_logscale = alpha_logscale
if self.alpha_logscale: # log scale alphas initialized to zeros
self.alpha = Parameter(torch.zeros(in_features) * alpha)
else: # linear scale alphas initialized to ones
self.alpha = Parameter(torch.ones(in_features) * alpha)
self.alpha.requires_grad = alpha_trainable
self.no_div_by_zero = 0.000000001
def forward(self, x):
'''
Forward pass of the function.
Applies the function to the input elementwise.
Snake = x + 1/a * sin^2 (xa)
'''
alpha = self.alpha.unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
if self.alpha_logscale:
alpha = torch.exp(alpha)
x = x + (1.0 / (alpha + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
return x
class SnakeBeta(nn.Module):
'''
A modified Snake function which uses separate parameters for the magnitude of the periodic components
Shape:
- Input: (B, C, T)
- Output: (B, C, T), same shape as the input
Parameters:
- alpha - trainable parameter that controls frequency
- beta - trainable parameter that controls magnitude
References:
- This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
https://arxiv.org/abs/2006.08195
Examples:
>>> a1 = SnakeBeta(256)
>>> x = torch.randn(1, 256, 100)  # (B, C, T)
>>> x = a1(x)
'''
def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
'''
Initialization.
INPUT:
- in_features: number of input channels (C)
- alpha - trainable parameter that controls frequency
- beta - trainable parameter that controls magnitude
alpha is initialized to 1 by default, higher values = higher-frequency.
beta is initialized to 1 by default, higher values = higher-magnitude.
alpha will be trained along with the rest of your model.
'''
super(SnakeBeta, self).__init__()
self.in_features = in_features
# initialize alpha
self.alpha_logscale = alpha_logscale
if self.alpha_logscale: # log scale alphas initialized to zeros
self.alpha = Parameter(torch.zeros(in_features) * alpha)
self.beta = Parameter(torch.zeros(in_features) * alpha)
else: # linear scale alphas initialized to ones
self.alpha = Parameter(torch.ones(in_features) * alpha)
self.beta = Parameter(torch.ones(in_features) * alpha)
self.alpha.requires_grad = alpha_trainable
self.beta.requires_grad = alpha_trainable
self.no_div_by_zero = 0.000000001
def forward(self, x):
'''
Forward pass of the function.
Applies the function to the input elementwise.
SnakeBeta = x + 1/b * sin^2 (xa)
'''
alpha = self.alpha.unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
beta = self.beta.unsqueeze(0).unsqueeze(-1)
if self.alpha_logscale:
alpha = torch.exp(alpha)
beta = torch.exp(beta)
x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
return x
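
The two activations in this file differ only in which parameter scales the periodic term. As a scalar illustration (plain `math` instead of `torch`, so no per-channel parameters or log-scale option), the formulas from the docstrings can be sketched as below. The function names and the `eps` argument are illustrative stand-ins for `forward` and `no_div_by_zero`.

```python
import math


def snake(x, alpha=1.0, eps=1e-9):
    """Snake(x) = x + (1/alpha) * sin^2(alpha * x), scalar version."""
    return x + (1.0 / (alpha + eps)) * math.sin(x * alpha) ** 2


def snake_beta(x, alpha=1.0, beta=1.0, eps=1e-9):
    """SnakeBeta(x) = x + (1/beta) * sin^2(alpha * x): alpha sets frequency, beta sets magnitude."""
    return x + (1.0 / (beta + eps)) * math.sin(x * alpha) ** 2


at_zero = snake(0.0)
at_quarter_period = snake(math.pi / 2)  # sin^2 term peaks here when alpha = 1
```

When `beta` equals `alpha`, `snake_beta` reduces to `snake`, which is why the two classes share their initialisation logic.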

View File

@ -1,6 +0,0 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.
from .filter import *
from .resample import *
from .act import *

View File

@ -1,28 +0,0 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.
import torch.nn as nn
from .resample import UpSample1d, DownSample1d
class Activation1d(nn.Module):
def __init__(self,
activation,
up_ratio: int = 2,
down_ratio: int = 2,
up_kernel_size: int = 12,
down_kernel_size: int = 12):
super().__init__()
self.up_ratio = up_ratio
self.down_ratio = down_ratio
self.act = activation
self.upsample = UpSample1d(up_ratio, up_kernel_size)
self.downsample = DownSample1d(down_ratio, down_kernel_size)
# x: [B,C,T]
def forward(self, x):
x = self.upsample(x)
x = self.act(x)
x = self.downsample(x)
return x

View File

@ -1,95 +0,0 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
if 'sinc' in dir(torch):
sinc = torch.sinc
else:
# This code is adapted from adefossez's julius.core.sinc under the MIT License
# https://adefossez.github.io/julius/julius/core.html
# LICENSE is in incl_licenses directory.
def sinc(x: torch.Tensor):
"""
Implementation of sinc, i.e. sin(pi * x) / (pi * x)
__Warning__: Different to julius.sinc, the input is multiplied by `pi`!
"""
return torch.where(x == 0,
torch.tensor(1., device=x.device, dtype=x.dtype),
torch.sin(math.pi * x) / math.pi / x)
# This code is adapted from adefossez's julius.lowpass.LowPassFilters under the MIT License
# https://adefossez.github.io/julius/julius/lowpass.html
# LICENSE is in incl_licenses directory.
def kaiser_sinc_filter1d(cutoff, half_width, kernel_size): # return filter [1,1,kernel_size]
even = (kernel_size % 2 == 0)
half_size = kernel_size // 2
#For kaiser window
delta_f = 4 * half_width
A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
if A > 50.:
beta = 0.1102 * (A - 8.7)
elif A >= 21.:
beta = 0.5842 * (A - 21)**0.4 + 0.07886 * (A - 21.)
else:
beta = 0.
window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
# ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
if even:
time = (torch.arange(-half_size, half_size) + 0.5)
else:
time = torch.arange(kernel_size) - half_size
if cutoff == 0:
filter_ = torch.zeros_like(time)
else:
filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
# Normalize filter to have sum = 1, otherwise we will have a small leakage
# of the constant component in the input signal.
filter_ /= filter_.sum()
filter = filter_.view(1, 1, kernel_size)
return filter
class LowPassFilter1d(nn.Module):
def __init__(self,
cutoff=0.5,
half_width=0.6,
stride: int = 1,
padding: bool = True,
padding_mode: str = 'replicate',
kernel_size: int = 12):
# kernel_size should be even number for stylegan3 setup,
# in this implementation, odd number is also possible.
super().__init__()
if cutoff < 0.:
raise ValueError("Cutoff must be non-negative.")
if cutoff > 0.5:
raise ValueError("A cutoff above 0.5 does not make sense.")
self.kernel_size = kernel_size
self.even = (kernel_size % 2 == 0)
self.pad_left = kernel_size // 2 - int(self.even)
self.pad_right = kernel_size // 2
self.stride = stride
self.padding = padding
self.padding_mode = padding_mode
filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
self.register_buffer("filter", filter)
#input [B, C, T]
def forward(self, x):
_, C, _ = x.shape
if self.padding:
x = F.pad(x, (self.pad_left, self.pad_right),
mode=self.padding_mode)
out = F.conv1d(x, self.filter.expand(C, -1, -1),
stride=self.stride, groups=C)
return out
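
`kaiser_sinc_filter1d` above derives the Kaiser window's `beta` from the kernel size and transition half-width before building the filter. The sketch below isolates that step in plain Python so the three attenuation branches are easier to follow. `kaiser_beta` is a hypothetical name and the sample arguments are illustrative; the constants are copied from the diff.

```python
import math


def kaiser_beta(half_width, kernel_size):
    """Return the Kaiser window beta used for a given half-width and kernel size."""
    half_size = kernel_size // 2
    delta_f = 4 * half_width
    # Estimated stop-band attenuation (called `A` in the original code).
    attenuation = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
    if attenuation > 50.0:
        return 0.1102 * (attenuation - 8.7)
    if attenuation >= 21.0:
        return 0.5842 * (attenuation - 21.0) ** 0.4 + 0.07886 * (attenuation - 21.0)
    return 0.0


# half_width = 0.6 / ratio with ratio = 2, kernel_size = 12, as in the resampling defaults.
beta_default = kaiser_beta(0.3, 12)
# A zero half-width gives low attenuation and falls through to a rectangular window.
beta_rectangular = kaiser_beta(0.0, 12)
```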

View File

@ -1,49 +0,0 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.
import torch.nn as nn
from torch.nn import functional as F
from .filter import LowPassFilter1d
from .filter import kaiser_sinc_filter1d
class UpSample1d(nn.Module):
def __init__(self, ratio=2, kernel_size=None):
super().__init__()
self.ratio = ratio
self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
self.stride = ratio
self.pad = self.kernel_size // ratio - 1
self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
self.pad_right = self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
filter = kaiser_sinc_filter1d(cutoff=0.5 / ratio,
half_width=0.6 / ratio,
kernel_size=self.kernel_size)
self.register_buffer("filter", filter)
# x: [B, C, T]
def forward(self, x):
_, C, _ = x.shape
x = F.pad(x, (self.pad, self.pad), mode='replicate')
x = self.ratio * F.conv_transpose1d(
x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)
x = x[..., self.pad_left:-self.pad_right]
return x
class DownSample1d(nn.Module):
def __init__(self, ratio=2, kernel_size=None):
super().__init__()
self.ratio = ratio
self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
self.lowpass = LowPassFilter1d(cutoff=0.5 / ratio,
half_width=0.6 / ratio,
stride=ratio,
kernel_size=self.kernel_size)
def forward(self, x):
xx = self.lowpass(x)
return xx
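
The padding constants in `UpSample1d` are chosen so that, after replicate padding, the transposed convolution and the final crop, the output is exactly `ratio` times the input length. A small arithmetic sketch of that bookkeeping follows. It assumes the standard `conv_transpose1d` output length `(L - 1) * stride + kernel_size` for zero padding and unit dilation; `upsample_output_length` is a hypothetical helper, not part of the module.

```python
def upsample_output_length(input_length, ratio=2, kernel_size=None):
    """Compute the length UpSample1d.forward would return for a given input length."""
    if kernel_size is None:
        kernel_size = int(6 * ratio // 2) * 2
    stride = ratio
    pad = kernel_size // ratio - 1
    pad_left = pad * stride + (kernel_size - stride) // 2
    pad_right = pad * stride + (kernel_size - stride + 1) // 2

    padded = input_length + 2 * pad                   # F.pad(x, (pad, pad))
    transposed = (padded - 1) * stride + kernel_size  # conv_transpose1d output length
    return transposed - pad_left - pad_right          # x[..., pad_left:-pad_right]


doubled = upsample_output_length(100, ratio=2)
quadrupled = upsample_output_length(100, ratio=4)
```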

View File

@ -11,7 +11,6 @@ from tortoise.utils.typical_sampling import TypicalLogitsWarper
from tortoise.utils.device import get_device_count
import tortoise.utils.torch_intermediary as ml
def null_position_embeddings(range, dim):
return torch.zeros((range.shape[0], range.shape[1], dim), device=range.device)
@ -222,8 +221,7 @@ class ConditioningEncoder(nn.Module):
class LearnedPositionEmbeddings(nn.Module):
def __init__(self, seq_len, model_dim, init=.02):
super().__init__()
# ml.Embedding
self.emb = ml.Embedding(seq_len, model_dim)
self.emb = nn.Embedding(seq_len, model_dim)
# Initializing this way is standard for GPT-2
self.emb.weight.data.normal_(mean=0.0, std=init)
@ -283,9 +281,9 @@ class MelEncoder(nn.Module):
class UnifiedVoice(nn.Module):
def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=120, max_prompt_tokens=2, max_mel_tokens=250, max_conditioning_inputs=1,
def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=120, max_mel_tokens=250, max_conditioning_inputs=1,
mel_length_compression=1024, number_text_tokens=256,
start_text_token=None, stop_text_token=0, number_mel_codes=8194, start_mel_token=8192,
start_text_token=None, number_mel_codes=8194, start_mel_token=8192,
stop_mel_token=8193, train_solo_embeddings=False, use_mel_codes_as_input=True,
checkpointing=True, types=1):
"""
@ -295,7 +293,6 @@ class UnifiedVoice(nn.Module):
heads: Number of transformer heads. model_dim must be divisible by heads. Recommend model_dim//64
max_text_tokens: Maximum number of text tokens that will be encountered by model.
max_mel_tokens: Maximum number of MEL tokens that will be encountered by model.
max_prompt_tokens: compat set to 2, 70 for XTTS
max_conditioning_inputs: Maximum number of conditioning inputs provided to the model. If (1), conditioning input can be of format (b,80,s), otherwise (b,n,80,s).
mel_length_compression: The factor between <number_input_samples> and <mel_tokens>. Used to compute MEL code padding given wav input length.
number_text_tokens:
@ -312,7 +309,7 @@ class UnifiedVoice(nn.Module):
self.number_text_tokens = number_text_tokens
self.start_text_token = number_text_tokens * types if start_text_token is None else start_text_token
self.stop_text_token = stop_text_token
self.stop_text_token = 0
self.number_mel_codes = number_mel_codes
self.start_mel_token = start_mel_token
self.stop_mel_token = stop_mel_token
@ -320,16 +317,13 @@ class UnifiedVoice(nn.Module):
self.heads = heads
self.max_mel_tokens = max_mel_tokens
self.max_text_tokens = max_text_tokens
self.max_prompt_tokens = max_prompt_tokens
self.model_dim = model_dim
self.max_conditioning_inputs = max_conditioning_inputs
self.mel_length_compression = mel_length_compression
self.conditioning_encoder = ConditioningEncoder(80, model_dim, num_attn_heads=heads)
# ml.Embedding
self.text_embedding = ml.Embedding(self.number_text_tokens*types+1, model_dim)
self.text_embedding = nn.Embedding(self.number_text_tokens*types+1, model_dim)
if use_mel_codes_as_input:
# ml.Embedding
self.mel_embedding = ml.Embedding(self.number_mel_codes, model_dim)
self.mel_embedding = nn.Embedding(self.number_mel_codes, model_dim)
else:
self.mel_embedding = MelEncoder(model_dim, resblocks_per_reduction=1)
self.gpt, self.mel_pos_embedding, self.text_pos_embedding, self.mel_layer_pos_embedding, self.text_layer_pos_embedding = \
@ -342,10 +336,8 @@ class UnifiedVoice(nn.Module):
self.text_solo_embedding = 0
self.final_norm = nn.LayerNorm(model_dim)
# nn.Linear
self.text_head = ml.Linear(model_dim, self.number_text_tokens*types+1)
# nn.Linear
self.mel_head = ml.Linear(model_dim, self.number_mel_codes)
self.text_head = nn.Linear(model_dim, self.number_text_tokens*types+1)
self.mel_head = nn.Linear(model_dim, self.number_mel_codes)
# Initialize the embeddings per the GPT-2 scheme
embeddings = [self.text_embedding]
@ -354,8 +346,8 @@ class UnifiedVoice(nn.Module):
for module in embeddings:
module.weight.data.normal_(mean=0.0, std=.02)
def post_init_gpt2_config(self, use_deepspeed=False, kv_cache=False):
seq_length = self.max_mel_tokens + self.max_text_tokens + self.max_prompt_tokens
def post_init_gpt2_config(self, kv_cache=False):
seq_length = self.max_mel_tokens + self.max_text_tokens + 2
gpt_config = GPT2Config(vocab_size=self.max_mel_tokens,
n_positions=seq_length,
n_ctx=seq_length,
@ -365,17 +357,6 @@ class UnifiedVoice(nn.Module):
gradient_checkpointing=False,
use_cache=True)
self.inference_model = GPT2InferenceModel(gpt_config, self.gpt, self.mel_pos_embedding, self.mel_embedding, self.final_norm, self.mel_head, kv_cache=kv_cache)
#print(f'use_deepspeed autoregressive_debug {use_deepspeed}')
if use_deepspeed and torch.cuda.is_available():
import deepspeed
self.ds_engine = deepspeed.init_inference(model=self.inference_model,
mp_size=1,
replace_with_kernel_inject=True,
dtype=torch.float32)
self.inference_model = self.ds_engine.module.eval()
else:
self.inference_model = self.inference_model.eval()
self.gpt.wte = self.mel_embedding
def build_aligned_inputs_and_targets(self, input, start_token, stop_token):
@ -496,9 +477,9 @@ class UnifiedVoice(nn.Module):
def inference_speech(self, speech_conditioning_latent, text_inputs, input_tokens=None, num_return_sequences=1,
max_generate_length=None, typical_sampling=False, typical_mass=.9, **hf_generate_kwargs):
seq_length = self.max_mel_tokens + self.max_text_tokens + self.max_prompt_tokens
seq_length = self.max_mel_tokens + self.max_text_tokens + 2
if not hasattr(self, 'inference_model'):
self.post_init_gpt2_config(kv_cache=self.kv_cache)
self.post_init_gpt2_config(kv_cache=self.kv_cachepost_init_gpt2_config)
text_inputs = F.pad(text_inputs, (0, 1), value=self.stop_text_token)

@ -1,485 +0,0 @@
# Copyright (c) 2022 NVIDIA CORPORATION.
# Licensed under the MIT license.
# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
# LICENSE is in incl_licenses directory.
import json
import os
import torch, torch.utils.data
import tortoise.models.activations as activations
from torch.nn import Conv1d, ConvTranspose1d, Conv2d
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
from tortoise.models.alias_free_torch import *
from librosa.filters import mel as librosa_mel_fn
LRELU_SLOPE = 0.1
class AMPBlock1(torch.nn.Module):
def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5), activation=None):
super(AMPBlock1, self).__init__()
self.h = h
self.convs1 = nn.ModuleList([
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
padding=get_padding(kernel_size, dilation[0]))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
padding=get_padding(kernel_size, dilation[1]))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
padding=get_padding(kernel_size, dilation[2])))
])
self.convs1.apply(init_weights)
self.convs2 = nn.ModuleList([
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
padding=get_padding(kernel_size, 1))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
padding=get_padding(kernel_size, 1))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
padding=get_padding(kernel_size, 1)))
])
self.convs2.apply(init_weights)
self.num_layers = len(self.convs1) + len(self.convs2) # total number of conv layers
if activation == 'snake': # periodic nonlinearity with snake function and anti-aliasing
self.activations = nn.ModuleList([
Activation1d(
activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
for _ in range(self.num_layers)
])
elif activation == 'snakebeta': # periodic nonlinearity with snakebeta function and anti-aliasing
self.activations = nn.ModuleList([
Activation1d(
activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
for _ in range(self.num_layers)
])
else:
raise NotImplementedError(
"activation incorrectly specified. check the config file and look for 'activation'.")
def forward(self, x):
acts1, acts2 = self.activations[::2], self.activations[1::2]
for c1, c2, a1, a2 in zip(self.convs1, self.convs2, acts1, acts2):
xt = a1(x)
xt = c1(xt)
xt = a2(xt)
xt = c2(xt)
x = xt + x
return x
def remove_weight_norm(self):
for l in self.convs1:
remove_weight_norm(l)
for l in self.convs2:
remove_weight_norm(l)
class AMPBlock2(torch.nn.Module):
def __init__(self, h, channels, kernel_size=3, dilation=(1, 3), activation=None):
super(AMPBlock2, self).__init__()
self.h = h
self.convs = nn.ModuleList([
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
padding=get_padding(kernel_size, dilation[0]))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
padding=get_padding(kernel_size, dilation[1])))
])
self.convs.apply(init_weights)
self.num_layers = len(self.convs) # total number of conv layers
if activation == 'snake': # periodic nonlinearity with snake function and anti-aliasing
self.activations = nn.ModuleList([
Activation1d(
activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
for _ in range(self.num_layers)
])
elif activation == 'snakebeta': # periodic nonlinearity with snakebeta function and anti-aliasing
self.activations = nn.ModuleList([
Activation1d(
activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
for _ in range(self.num_layers)
])
else:
raise NotImplementedError(
"activation incorrectly specified. check the config file and look for 'activation'.")
def forward(self, x):
for c, a in zip(self.convs, self.activations):
xt = a(x)
xt = c(xt)
x = xt + x
return x
def remove_weight_norm(self):
for l in self.convs:
remove_weight_norm(l)
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
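Review note: `AttrDict` works by making the instance's attribute dictionary the dict itself, so keys and attributes share storage. A minimal sketch of that behaviour follows; the config keys and values are made up for illustration and are not taken from any real `config.json`.

```python
import json


class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Attribute access and item access now read and write the same mapping.
        self.__dict__ = self


# Hypothetical config values, for illustration only.
h = AttrDict(json.loads('{"num_mels": 100, "hop_size": 256}'))
h.n_fft = 1024  # setting an attribute also creates the key

print(h.num_mels, h["n_fft"], sorted(h.keys()))
```

Nested dictionaries are not converted by this recipe, so a nested value would still need item access.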
class BigVGAN(nn.Module):
# this is our main BigVGAN model. Applies anti-aliased periodic activation for resblocks.
def __init__(self, config=None, data=None):
super(BigVGAN, self).__init__()
"""
with open(os.path.join(os.path.dirname(__file__), 'config.json'), 'r') as f:
data = f.read()
"""
if config and data is None:
with open(config, 'r') as f:
data = f.read()
jsonConfig = json.loads(data)
elif data is not None:
if isinstance(data, str):
jsonConfig = json.loads(data)
else:
jsonConfig = data
else:
raise Exception("no config specified")
global h
h = AttrDict(jsonConfig)
self.mel_channel = h.num_mels
self.noise_dim = h.n_fft
self.hop_length = h.hop_size
self.num_kernels = len(h.resblock_kernel_sizes)
self.num_upsamples = len(h.upsample_rates)
# pre conv
self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3))
# define which AMPBlock to use. BigVGAN uses AMPBlock1 as default
resblock = AMPBlock1 if h.resblock == '1' else AMPBlock2
# transposed conv-based upsamplers. does not apply anti-aliasing
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
self.ups.append(nn.ModuleList([
weight_norm(ConvTranspose1d(h.upsample_initial_channel // (2 ** i),
h.upsample_initial_channel // (2 ** (i + 1)),
k, u, padding=(k - u) // 2))
]))
# residual blocks using anti-aliased multi-periodicity composition modules (AMP)
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = h.upsample_initial_channel // (2 ** (i + 1))
for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
self.resblocks.append(resblock(h, ch, k, d, activation=h.activation))
# post conv
if h.activation == "snake": # periodic nonlinearity with snake function and anti-aliasing
activation_post = activations.Snake(ch, alpha_logscale=h.snake_logscale)
self.activation_post = Activation1d(activation=activation_post)
elif h.activation == "snakebeta": # periodic nonlinearity with snakebeta function and anti-aliasing
activation_post = activations.SnakeBeta(ch, alpha_logscale=h.snake_logscale)
self.activation_post = Activation1d(activation=activation_post)
else:
raise NotImplementedError(
"activation incorrectly specified. check the config file and look for 'activation'.")
self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
# weight initialization
for i in range(len(self.ups)):
self.ups[i].apply(init_weights)
self.conv_post.apply(init_weights)
def forward(self,x, c):
# pre conv
x = self.conv_pre(x)
for i in range(self.num_upsamples):
# upsampling
for i_up in range(len(self.ups[i])):
x = self.ups[i][i_up](x)
# AMP blocks
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i * self.num_kernels + j](x)
else:
xs += self.resblocks[i * self.num_kernels + j](x)
x = xs / self.num_kernels
# post conv
x = self.activation_post(x)
x = self.conv_post(x)
x = torch.tanh(x)
return x
def remove_weight_norm(self):
print('Removing weight norm...')
for l in self.ups:
for l_i in l:
remove_weight_norm(l_i)
for l in self.resblocks:
l.remove_weight_norm()
remove_weight_norm(self.conv_pre)
remove_weight_norm(self.conv_post)
def inference(self, c, z=None):
# pad input mel with zeros to cut artifact
# see https://github.com/seungwonpark/melgan/issues/8
zero = torch.full((c.shape[0], h.num_mels, 10), -11.5129).to(c.device)
mel = torch.cat((c, zero), dim=2)
if z is None:
z = torch.randn(c.shape[0], self.noise_dim, mel.size(2)).to(mel.device)
audio = self.forward(mel, z)
audio = audio[:, :, :-(self.hop_length * 10)]
audio = audio.clamp(min=-1, max=1)
return audio
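Review note: the constant `-11.5129` used for the padding frames in `inference` matches the natural log of `1e-5`, which reads like a log-mel silence floor (that interpretation is an assumption). The sketch below checks the constant and the trim arithmetic, assuming the generator emits exactly `hop_length` samples per mel frame; `pad_value` and `trimmed_length` are hypothetical helper names.

```python
import math

PAD_FRAMES = 10  # number of padding frames appended in inference()


def pad_value(floor=1e-5):
    # Natural log of a small amplitude floor.
    return math.log(floor)


def trimmed_length(mel_frames, hop_length, pad_frames=PAD_FRAMES):
    # Assumption: one mel frame becomes hop_length audio samples.
    generated = (mel_frames + pad_frames) * hop_length
    # inference() drops the samples produced by the padding frames.
    return generated - pad_frames * hop_length


print(round(pad_value(), 4))
print(trimmed_length(10, 256))
```

Under that assumption the trimmed output has the same length as if no padding had been added.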
def eval(self, inference=False):
super(BigVGAN, self).eval()
# don't remove weight norm while validation in training loop
if inference:
self.remove_weight_norm()
class DiscriminatorP(nn.Module):
def __init__(self, h, period, kernel_size=5, stride=3, use_spectral_norm=False):
super(DiscriminatorP, self).__init__()
self.period = period
self.d_mult = h.discriminator_channel_mult
norm_f = weight_norm if use_spectral_norm == False else spectral_norm
self.convs = nn.ModuleList([
norm_f(Conv2d(1, int(32 * self.d_mult), (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
norm_f(Conv2d(int(32 * self.d_mult), int(128 * self.d_mult), (kernel_size, 1), (stride, 1),
padding=(get_padding(5, 1), 0))),
norm_f(Conv2d(int(128 * self.d_mult), int(512 * self.d_mult), (kernel_size, 1), (stride, 1),
padding=(get_padding(5, 1), 0))),
norm_f(Conv2d(int(512 * self.d_mult), int(1024 * self.d_mult), (kernel_size, 1), (stride, 1),
padding=(get_padding(5, 1), 0))),
norm_f(Conv2d(int(1024 * self.d_mult), int(1024 * self.d_mult), (kernel_size, 1), 1, padding=(2, 0))),
])
self.conv_post = norm_f(Conv2d(int(1024 * self.d_mult), 1, (3, 1), 1, padding=(1, 0)))
def forward(self, x):
fmap = []
# 1d to 2d
b, c, t = x.shape
if t % self.period != 0: # pad first
n_pad = self.period - (t % self.period)
x = F.pad(x, (0, n_pad), "reflect")
t = t + n_pad
x = x.view(b, c, t // self.period, self.period)
for l in self.convs:
x = l(x)
x = F.leaky_relu(x, LRELU_SLOPE)
fmap.append(x)
x = self.conv_post(x)
fmap.append(x)
x = torch.flatten(x, 1, -1)
return x, fmap
class MultiPeriodDiscriminator(nn.Module):
def __init__(self, h):
super(MultiPeriodDiscriminator, self).__init__()
self.mpd_reshapes = h.mpd_reshapes
print("mpd_reshapes: {}".format(self.mpd_reshapes))
discriminators = [DiscriminatorP(h, rs, use_spectral_norm=h.use_spectral_norm) for rs in self.mpd_reshapes]
self.discriminators = nn.ModuleList(discriminators)
def forward(self, y, y_hat):
y_d_rs = []
y_d_gs = []
fmap_rs = []
fmap_gs = []
for i, d in enumerate(self.discriminators):
y_d_r, fmap_r = d(y)
y_d_g, fmap_g = d(y_hat)
y_d_rs.append(y_d_r)
fmap_rs.append(fmap_r)
y_d_gs.append(y_d_g)
fmap_gs.append(fmap_g)
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
class DiscriminatorR(nn.Module):
def __init__(self, cfg, resolution):
super().__init__()
self.resolution = resolution
assert len(self.resolution) == 3, \
"MRD layer requires list with len=3, got {}".format(self.resolution)
self.lrelu_slope = LRELU_SLOPE
norm_f = weight_norm if cfg.use_spectral_norm == False else spectral_norm
if hasattr(cfg, "mrd_use_spectral_norm"):
print("INFO: overriding MRD use_spectral_norm as {}".format(cfg.mrd_use_spectral_norm))
norm_f = weight_norm if cfg.mrd_use_spectral_norm == False else spectral_norm
self.d_mult = cfg.discriminator_channel_mult
if hasattr(cfg, "mrd_channel_mult"):
print("INFO: overriding mrd channel multiplier as {}".format(cfg.mrd_channel_mult))
self.d_mult = cfg.mrd_channel_mult
self.convs = nn.ModuleList([
norm_f(nn.Conv2d(1, int(32 * self.d_mult), (3, 9), padding=(1, 4))),
norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 3), padding=(1, 1))),
])
self.conv_post = norm_f(nn.Conv2d(int(32 * self.d_mult), 1, (3, 3), padding=(1, 1)))
def forward(self, x):
fmap = []
x = self.spectrogram(x)
x = x.unsqueeze(1)
for l in self.convs:
x = l(x)
x = F.leaky_relu(x, self.lrelu_slope)
fmap.append(x)
x = self.conv_post(x)
fmap.append(x)
x = torch.flatten(x, 1, -1)
return x, fmap
def spectrogram(self, x):
n_fft, hop_length, win_length = self.resolution
x = F.pad(x, (int((n_fft - hop_length) / 2), int((n_fft - hop_length) / 2)), mode='reflect')
x = x.squeeze(1)
x = torch.stft(x, n_fft=n_fft, hop_length=hop_length, win_length=win_length, center=False, return_complex=True)
x = torch.view_as_real(x) # [B, F, TT, 2]
mag = torch.norm(x, p=2, dim=-1) # [B, F, TT]
return mag
class MultiResolutionDiscriminator(nn.Module):
def __init__(self, cfg, debug=False):
super().__init__()
self.resolutions = cfg.resolutions
assert len(self.resolutions) == 3, \
"MRD requires list of list with len=3, each element having a list with len=3. got {}". \
format(self.resolutions)
self.discriminators = nn.ModuleList(
[DiscriminatorR(cfg, resolution) for resolution in self.resolutions]
)
def forward(self, y, y_hat):
y_d_rs = []
y_d_gs = []
fmap_rs = []
fmap_gs = []
for i, d in enumerate(self.discriminators):
y_d_r, fmap_r = d(x=y)
y_d_g, fmap_g = d(x=y_hat)
y_d_rs.append(y_d_r)
fmap_rs.append(fmap_r)
y_d_gs.append(y_d_g)
fmap_gs.append(fmap_g)
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
def get_mel(x):
return mel_spectrogram(x, h.n_fft, h.num_mels, h.sampling_rate, h.hop_size, h.win_size, h.fmin, h.fmax)
def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
if torch.min(y) < -1.:
print('min value is ', torch.min(y))
if torch.max(y) > 1.:
print('max value is ', torch.max(y))
global mel_basis, hann_window
if fmax not in mel_basis:
mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
mel_basis[str(fmax)+'_'+str(y.device)] = torch.from_numpy(mel).float().to(y.device)
hann_window[str(y.device)] = torch.hann_window(win_size).to(y.device)
y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
y = y.squeeze(1)
# complex tensor as default, then use view_as_real for future pytorch compatibility
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=True)
spec = torch.view_as_real(spec)
spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))
spec = torch.matmul(mel_basis[str(fmax)+'_'+str(y.device)], spec)
spec = torch.nn.utils.spectral_normalize_torch(spec)
return spec
def feature_loss(fmap_r, fmap_g):
loss = 0
for dr, dg in zip(fmap_r, fmap_g):
for rl, gl in zip(dr, dg):
loss += torch.mean(torch.abs(rl - gl))
return loss * 2
def init_weights(m, mean=0.0, std=0.01):
classname = m.__class__.__name__
if classname.find("Conv") != -1:
m.weight.data.normal_(mean, std)
def get_padding(kernel_size, dilation=1):
return int((kernel_size * dilation - dilation) / 2)
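Review note: `get_padding` returns half the extent of the dilated kernel, which keeps the sequence length unchanged for stride-1 convolutions with odd kernel sizes. The sketch below checks this with the output-length formula documented for `torch.nn.Conv1d`; `conv1d_out_length` and `preserves_length` are hypothetical helpers written for this check and do not exist in the repository.

```python
def get_padding(kernel_size, dilation=1):
    # Same formula as in the diff.
    return int((kernel_size * dilation - dilation) / 2)


def conv1d_out_length(length, kernel_size, dilation=1, stride=1, padding=0):
    # Output-length formula from the torch.nn.Conv1d documentation.
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def preserves_length(length, kernel_size, dilation):
    padding = get_padding(kernel_size, dilation)
    return conv1d_out_length(length, kernel_size, dilation, padding=padding) == length


for k, d in [(3, 1), (3, 5), (7, 3), (4, 1)]:
    print(k, d, preserves_length(100, k, d))
```

The last case uses an even kernel size, where the padding is rounded down and the output is one sample short.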
def discriminator_loss(disc_real_outputs, disc_generated_outputs):
loss = 0
r_losses = []
g_losses = []
for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
r_loss = torch.mean((1 - dr) ** 2)
g_loss = torch.mean(dg ** 2)
loss += (r_loss + g_loss)
r_losses.append(r_loss.item())
g_losses.append(g_loss.item())
return loss, r_losses, g_losses
def generator_loss(disc_outputs):
loss = 0
gen_losses = []
for dg in disc_outputs:
l = torch.mean((1 - dg) ** 2)
gen_losses.append(l)
loss += l
return loss, gen_losses
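Review note: `discriminator_loss` and `generator_loss` have the form of a least-squares GAN objective, where real scores are pushed towards 1 and generated scores towards 0 (discriminator) or 1 (generator). A dependency-free sketch for a single discriminator, using plain floats instead of tensors; the `lsgan_*` names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_scores, fake_scores):
    # Real samples should score 1, generated samples should score 0.
    real_term = mean([(1 - r) ** 2 for r in real_scores])
    fake_term = mean([f ** 2 for f in fake_scores])
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    # The generator wants its samples to score 1.
    return mean([(1 - f) ** 2 for f in fake_scores])


print(lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0]))
print(lsgan_generator_loss([0.0, 0.0]))
```

The functions in the diff sum these terms over every sub-discriminator's output.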
if __name__ == '__main__':
model = BigVGAN()
c = torch.randn(3, 100, 10)
z = torch.randn(3, 64, 10)
print(c.shape)
y = model(c, z)
print(y.shape)
assert y.shape == torch.Size([3, 1, 2560])
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(pytorch_total_params)

@ -3,7 +3,6 @@ import torch.nn as nn
from tortoise.models.arch_util import Upsample, Downsample, normalization, zero_module, AttentionBlock
import tortoise.utils.torch_intermediary as ml
class ResBlock(nn.Module):
def __init__(
@ -125,8 +124,7 @@ class AudioMiniEncoderWithClassifierHead(nn.Module):
def __init__(self, classes, distribute_zero_label=True, **kwargs):
super().__init__()
self.enc = AudioMiniEncoder(**kwargs)
# nn.Linear
self.head = ml.Linear(self.enc.dim, classes)
self.head = nn.Linear(self.enc.dim, classes)
self.num_classes = classes
self.distribute_zero_label = distribute_zero_label

@ -7,9 +7,6 @@ from tortoise.models.arch_util import CheckpointedXTransformerEncoder
from tortoise.models.transformer import Transformer
from tortoise.models.xtransformers import Encoder
import tortoise.utils.torch_intermediary as ml
from tortoise.utils.device import print_stats, do_gc
def exists(val):
return val is not None
@ -47,15 +44,11 @@ class CLVP(nn.Module):
use_xformers=False,
):
super().__init__()
# nn.Embedding
self.text_emb = ml.Embedding(num_text_tokens, dim_text)
# nn.Linear
self.to_text_latent = ml.Linear(dim_text, dim_latent, bias=False)
self.text_emb = nn.Embedding(num_text_tokens, dim_text)
self.to_text_latent = nn.Linear(dim_text, dim_latent, bias=False)
# nn.Embedding
self.speech_emb = ml.Embedding(num_speech_tokens, dim_speech)
# nn.Linear
self.to_speech_latent = ml.Linear(dim_speech, dim_latent, bias=False)
self.speech_emb = nn.Embedding(num_speech_tokens, dim_speech)
self.to_speech_latent = nn.Linear(dim_speech, dim_latent, bias=False)
if use_xformers:
self.text_transformer = CheckpointedXTransformerEncoder(
@ -100,10 +93,8 @@ class CLVP(nn.Module):
self.wav_token_compression = wav_token_compression
self.xformers = use_xformers
if not use_xformers:
# nn.Embedding
self.text_pos_emb = ml.Embedding(text_seq_len, dim_text)
# nn.Embedding
self.speech_pos_emb = ml.Embedding(num_speech_tokens, dim_speech)
self.text_pos_emb = nn.Embedding(text_seq_len, dim_text)
self.speech_pos_emb = nn.Embedding(num_speech_tokens, dim_speech)
def forward(
self,
@ -126,13 +117,14 @@ class CLVP(nn.Module):
text_emb += self.text_pos_emb(torch.arange(text.shape[1], device=device))
speech_emb += self.speech_pos_emb(torch.arange(speech_emb.shape[1], device=device))
text_latents = self.to_text_latent(masked_mean(self.text_transformer(text_emb, mask=text_mask), text_mask, dim=1))
enc_text = self.text_transformer(text_emb, mask=text_mask)
enc_speech = self.speech_transformer(speech_emb, mask=voice_mask)
# on ROCm at least, allocated VRAM spikes here
do_gc()
speech_latents = self.to_speech_latent(masked_mean(self.speech_transformer(speech_emb, mask=voice_mask), voice_mask, dim=1))
do_gc()
text_latents = masked_mean(enc_text, text_mask, dim=1)
speech_latents = masked_mean(enc_speech, voice_mask, dim=1)
text_latents = self.to_text_latent(text_latents)
speech_latents = self.to_speech_latent(speech_latents)
text_latents, speech_latents = map(lambda t: F.normalize(t, p=2, dim=-1), (text_latents, speech_latents))

@ -6,7 +6,6 @@ from torch import einsum
from tortoise.models.arch_util import AttentionBlock
from tortoise.models.xtransformers import ContinuousTransformerWrapper, Encoder
import tortoise.utils.torch_intermediary as ml
def exists(val):
return val is not None
@ -55,8 +54,7 @@ class CollapsingTransformer(nn.Module):
class ConvFormatEmbedding(nn.Module):
def __init__(self, *args, **kwargs):
super().__init__()
# nn.Embedding
self.emb = ml.Embedding(*args, **kwargs)
self.emb = nn.Embedding(*args, **kwargs)
def forward(self, x):
y = self.emb(x)
@ -85,8 +83,7 @@ class CVVP(nn.Module):
nn.Conv1d(model_dim//2, model_dim, kernel_size=3, stride=2, padding=1))
self.conditioning_transformer = CollapsingTransformer(
model_dim, model_dim, transformer_heads, dropout, conditioning_enc_depth, cond_mask_percentage)
# nn.Linear
self.to_conditioning_latent = ml.Linear(
self.to_conditioning_latent = nn.Linear(
latent_dim, latent_dim, bias=False)
if mel_codes is None:
@ -96,8 +93,7 @@ class CVVP(nn.Module):
self.speech_emb = ConvFormatEmbedding(mel_codes, model_dim)
self.speech_transformer = CollapsingTransformer(
model_dim, latent_dim, transformer_heads, dropout, speech_enc_depth, speech_mask_percentage)
# nn.Linear
self.to_speech_latent = ml.Linear(
self.to_speech_latent = nn.Linear(
latent_dim, latent_dim, bias=False)
def get_grad_norm_parameter_groups(self):

@ -10,8 +10,6 @@ from torch import autocast
from tortoise.models.arch_util import normalization, AttentionBlock
from tortoise.utils.device import get_device_name
import tortoise.utils.torch_intermediary as ml
def is_latent(t):
return t.dtype == torch.float
@ -89,8 +87,7 @@ class ResBlock(TimestepBlock):
self.emb_layers = nn.Sequential(
nn.SiLU(),
# nn.Linear
ml.Linear(
nn.Linear(
emb_channels,
2 * self.out_channels if use_scale_shift_norm else self.out_channels,
),
@ -163,19 +160,16 @@ class DiffusionTts(nn.Module):
self.inp_block = nn.Conv1d(in_channels, model_channels, 3, 1, 1)
self.time_embed = nn.Sequential(
# nn.Linear
ml.Linear(model_channels, model_channels),
nn.Linear(model_channels, model_channels),
nn.SiLU(),
# nn.Linear
ml.Linear(model_channels, model_channels),
nn.Linear(model_channels, model_channels),
)
# Either code_converter or latent_converter is used, depending on what type of conditioning data is fed.
# This model is meant to be able to be trained on both for efficiency purposes - it is far less computationally
# complex to generate tokens, while generating latents will normally mean propagating through a deep autoregressive
# transformer network.
# nn.Embedding
self.code_embedding = ml.Embedding(in_tokens, model_channels)
self.code_embedding = nn.Embedding(in_tokens, model_channels)
self.code_converter = nn.Sequential(
AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),

@ -4,7 +4,6 @@ import torch
import torch.nn as nn
import torch.nn.functional as F
import tortoise.utils.torch_intermediary as ml
def fused_leaky_relu(input, bias=None, negative_slope=0.2, scale=2 ** 0.5):
if bias is not None:
@ -42,8 +41,7 @@ class RandomLatentConverter(nn.Module):
def __init__(self, channels):
super().__init__()
self.layers = nn.Sequential(*[EqualLinear(channels, channels, lr_mul=.1) for _ in range(5)],
# nn.Linear
ml.Linear(channels, channels))
nn.Linear(channels, channels))
self.channels = channels
def forward(self, ref):

@ -6,7 +6,6 @@ from einops import rearrange
from rotary_embedding_torch import RotaryEmbedding, broadcat
from torch import nn
import tortoise.utils.torch_intermediary as ml
# helpers
@ -121,12 +120,10 @@ class FeedForward(nn.Module):
def __init__(self, dim, dropout = 0., mult = 4.):
super().__init__()
self.net = nn.Sequential(
# nn.Linear
ml.Linear(dim, dim * mult * 2),
nn.Linear(dim, dim * mult * 2),
GEGLU(),
nn.Dropout(dropout),
# nn.Linear
ml.Linear(dim * mult, dim)
nn.Linear(dim * mult, dim)
)
def forward(self, x):
@ -145,11 +142,9 @@ class Attention(nn.Module):
self.causal = causal
# nn.Linear
self.to_qkv = ml.Linear(dim, inner_dim * 3, bias = False)
self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
self.to_out = nn.Sequential(
# nn.Linear
ml.Linear(inner_dim, dim),
nn.Linear(inner_dim, dim),
nn.Dropout(dropout)
)

@ -8,8 +8,6 @@ import torch.nn.functional as F
from einops import rearrange, repeat
from torch import nn, einsum
import tortoise.utils.torch_intermediary as ml
DEFAULT_DIM_HEAD = 64
Intermediates = namedtuple('Intermediates', [
@ -123,8 +121,7 @@ class AbsolutePositionalEmbedding(nn.Module):
def __init__(self, dim, max_seq_len):
super().__init__()
self.scale = dim ** -0.5
# nn.Embedding
self.emb = ml.Embedding(max_seq_len, dim)
self.emb = nn.Embedding(max_seq_len, dim)
def forward(self, x):
n = torch.arange(x.shape[1], device=x.device)
@ -153,8 +150,7 @@ class RelativePositionBias(nn.Module):
self.causal = causal
self.num_buckets = num_buckets
self.max_distance = max_distance
# nn.Embedding
self.relative_attention_bias = ml.Embedding(num_buckets, heads)
self.relative_attention_bias = nn.Embedding(num_buckets, heads)
@staticmethod
def _relative_position_bucket(relative_position, causal=True, num_buckets=32, max_distance=128):
@ -354,8 +350,7 @@ class RMSScaleShiftNorm(nn.Module):
self.scale = dim ** -0.5
self.eps = eps
self.g = nn.Parameter(torch.ones(dim))
# nn.Linear
self.scale_shift_process = ml.Linear(dim * 2, dim * 2)
self.scale_shift_process = nn.Linear(dim * 2, dim * 2)
def forward(self, x, norm_scale_shift_inp):
norm = torch.norm(x, dim=-1, keepdim=True) * self.scale
@ -435,8 +430,7 @@ class GLU(nn.Module):
def __init__(self, dim_in, dim_out, activation):
super().__init__()
self.act = activation
# nn.Linear
self.proj = ml.Linear(dim_in, dim_out * 2)
self.proj = nn.Linear(dim_in, dim_out * 2)
def forward(self, x):
x, gate = self.proj(x).chunk(2, dim=-1)
@ -461,8 +455,7 @@ class FeedForward(nn.Module):
activation = ReluSquared() if relu_squared else nn.GELU()
project_in = nn.Sequential(
# nn.Linear
ml.Linear(dim, inner_dim),
nn.Linear(dim, inner_dim),
activation
) if not glu else GLU(dim, inner_dim, activation)
@ -470,8 +463,7 @@ class FeedForward(nn.Module):
project_in,
nn.LayerNorm(inner_dim) if post_act_ln else nn.Identity(),
nn.Dropout(dropout),
# nn.Linear
ml.Linear(inner_dim, dim_out)
nn.Linear(inner_dim, dim_out)
)
# init last linear layer to 0
@ -524,20 +516,16 @@ class Attention(nn.Module):
qk_dim = int(collab_compression * qk_dim)
self.collab_mixing = nn.Parameter(torch.randn(heads, qk_dim))
# nn.Linear
self.to_q = ml.Linear(dim, qk_dim, bias=False)
# nn.Linear
self.to_k = ml.Linear(dim, qk_dim, bias=False)
# nn.Linear
self.to_v = ml.Linear(dim, v_dim, bias=False)
self.to_q = nn.Linear(dim, qk_dim, bias=False)
self.to_k = nn.Linear(dim, qk_dim, bias=False)
self.to_v = nn.Linear(dim, v_dim, bias=False)
self.dropout = nn.Dropout(dropout)
# add GLU gating for aggregated values, from alphafold2
self.to_v_gate = None
if gate_values:
# nn.Linear
self.to_v_gate = ml.Linear(dim, v_dim)
self.to_v_gate = nn.Linear(dim, v_dim)
nn.init.constant_(self.to_v_gate.weight, 0)
nn.init.constant_(self.to_v_gate.bias, 1)
@ -573,8 +561,7 @@ class Attention(nn.Module):
# attention on attention
self.attn_on_attn = on_attn
# nn.Linear
self.to_out = nn.Sequential(ml.Linear(v_dim, dim * 2), nn.GLU()) if on_attn else ml.Linear(v_dim, dim)
self.to_out = nn.Sequential(nn.Linear(v_dim, dim * 2), nn.GLU()) if on_attn else nn.Linear(v_dim, dim)
self.rel_pos_bias = rel_pos_bias
if rel_pos_bias:
@ -1064,8 +1051,7 @@ class ViTransformerWrapper(nn.Module):
self.patch_size = patch_size
self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
# nn.Linear
self.patch_to_embedding = ml.Linear(patch_dim, dim)
self.patch_to_embedding = nn.Linear(patch_dim, dim)
self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
self.dropout = nn.Dropout(emb_dropout)
@ -1123,21 +1109,18 @@ class TransformerWrapper(nn.Module):
self.max_mem_len = max_mem_len
self.shift_mem_down = shift_mem_down
# nn.Embedding
self.token_emb = ml.Embedding(num_tokens, emb_dim)
self.token_emb = nn.Embedding(num_tokens, emb_dim)
self.pos_emb = AbsolutePositionalEmbedding(emb_dim, max_seq_len) if (
use_pos_emb and not attn_layers.has_pos_emb) else always(0)
self.emb_dropout = nn.Dropout(emb_dropout)
# nn.Linear
self.project_emb = ml.Linear(emb_dim, dim) if emb_dim != dim else nn.Identity()
self.project_emb = nn.Linear(emb_dim, dim) if emb_dim != dim else nn.Identity()
self.attn_layers = attn_layers
self.norm = nn.LayerNorm(dim)
self.init_()
# nn.Linear
self.to_logits = ml.Linear(dim, num_tokens) if not tie_embedding else lambda t: t @ self.token_emb.weight.t()
self.to_logits = nn.Linear(dim, num_tokens) if not tie_embedding else lambda t: t @ self.token_emb.weight.t()
# memory tokens (like [cls]) from Memory Transformers paper
num_memory_tokens = default(num_memory_tokens, 0)
@ -1224,14 +1207,12 @@ class ContinuousTransformerWrapper(nn.Module):
use_pos_emb and not attn_layers.has_pos_emb) else always(0)
self.emb_dropout = nn.Dropout(emb_dropout)
# nn.Linear
self.project_in = ml.Linear(dim_in, dim) if exists(dim_in) else nn.Identity()
self.project_in = nn.Linear(dim_in, dim) if exists(dim_in) else nn.Identity()
self.attn_layers = attn_layers
self.norm = nn.LayerNorm(dim)
# nn.Linear
self.project_out = ml.Linear(dim, dim_out) if exists(dim_out) else nn.Identity()
self.project_out = nn.Linear(dim, dim_out) if exists(dim_out) else nn.Identity()
def forward(
self,

@ -17,7 +17,6 @@ if __name__ == '__main__':
'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='pat')
parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/')
parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
parser.add_argument('--use_deepspeed', type=bool, help='Use deepspeed for speed bump.', default=True)
parser.add_argument('--regenerate', type=str, help='Comma-separated list of clip numbers to re-generate, or nothing.', default=None)
parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice. Only the first candidate is actually used in the final product, the others can be used manually.', default=1)
parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this'
@ -26,7 +25,7 @@ if __name__ == '__main__':
parser.add_argument('--produce_debug_state', type=bool, help='Whether or not to produce debug_state.pth, which can aid in reproducing problems. Defaults to true.', default=True)
args = parser.parse_args()
tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed)
tts = TextToSpeech(models_dir=args.model_dir)
outpath = args.output_path
selected_voices = args.voice.split(',')
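Review note: the `--use_deepspeed` and `--produce_debug_state` options in this hunk are declared with `type=bool`. `argparse` calls `bool()` on the raw string, so any non-empty value, including `"False"`, parses as `True`. The sketch below shows the pitfall and one possible workaround; `str2bool` is a hypothetical helper, not something the repository provides.

```python
import argparse


def str2bool(value):
    # Hypothetical helper: map common spellings onto real booleans.
    text = str(value).strip().lower()
    if text in ("1", "true", "yes", "y", "on"):
        return True
    if text in ("0", "false", "no", "n", "off"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parsers():
    naive = argparse.ArgumentParser()
    naive.add_argument("--use_deepspeed", type=bool, default=True)

    strict = argparse.ArgumentParser()
    strict.add_argument("--use_deepspeed", type=str2bool, default=True)
    return naive, strict


naive, strict = build_parsers()
naive_value = naive.parse_args(["--use_deepspeed", "False"]).use_deepspeed
strict_value = strict.parse_args(["--use_deepspeed", "False"]).use_deepspeed
print(naive_value, strict_value)
```

With the naive parser, the only way to get `False` is to pass an empty string, which is rarely what a user intends.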

@ -2,7 +2,6 @@ import os
from glob import glob
import librosa
import soundfile as sf
import torch
import torchaudio
import numpy as np
@ -25,9 +24,6 @@ def load_audio(audiopath, sampling_rate):
elif audiopath[-4:] == '.mp3':
audio, lsr = librosa.load(audiopath, sr=sampling_rate)
audio = torch.FloatTensor(audio)
elif audiopath[-5:] == '.flac':
audio, lsr = sf.read(audiopath)
audio = torch.FloatTensor(audio)
else:
assert False, f"Unsupported audio format provided: {audiopath[-4:]}"
@ -89,94 +85,31 @@ def get_voices(extra_voice_dirs=[], load_latents=True):
for sub in subs:
subj = os.path.join(d, sub)
if os.path.isdir(subj):
voices[sub] = list(glob(f'{subj}/*.wav')) + list(glob(f'{subj}/*.mp3')) + list(glob(f'{subj}/*.flac'))
voices[sub] = list(glob(f'{subj}/*.wav')) + list(glob(f'{subj}/*.mp3'))
if load_latents:
voices[sub] = voices[sub] + list(glob(f'{subj}/*.pth'))
return voices
def get_voice( name, dir=get_voice_dir(), load_latents=True, extensions=["wav", "mp3", "flac"] ):
subj = f'{dir}/{name}/'
if not os.path.isdir(subj):
return
files = os.listdir(subj)
if load_latents:
extensions.append("pth")
voice = []
for file in files:
ext = os.path.splitext(file)[-1][1:]
if ext not in extensions:
continue
voice.append(f'{subj}/{file}')
return sorted( voice )
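Review note: `get_voice` appends `"pth"` to its `extensions` argument, whose default is a list literal. Python creates that default once, so each call with `load_latents=True` grows the same list. The extension check still works, but the list keeps accumulating duplicates. A minimal sketch of the effect and one way around it; both function names are hypothetical.

```python
def collect_extensions_shared(load_latents=True, extensions=["wav", "mp3", "flac"]):
    # Mirrors the pattern in get_voice: the default list object is reused across calls.
    if load_latents:
        extensions.append("pth")
    return extensions


def collect_extensions_copied(load_latents=True, extensions=("wav", "mp3", "flac")):
    # Copy first, so the caller's (or the default) sequence is never mutated.
    allowed = list(extensions)
    if load_latents:
        allowed.append("pth")
    return allowed


first = list(collect_extensions_shared())
second = list(collect_extensions_shared())
print(len(first), len(second))
print(len(collect_extensions_copied()), len(collect_extensions_copied()))
```

The same mutation applies to any list a caller passes in explicitly, which is why copying is the safer choice here.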
def get_voice_list(dir=get_voice_dir(), append_defaults=False, load_latents=True, extensions=["wav", "mp3", "flac"]):
defaults = [ "random", "microphone" ]
os.makedirs(dir, exist_ok=True)
#res = sorted([d for d in os.listdir(dir) if d not in defaults and os.path.isdir(os.path.join(dir, d)) and len(os.listdir(os.path.join(dir, d))) > 0 ])
res = []
for name in os.listdir(dir):
if name in defaults:
continue
if not os.path.isdir(f'{dir}/{name}'):
continue
if len(os.listdir(os.path.join(dir, name))) == 0:
continue
files = get_voice( name, dir=dir, extensions=extensions, load_latents=load_latents )
if len(files) > 0:
res.append(name)
else:
for subdir in os.listdir(f'{dir}/{name}'):
if not os.path.isdir(f'{dir}/{name}/{subdir}'):
continue
files = get_voice( f'{name}/{subdir}', dir=dir, extensions=extensions, load_latents=load_latents )
if len(files) == 0:
continue
res.append(f'{name}/{subdir}')
res = sorted(res)
if append_defaults:
res = res + defaults
return res
def _get_voices( dirs=[get_voice_dir()], load_latents=True ):
voices = {}
for dir in dirs:
voice_list = get_voice_list(dir=dir)
voices |= { name: get_voice(name=name, dir=dir, load_latents=load_latents) for name in voice_list }
return voices
def load_voice(voice, extra_voice_dirs=[], load_latents=True, sample_rate=22050, device='cpu', model_hash=None):
def load_voice(voice, extra_voice_dirs=[], load_latents=True, sample_rate=22050, device='cpu'):
if voice == 'random':
return None, None
voices = _get_voices(dirs=[get_voice_dir()] + extra_voice_dirs, load_latents=load_latents)
voices = get_voices(extra_voice_dirs=extra_voice_dirs, load_latents=load_latents)
paths = voices[voice]
mtime = 0
latent = None
voices = []
for path in paths:
filename = os.path.basename(path)
if filename[-4:] == ".pth" and filename[:12] == "cond_latents":
if not model_hash and filename == "cond_latents.pth":
latent = path
elif model_hash and filename == f"cond_latents_{model_hash[:8]}.pth":
latent = path
mtime = 0
voices = []
latent = None
for file in paths:
if file[-16:] == "cond_latents.pth":
latent = file
elif file[-4:] == ".pth":
{}
# noop
else:
voices.append(path)
mtime = max(mtime, os.path.getmtime(path))
voices.append(file)
mtime = max(mtime, os.path.getmtime(file))
if load_latents and latent is not None:
if os.path.getmtime(latent) > mtime:
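
Review note: one variant of `load_voice` in this hunk looks for `cond_latents_{model_hash[:8]}.pth` when a model hash is given and for `cond_latents.pth` otherwise. The selection rule can be sketched in isolation as below. The function name `pick_latent` is hypothetical, and the sketch assumes that every other `.pth` file is ignored, as the simpler variant does.

```python
import os

def pick_latent(paths, model_hash=None):
    """Split a voice's files into (audio_paths, latent_path_or_None)."""
    if model_hash:
        wanted = f'cond_latents_{model_hash[:8]}.pth'
    else:
        wanted = 'cond_latents.pth'
    audio, latent = [], None
    for path in paths:
        name = os.path.basename(path)
        if name.endswith('.pth'):
            if name == wanted:
                latent = path
            continue  # assumption: other .pth files are skipped
        audio.append(path)
    return audio, latent

files = ['v/a.wav', 'v/cond_latents.pth', 'v/cond_latents_deadbeef.pth']
audio_hashed, latent_hashed = pick_latent(files, model_hash='deadbeefcafe')
audio_plain, latent_plain = pick_latent(files)
```

Keying the cached latents by a hash prefix keeps latents computed with one model from being reused with another.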


@ -1,130 +1,97 @@
import torch
import psutil
import importlib
DEVICE_OVERRIDE = None
DEVICE_BATCH_SIZE_MAP = [(14, 16), (10,8), (7,4)]
from inspect import currentframe, getframeinfo
import gc
def do_gc():
gc.collect()
try:
torch.cuda.empty_cache()
except Exception as e:
pass
def print_stats(collect=False):
cf = currentframe().f_back
msg = f'{getframeinfo(cf).filename}:{cf.f_lineno}'
if collect:
do_gc()
tot = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
res = torch.cuda.memory_reserved(0) / (1024 ** 3)
alloc = torch.cuda.memory_allocated(0) / (1024 ** 3)
print("[{}] Total: {:.3f} | Reserved: {:.3f} | Allocated: {:.3f} | Free: {:.3f}".format( msg, tot, res, alloc, tot-res ))
def has_dml():
loader = importlib.find_loader('torch_directml')
if loader is None:
return False
import torch_directml
return torch_directml.is_available()
def set_device_name(name):
global DEVICE_OVERRIDE
DEVICE_OVERRIDE = name
def get_device_name(attempt_gc=True):
global DEVICE_OVERRIDE
if DEVICE_OVERRIDE is not None and DEVICE_OVERRIDE != "":
return DEVICE_OVERRIDE
name = 'cpu'
if torch.cuda.is_available():
name = 'cuda'
if attempt_gc:
torch.cuda.empty_cache() # may have performance implications
elif has_dml():
name = 'dml'
return name
def get_device(verbose=False):
name = get_device_name()
if verbose:
if name == 'cpu':
print("No hardware acceleration is available, falling back to CPU...")
else:
print(f"Hardware acceleration found: {name}")
if name == "dml":
import torch_directml
return torch_directml.device()
return torch.device(name)
def get_device_vram( name=get_device_name() ):
available = 1
if name == "cuda":
_, available = torch.cuda.mem_get_info()
elif name == "cpu":
available = psutil.virtual_memory()[4]
return available / (1024 ** 3)
def get_device_batch_size(name=get_device_name()):
vram = get_device_vram(name)
if vram > 14:
return 16
elif vram > 10:
return 8
elif vram > 7:
return 4
"""
for k, v in DEVICE_BATCH_SIZE_MAP:
if vram > k:
return v
"""
return 1
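
Review note: the `if`/`elif` ladder above and the commented-out loop over `DEVICE_BATCH_SIZE_MAP` encode the same thresholds. A minimal sketch of the table-driven form follows. The function name `batch_size_for` is hypothetical, and the sketch assumes the table stays sorted from the largest threshold to the smallest.

```python
# (threshold in GiB, batch size), largest threshold first
DEVICE_BATCH_SIZE_MAP = [(14, 16), (10, 8), (7, 4)]

def batch_size_for(vram_gb, table=DEVICE_BATCH_SIZE_MAP):
    # Return the batch size of the first threshold that is strictly exceeded.
    for threshold, batch_size in table:
        if vram_gb > threshold:
            return batch_size
    return 1  # fallback when no threshold is exceeded

sizes = [batch_size_for(v) for v in (16, 14, 8, 7, 2)]
```

The comparison is strict, so a device reporting exactly 14 GiB free falls into the next tier down.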
def get_device_count(name=get_device_name()):
if name == "cuda":
return torch.cuda.device_count()
if name == "dml":
import torch_directml
return torch_directml.device_count()
return 1
# if you're getting errors make sure you've updated your torch-directml, and if you're still getting errors then you can uncomment the below block
"""
if has_dml():
_cumsum = torch.cumsum
_repeat_interleave = torch.repeat_interleave
_multinomial = torch.multinomial
_Tensor_new = torch.Tensor.new
_Tensor_cumsum = torch.Tensor.cumsum
_Tensor_repeat_interleave = torch.Tensor.repeat_interleave
_Tensor_multinomial = torch.Tensor.multinomial
torch.cumsum = lambda input, *args, **kwargs: ( _cumsum(input.to("cpu"), *args, **kwargs).to(input.device) )
torch.repeat_interleave = lambda input, *args, **kwargs: ( _repeat_interleave(input.to("cpu"), *args, **kwargs).to(input.device) )
torch.multinomial = lambda input, *args, **kwargs: ( _multinomial(input.to("cpu"), *args, **kwargs).to(input.device) )
torch.Tensor.new = lambda self, *args, **kwargs: ( _Tensor_new(self.to("cpu"), *args, **kwargs).to(self.device) )
torch.Tensor.cumsum = lambda self, *args, **kwargs: ( _Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device) )
torch.Tensor.repeat_interleave = lambda self, *args, **kwargs: ( _Tensor_repeat_interleave(self.to("cpu"), *args, **kwargs).to(self.device) )
torch.Tensor.multinomial = lambda self, *args, **kwargs: ( _Tensor_multinomial(self.to("cpu"), *args, **kwargs).to(self.device) )
"""
import torch
import psutil
import importlib
DEVICE_OVERRIDE = None
def has_dml():
loader = importlib.find_loader('torch_directml')
if loader is None:
return False
import torch_directml
return torch_directml.is_available()
def set_device_name(name):
global DEVICE_OVERRIDE
DEVICE_OVERRIDE = name
def get_device_name():
global DEVICE_OVERRIDE
if DEVICE_OVERRIDE is not None and DEVICE_OVERRIDE != "":
return DEVICE_OVERRIDE
name = 'cpu'
if torch.cuda.is_available():
name = 'cuda'
elif has_dml():
name = 'dml'
return name
def get_device(verbose=False):
name = get_device_name()
if verbose:
if name == 'cpu':
print("No hardware acceleration is available, falling back to CPU...")
else:
print(f"Hardware acceleration found: {name}")
if name == "dml":
import torch_directml
return torch_directml.device()
return torch.device(name)
def get_device_batch_size():
    available = 1
    name = get_device_name()
    if name == "dml":
        # there's nothing publicly accessible in the DML API that exposes this
        # there's a method to get currently used RAM statistics... as tiles
        available = 1
    elif name == "cuda":
        _, available = torch.cuda.mem_get_info()
    elif name == "cpu":
        available = psutil.virtual_memory()[4]
    availableGb = available / (1024 ** 3)
    if availableGb > 14:
        return 16
    elif availableGb > 10:
        return 8
    elif availableGb > 7:
        return 4
    return 1
def get_device_count(name=get_device_name()):
if name == "cuda":
return torch.cuda.device_count()
if name == "dml":
import torch_directml
return torch_directml.device_count()
return 1
if has_dml():
_cumsum = torch.cumsum
_repeat_interleave = torch.repeat_interleave
_multinomial = torch.multinomial
_Tensor_new = torch.Tensor.new
_Tensor_cumsum = torch.Tensor.cumsum
_Tensor_repeat_interleave = torch.Tensor.repeat_interleave
_Tensor_multinomial = torch.Tensor.multinomial
torch.cumsum = lambda input, *args, **kwargs: ( _cumsum(input.to("cpu"), *args, **kwargs).to(input.device) )
torch.repeat_interleave = lambda input, *args, **kwargs: ( _repeat_interleave(input.to("cpu"), *args, **kwargs).to(input.device) )
torch.multinomial = lambda input, *args, **kwargs: ( _multinomial(input.to("cpu"), *args, **kwargs).to(input.device) )
torch.Tensor.new = lambda self, *args, **kwargs: ( _Tensor_new(self.to("cpu"), *args, **kwargs).to(self.device) )
torch.Tensor.cumsum = lambda self, *args, **kwargs: ( _Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device) )
torch.Tensor.repeat_interleave = lambda self, *args, **kwargs: ( _Tensor_repeat_interleave(self.to("cpu"), *args, **kwargs).to(self.device) )
torch.Tensor.multinomial = lambda self, *args, **kwargs: ( _Tensor_multinomial(self.to("cpu"), *args, **kwargs).to(self.device) )
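
Review note: `has_dml` above probes for `torch_directml` with `importlib.find_loader`, which has long been deprecated in favor of `importlib.util.find_spec` and, as far as I know, is no longer available in recent Python releases. A minimal sketch of the same probe with `find_spec` follows. The helper name `has_module` is hypothetical.

```python
import importlib.util

def has_module(name):
    # find_spec returns None when a top-level module cannot be located,
    # and it does not import the module itself.
    return importlib.util.find_spec(name) is not None

json_present = has_module('json')
bogus_present = has_module('surely_not_installed_module_xyz')
```

A `has_dml` built on this would still need the `torch_directml.is_available()` call afterwards, since the package being installed does not guarantee a usable device.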


@ -13,7 +13,15 @@ import math
import numpy as np
import torch
import torch as th
from tqdm.auto import tqdm
from tqdm import tqdm
def tqdm_override(arr, verbose=False, progress=None, desc=None):
    if verbose and desc is not None:
        print(desc)
    if progress is None:
        return tqdm(arr, disable=not verbose)
    return progress.tqdm(arr, desc=f'{progress.msg_prefix} {desc}' if hasattr(progress, 'msg_prefix') else desc, track_tqdm=True)
def normal_kl(mean1, logvar1, mean2, logvar2):
"""
@ -548,6 +556,7 @@ class GaussianDiffusion:
model_kwargs=None,
device=None,
verbose=False,
progress=None,
desc=None
):
"""
@ -580,6 +589,7 @@ class GaussianDiffusion:
model_kwargs=model_kwargs,
device=device,
verbose=verbose,
progress=progress,
desc=desc
):
final = sample
@ -596,6 +606,7 @@ class GaussianDiffusion:
model_kwargs=None,
device=None,
verbose=False,
progress=None,
desc=None
):
"""
@ -615,7 +626,7 @@ class GaussianDiffusion:
img = th.randn(*shape, device=device)
indices = list(range(self.num_timesteps))[::-1]
for i in tqdm(indices, desc=desc):
for i in tqdm_override(indices, verbose=verbose, desc=desc, progress=progress):
t = th.tensor([i] * shape[0], device=device)
with th.no_grad():
out = self.p_sample(
@ -730,6 +741,7 @@ class GaussianDiffusion:
device=None,
verbose=False,
eta=0.0,
progress=None,
desc=None,
):
"""
@ -749,6 +761,7 @@ class GaussianDiffusion:
device=device,
verbose=verbose,
eta=eta,
progress=progress,
desc=desc
):
final = sample
@ -766,6 +779,7 @@ class GaussianDiffusion:
device=None,
verbose=False,
eta=0.0,
progress=None,
desc=None,
):
"""
@ -784,7 +798,10 @@ class GaussianDiffusion:
indices = list(range(self.num_timesteps))[::-1]
if verbose:
indices = tqdm(indices, desc=desc)
# Lazy import so that we don't depend on tqdm.
from tqdm.auto import tqdm
indices = tqdm_override(indices, verbose=verbose, desc=desc, progress=progress)
for i in indices:
t = th.tensor([i] * shape[0], device=device)
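
Review note: the hunks above thread a `progress` argument through the sampling loops and route iteration through `tqdm_override`, which uses console `tqdm` when no progress object is given and otherwise defers to the object's own `.tqdm()` method. A minimal sketch of that dispatch follows. `FakeProgress` is a hypothetical stand-in for whatever UI object the caller supplies, and the `ImportError` fallback exists only so the sketch runs without `tqdm` installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, disable=False, **kwargs):
        # Fallback for the sketch only: plain iteration, no progress bar.
        return iterable

class FakeProgress:
    """Hypothetical UI progress object exposing a tqdm-like method."""
    msg_prefix = '[1/2]'

    def __init__(self):
        self.descriptions = []

    def tqdm(self, iterable, desc=None, **kwargs):
        self.descriptions.append(desc)
        return iter(iterable)

def tqdm_override(arr, verbose=False, progress=None, desc=None):
    if verbose and desc is not None:
        print(desc)
    if progress is None:
        # Console path: the bar is hidden unless verbose is set.
        return tqdm(arr, disable=not verbose)
    prefix = getattr(progress, 'msg_prefix', None)
    label = f'{prefix} {desc}' if prefix is not None else desc
    return progress.tqdm(arr, desc=label, track_tqdm=True)

ui = FakeProgress()
via_ui = list(tqdm_override([1, 2, 3], progress=ui, desc='Diffusion'))
via_console = list(tqdm_override(range(3)))
```

Passing `progress=None` keeps the command-line scripts working unchanged, while a web UI can hand in its own tracker.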


@ -1,6 +1,5 @@
import os
import re
import json
import inflect
import torch
@ -171,39 +170,16 @@ DEFAULT_VOCAB_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), '
class VoiceBpeTokenizer:
def __init__(self, vocab_file=DEFAULT_VOCAB_FILE, preprocess=None):
with open(vocab_file, 'r', encoding='utf-8') as f:
vocab = json.load(f)
self.language = vocab['model']['language'] if 'language' in vocab['model'] else None
if preprocess is None:
self.preprocess = 'pre_tokenizer' in vocab and vocab['pre_tokenizer']
else:
self.preprocess = preprocess
def __init__(self, vocab_file=DEFAULT_VOCAB_FILE):
if vocab_file is not None:
self.tokenizer = Tokenizer.from_file(vocab_file)
def preprocess_text(self, txt):
if self.language == 'ja':
import pykakasi
kks = pykakasi.kakasi()
results = kks.convert(txt)
words = []
for result in results:
words.append(result['kana'])
txt = " ".join(words)
txt = basic_cleaners(txt)
else:
txt = english_cleaners(txt)
txt = english_cleaners(txt)
return txt
def encode(self, txt):
if self.preprocess:
txt = self.preprocess_text(txt)
txt = self.preprocess_text(txt)
txt = txt.replace(' ', '[SPACE]')
return self.tokenizer.encode(txt).ids


@ -1,65 +0,0 @@
"""
from bitsandbytes.nn import Linear8bitLt as Linear
from bitsandbytes.nn import StableEmbedding as Embedding
from bitsandbytes.optim.adam import Adam8bit as Adam
from bitsandbytes.optim.adamw import AdamW8bit as AdamW
"""
"""
from torch.nn import Linear
from torch.nn import Embedding
from torch.optim.adam import Adam
from torch.optim.adamw import AdamW
"""
"""
OVERRIDE_LINEAR = False
OVERRIDE_EMBEDDING = False
OVERRIDE_ADAM = False # True
OVERRIDE_ADAMW = False # True
"""
import os
USE_STABLE_EMBEDDING = False
try:
    OVERRIDE_LINEAR = False
    OVERRIDE_EMBEDDING = False
    OVERRIDE_ADAM = False
    OVERRIDE_ADAMW = False
    USE_STABLE_EMBEDDING = os.environ.get('BITSANDBYTES_USE_STABLE_EMBEDDING', '1' if USE_STABLE_EMBEDDING else '0') == '1'
    OVERRIDE_LINEAR = os.environ.get('BITSANDBYTES_OVERRIDE_LINEAR', '1' if OVERRIDE_LINEAR else '0') == '1'
    OVERRIDE_EMBEDDING = os.environ.get('BITSANDBYTES_OVERRIDE_EMBEDDING', '1' if OVERRIDE_EMBEDDING else '0') == '1'
    OVERRIDE_ADAM = os.environ.get('BITSANDBYTES_OVERRIDE_ADAM', '1' if OVERRIDE_ADAM else '0') == '1'
    OVERRIDE_ADAMW = os.environ.get('BITSANDBYTES_OVERRIDE_ADAMW', '1' if OVERRIDE_ADAMW else '0') == '1'
    if OVERRIDE_LINEAR or OVERRIDE_EMBEDDING or OVERRIDE_ADAM or OVERRIDE_ADAMW:
        import bitsandbytes as bnb
except Exception as e:
    OVERRIDE_LINEAR = False
    OVERRIDE_EMBEDDING = False
    OVERRIDE_ADAM = False
    OVERRIDE_ADAMW = False
if OVERRIDE_LINEAR:
    from bitsandbytes.nn import Linear8bitLt as Linear
else:
    from torch.nn import Linear
if OVERRIDE_EMBEDDING:
    if USE_STABLE_EMBEDDING:
        from bitsandbytes.nn import StableEmbedding as Embedding
    else:
        from bitsandbytes.nn.modules import Embedding as Embedding
else:
    from torch.nn import Embedding
if OVERRIDE_ADAM:
    from bitsandbytes.optim.adam import Adam8bit as Adam
else:
    from torch.optim.adam import Adam
if OVERRIDE_ADAMW:
    from bitsandbytes.optim.adamw import AdamW8bit as AdamW
else:
    from torch.optim.adamw import AdamW
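
Review note: the module above repeats one pattern for each toggle: read an environment variable, fall back to `'1'` or `'0'` depending on the in-code default, and compare the result with `'1'`. A minimal sketch that factors this into a helper follows. The name `env_flag` and the `environ` parameter are hypothetical and added only to make the behavior easy to exercise.

```python
import os

def env_flag(name, default=False, environ=os.environ):
    # Only the exact string '1' enables the flag; anything else disables it.
    return environ.get(name, '1' if default else '0') == '1'

fake_env = {
    'BITSANDBYTES_OVERRIDE_ADAMW': '1',
    'BITSANDBYTES_OVERRIDE_LINEAR': 'true',
}
adamw = env_flag('BITSANDBYTES_OVERRIDE_ADAMW', environ=fake_env)
linear = env_flag('BITSANDBYTES_OVERRIDE_LINEAR', environ=fake_env)
adam = env_flag('BITSANDBYTES_OVERRIDE_ADAM', default=True, environ=fake_env)
```

Note that a value such as `true` does not enable a toggle under this rule, which matches the comparisons in the module above.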


@ -7,8 +7,6 @@ from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor, Wav2Vec2CTCTo
from tortoise.utils.audio import load_audio
from tortoise.utils.device import get_device
import tortoise.utils.torch_intermediary as ml
def max_alignment(s1, s2, skip_character='~', record=None):
"""
A clever function that aligns s1 to s2 as best it can. Wherever a character from s1 is not found in s2, a '~' is
@ -144,7 +142,7 @@ class Wav2VecAlignment:
non_redacted_intervals = []
last_point = 0
for i in range(len(fully_split)):
if i % 2 == 0 and fully_split[i] != "": # Check for empty string fixes index error
if i % 2 == 0:
end_interval = max(0, last_point + len(fully_split[i]) - 1)
non_redacted_intervals.append((last_point, end_interval))
last_point += len(fully_split[i])
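
Review note: the hunk above toggles a guard that skips empty segments when building `non_redacted_intervals`. A minimal sketch of the interval computation follows, with the guard in place. The function name is hypothetical, and the bracket-splitting step is an assumption reconstructed from the variable name `fully_split`, since it is not shown in this hunk.

```python
def compute_non_redacted_intervals(text):
    # Assumed splitting: even-indexed pieces are spoken text,
    # odd-indexed pieces are the bracketed (redacted) parts.
    parts = text.split('[')
    fully_split = [parts[0]]
    for part in parts[1:]:
        assert ']' in part, 'every "[" needs a closing "]"'
        fully_split.extend(part.split(']', 1))

    intervals = []
    last_point = 0
    for i, chunk in enumerate(fully_split):
        if i % 2 == 0 and chunk != '':
            end_interval = max(0, last_point + len(chunk) - 1)
            intervals.append((last_point, end_interval))
        last_point += len(chunk)
    return intervals

with_prompt = compute_non_redacted_intervals('[hi] hello [x] world')
plain = compute_non_redacted_intervals('plain')
```

The offsets index into the text with the brackets stripped. Without the guard, a string that begins with a redacted span would produce a spurious `(0, 0)` interval for the empty leading segment.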

137
tortoise_tts.ipynb Executable file

@ -0,0 +1,137 @@
{
"nbformat":4,
"nbformat_minor":0,
"metadata":{
"colab":{
"private_outputs":true,
"provenance":[
]
},
"kernelspec":{
"name":"python3",
"display_name":"Python 3"
},
"language_info":{
"name":"python"
},
"accelerator":"GPU",
"gpuClass":"standard"
},
"cells":[
{
"cell_type":"markdown",
"source":[
"## Initialization"
],
"metadata":{
"id":"ni41hmE03DL6"
}
},
{
"cell_type":"code",
"execution_count":null,
"metadata":{
"id":"FtsMKKfH18iM"
},
"outputs":[
],
"source":[
"!git clone https://git.ecker.tech/mrq/ai-voice-cloning/\n",
"%cd ai-voice-cloning\n",
"!python -m pip install --upgrade pip\n",
"!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116\n",
"!python -m pip install -r ./requirements.txt"
]
},
{
"cell_type":"code",
"source":[
"# colab requires the runtime to restart before use\n",
"exit()"
],
"metadata":{
"id":"FVUOtSASCSJ8"
},
"execution_count":null,
"outputs":[
]
},
{
"cell_type":"markdown",
"source":[
"## Running"
],
"metadata":{
"id":"o1gkfw3B3JSk"
}
},
{
"cell_type":"code",
"source":[
"%cd /content/ai-voice-cloning\n",
"\n",
"import os\n",
"import sys\n",
"\n",
"sys.argv = [\"\"]\n",
"sys.path.append('./src/')\n",
"\n",
"if 'TORTOISE_MODELS_DIR' not in os.environ:\n",
"\tos.environ['TORTOISE_MODELS_DIR'] = os.path.realpath(os.path.join(os.getcwd(), './models/tortoise/'))\n",
"\n",
"if 'TRANSFORMERS_CACHE' not in os.environ:\n",
"\tos.environ['TRANSFORMERS_CACHE'] = os.path.realpath(os.path.join(os.getcwd(), './models/transformers/'))\n",
"\n",
"from utils import *\n",
"from webui import *\n",
"\n",
"args = setup_args()\n",
"\n",
"webui = setup_gradio()\n",
"tts = setup_tortoise()\n",
"webui.launch(share=True, prevent_thread_lock=True, height=1000)\n",
"webui.block_thread()"
],
"metadata":{
"id":"c_EQZLTA19c7"
},
"execution_count":null,
"outputs":[
]
},
{
"cell_type":"markdown",
"source":[
"## Exporting"
],
"metadata":{
"id":"2AnVQxEJx47p"
}
},
{
"cell_type":"code",
"source":[
"%cd /content/ai-voice-cloning\n",
"!apt install -y p7zip-full\n",
"from datetime import datetime\n",
"timestamp = datetime.now().strftime('%m-%d-%Y_%H:%M:%S')\n",
"!mkdir -p \"../{timestamp}\"\n",
"!mv ./results/* \"../{timestamp}/.\"\n",
"!7z a -t7z -m0=lzma2 -mx=9 -mfb=64 -md=32m -ms=on \"../{timestamp}.7z\" \"../{timestamp}/\"\n",
"!ls ~/\n",
"!echo \"Finished zipping, archive is available at {timestamp}.7z\""
],
"metadata":{
"id":"YOACiDCXx72G"
},
"execution_count":null,
"outputs":[
]
}
]
}

3
update-force.bat Executable file

@ -0,0 +1,3 @@
git fetch --all
git reset --hard origin/main
call .\update.bat

3
update-force.sh Executable file

@ -0,0 +1,3 @@
git fetch --all
git reset --hard origin/main
./update.sh

7
update.bat Executable file

@ -0,0 +1,7 @@
git pull
python -m venv tortoise-venv
call .\tortoise-venv\Scripts\activate.bat
python -m pip install --upgrade pip
python -m pip install -r ./requirements.txt
deactivate
pause

6
update.sh Executable file

@ -0,0 +1,6 @@
git pull
python -m venv tortoise-venv
source ./tortoise-venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r ./requirements.txt
deactivate

0
voices/.gitkeep Executable file

1103
webui.py Executable file

File diff suppressed because it is too large.