Added a small optimization that caches latents, dropped Anaconda in favor of a plain py3.9 + pip + venv setup, added helper install scripts for that setup, cleaned up app.py, and added the '--low-vram' flag to disable minor optimizations

mrq 2023-02-04 01:50:57 +00:00
parent 061aa65ac4
commit 4274cce218
6 changed files with 152 additions and 96 deletions


@ -26,21 +26,25 @@ Lots of available RAM seems to be a requirement, as I see Python eating up 8GiB
### Pre-Requirements ### Pre-Requirements
Anaconda: https://www.anaconda.com/products/distribution Python 3.9: https://www.python.org/downloads/release/python-3913/
Git (optional): https://git-scm.com/download/win Git (optional): https://git-scm.com/download/win
### Setup ### Setup
Download Anaconda and run the installer. Download Python and run the installer.
After installing `conda`, open the Start Menu and search for `Anaconda Powershell Prompt`. Type `cd `, then drag and drop the folder you want to work in (experienced users can just `cd <path>` directly). After installing python, open the Start Menu and search for `Command Prompt`. Type `cd `, then drag and drop the folder you want to work in (experienced users can just `cd <path>` directly).
Paste `git clone https://git.ecker.tech/mrq/tortoise-tts` to download TorToiSe and additional scripts. Inexperienced users can just download the repo as a ZIP, and extract. Paste `git clone https://git.ecker.tech/mrq/tortoise-tts` to download TorToiSe and additional scripts. Inexperienced users can just download the repo as a ZIP, and extract.
Then move into that folder with `cd tortoise-tts`. Afterwards, enter `setup.bat` to automatically enter all the remaining commands. Then move into that folder with `cd tortoise-tts`. Afterwards, enter `setup.bat` to automatically enter all the remaining commands.
If you've done everything right with installing Anaconda, you shouldn't have any errors. If you've done everything right, you shouldn't have any errors.
### Updating
To check for updates with the Web UI, simply enter `git pull` in the command prompt, while the TorToiSe workspace is the current working directory.
## Preparing Voice Samples ## Preparing Voice Samples
@ -64,7 +68,7 @@ After preparing your clips as WAV files at a sample rate of 22050 Hz, open up th
## Using the Software ## Using the Software
Now you're ready to generate clips. With the `conda` prompt still open, simply run the web UI with `python app.py`, and wait for it to print out a URL to open in your browser, something like `http://127.0.0.1:7861`. Now you're ready to generate clips. With the command prompt still open, simply enter `start.bat`, and wait for it to print out a URL to open in your browser, something like `http://127.0.0.1:7861`.
If you're looking to access your copy of TorToiSe from outside your local network, pass `--share` into the command (for example, `python app.py --share`). You'll get a temporary gradio link to use. If you're looking to access your copy of TorToiSe from outside your local network, pass `--share` into the command (for example, `python app.py --share`). You'll get a temporary gradio link to use.
@ -72,7 +76,7 @@ You'll be presented with a bunch of options, but do not be overwhelmed, as most
* `Text`: text you want to be read. You wrap text in `[brackets]` for "prompt engineering", where it'll affect the output, but those words won't actually be read. * `Text`: text you want to be read. You wrap text in `[brackets]` for "prompt engineering", where it'll affect the output, but those words won't actually be read.
* `Emotion`: the "emotion" used for the delivery. This is a shortcut to starting with `[I am really ${emotion}],` in your text box. I assume the emotion is deduced during the CLVP pass. * `Emotion`: the "emotion" used for the delivery. This is a shortcut to starting with `[I am really ${emotion}],` in your text box. I assume the emotion is deduced during the CLVP pass.
* `Voice`: the voice you want to clone. You can select `custom` if you want to use input from your microphone. * `Voice`: the voice you want to clone. You can select `custom` if you want to use input from your microphone.
* `Record voice`: Not required, unless you use `custom`. * `Microphone Source`: Not required, unless you use `custom`.
* `Preset`: shortcut values for sample count and iteration steps. Use `none` if you want to provide your own values. Better presets result in better quality at the cost of computation time. * `Preset`: shortcut values for sample count and iteration steps. Use `none` if you want to provide your own values. Better presets result in better quality at the cost of computation time.
* `Seed`: initializes the PRNG initially to this value, use this if you want to reproduce a generated voice. Currently, I don't have a way to expose the seed used. * `Seed`: initializes the PRNG initially to this value, use this if you want to reproduce a generated voice. Currently, I don't have a way to expose the seed used.
* `Candidates`: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but they only offer alternatives to the samples generated to pull from (in other words, the later candidates perform worse), so don't be compelled to generate a ton of candidates. * `Candidates`: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but they only offer alternatives to the samples generated to pull from (in other words, the later candidates perform worse), so don't be compelled to generate a ton of candidates.
@ -86,6 +90,8 @@ All outputs are saved under `./result/[voice name]/[timestamp]/` as `result.wav`
To save you from headaches, I strongly recommend playing around with shorter sentences first to find the right values for the voice you're using before generating longer sentences. To save you from headaches, I strongly recommend playing around with shorter sentences first to find the right values for the voice you're using before generating longer sentences.
As a quick optimization, I modified the script so that the `conditioning_latents` are saved after loading voice samples. If any voice sample has a modification time newer than this cached file, the script skips the cached latents and loads the normal WAVs instead.
## Example(s) ## Example(s)
Below are some outputs I deem substantial enough to share. As I continue delving into TorToiSe, I'll supply more examples and the values I use. Below are some outputs I deem substantial enough to share. As I continue delving into TorToiSe, I'll supply more examples and the values I use.
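
The latent caching note in the README change above boils down to a modification-time comparison. Below is a minimal sketch of that check, using only the standard library. The helper name `pick_voice_source` and its return shape are hypothetical and chosen for illustration; the commit's own logic lives in `load_voice` and returns loaded tensors rather than paths.

```python
import os
import tempfile


def pick_voice_source(paths):
    """Decide whether a cached latents file can stand in for the raw clips.

    Hypothetical helper: returns ("latents", path) when a .pth file is newer
    than every audio clip, otherwise ("audio", [clip paths]).
    """
    latent = None
    clips = []
    newest_clip = 0.0
    for path in paths:
        if path.endswith(".pth"):
            latent = path
        else:
            clips.append(path)
            newest_clip = max(newest_clip, os.path.getmtime(path))

    # The cache is only trusted when it was written after the newest clip.
    if latent is not None and os.path.getmtime(latent) > newest_clip:
        return "latents", latent
    return "audio", clips


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as voice_dir:
        clip = os.path.join(voice_dir, "sample.wav")
        cache = os.path.join(voice_dir, "latents.pth")
        for name in (clip, cache):
            open(name, "wb").close()

        # Set explicit timestamps so the demo does not depend on timing.
        os.utime(clip, (1000, 1000))
        os.utime(cache, (2000, 2000))
        print(pick_voice_source([clip, cache])[0])  # cache is newer

        os.utime(clip, (3000, 3000))
        print(pick_voice_source([clip, cache])[0])  # clip changed afterwards
```

Comparing against the newest clip rather than any single clip means that editing or adding one WAV is enough to invalidate the cache.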

116
app.py

@ -1,8 +1,10 @@
import os import os
import argparse import argparse
import gradio as gr import gradio as gr
import torch
import torchaudio import torchaudio
import time import time
from datetime import datetime from datetime import datetime
from tortoise.api import TextToSpeech from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices from tortoise.utils.audio import load_audio, load_voice, load_voices
@ -18,61 +20,49 @@ def inference(text, emotion, prompt, voice, mic_audio, preset, seed, candidates,
elif emotion != "None": elif emotion != "None":
text = f"[I am really {emotion.lower()},] {text}" text = f"[I am really {emotion.lower()},] {text}"
c = None
if voice == "microphone": if voice == "microphone":
if mic_audio is None: if mic_audio is None:
raise gr.Error("Please provide audio from mic when choosing `microphone` as a voice input") raise gr.Error("Please provide audio from mic when choosing `microphone` as a voice input")
c = load_audio(mic_audio, 22050) mic = load_audio(mic_audio, 22050)
voice_samples, conditioning_latents = [mic], None
if len(voices) == 1 or len(voices) == 0:
if voice == "microphone":
voice_samples, conditioning_latents = [c], None
else:
voice_samples, conditioning_latents = load_voice(voice)
else: else:
voice_samples, conditioning_latents = load_voices(voices) voice_samples, conditioning_latents = load_voice(voice)
if voice == "microphone":
voice_samples.extend([c]) if voice_samples is not None:
sample_voice = voice_samples[0]
sample_voice = voice_samples[0] if len(voice_samples) else None conditioning_latents = tts.get_conditioning_latents(voice_samples)
torch.save(conditioning_latents, os.path.join(f'./tortoise/voices/{voice}/', f'latents.pth'))
voice_samples = None
else:
sample_voice = None
if seed == 0: if seed == 0:
seed = None seed = None
start_time = time.time() start_time = time.time()
# >b-buh why not set samples and iterations to nullllll presets = {
# shut up 'ultra_fast': {'num_autoregressive_samples': 16, 'diffusion_iterations': 30, 'cond_free': False},
'fast': {'num_autoregressive_samples': 96, 'diffusion_iterations': 80},
'standard': {'num_autoregressive_samples': 256, 'diffusion_iterations': 200},
'high_quality': {'num_autoregressive_samples': 256, 'diffusion_iterations': 400},
'none': {'num_autoregressive_samples': num_autoregressive_samples, 'diffusion_iterations': diffusion_iterations},
}
settings = {
'temperature': temperature, 'length_penalty': 1.0, 'repetition_penalty': 2.0,
'top_p': .8,
'cond_free_k': 2.0, 'diffusion_temperature': 1.0,
if preset == "none": 'voice_samples': voice_samples,
gen, additionals = tts.tts_with_preset( 'conditioning_latents': conditioning_latents,
text, 'use_deterministic_seed': seed,
voice_samples=voice_samples, 'return_deterministic_state': True,
conditioning_latents=conditioning_latents, 'k': candidates,
preset="standard", 'progress': progress,
use_deterministic_seed=seed, }
return_deterministic_state=True, settings.update(presets[preset])
k=candidates, gen, additionals = tts.tts( text, **settings )
num_autoregressive_samples=num_autoregressive_samples, seed = additionals[0]
diffusion_iterations=diffusion_iterations,
temperature=temperature,
progress=progress
)
seed = additionals[0]
else:
gen, additionals = tts.tts_with_preset(
text,
voice_samples=voice_samples,
conditioning_latents=conditioning_latents,
preset=preset,
use_deterministic_seed=seed,
return_deterministic_state=True,
k=candidates,
temperature=temperature,
progress=progress
)
seed = additionals[0]
info = f"{datetime.now()} | Voice: {','.join(voices)} | Text: {text} | Quality: {preset} preset / {num_autoregressive_samples} samples / {diffusion_iterations} iterations | Temperature: {temperature} | Time Taken (s): {time.time()-start_time} | Seed: {seed}\n" info = f"{datetime.now()} | Voice: {','.join(voices)} | Text: {text} | Quality: {preset} preset / {num_autoregressive_samples} samples / {diffusion_iterations} iterations | Temperature: {temperature} | Time Taken (s): {time.time()-start_time} | Seed: {seed}\n"
with open("results.log", "a") as f: with open("results.log", "a") as f:
@ -89,24 +79,24 @@ def inference(text, emotion, prompt, voice, mic_audio, preset, seed, candidates,
if isinstance(gen, list): if isinstance(gen, list):
for j, g in enumerate(gen): for j, g in enumerate(gen):
torchaudio.save(os.path.join(outdir, f'result_{j}.wav'), g.squeeze(0).cpu(), 24000) torchaudio.save(os.path.join(outdir, f'result_{j}.wav'), g.squeeze(0).cpu(), 24000)
return (
(22050, sample_voice.squeeze().cpu().numpy()), output_voice = gen[0]
(24000, gen[0].squeeze().cpu().numpy()),
seed
)
else: else:
torchaudio.save(os.path.join(outdir, f'result.wav'), gen.squeeze(0).cpu(), 24000) torchaudio.save(os.path.join(outdir, f'result.wav'), gen.squeeze(0).cpu(), 24000)
return ( output_voice = gen
(22050, sample_voice.squeeze().cpu().numpy()),
(24000, gen.squeeze().cpu().numpy()), output_voice = (24000, output_voice.squeeze().cpu().numpy())
seed
) if sample_voice is not None:
sample_voice = (22050, sample_voice.squeeze().cpu().numpy())
return (
sample_voice,
output_voice,
seed
)
def main(): def main():
parser = argparse.ArgumentParser()
parser.add_argument("--share", action='store_true', help="Lets Gradio return a public URL to use anywhere")
args = parser.parse_args()
text = gr.Textbox(lines=4, label="Prompt") text = gr.Textbox(lines=4, label="Prompt")
emotion = gr.Radio( emotion = gr.Radio(
["None", "Happy", "Sad", "Angry", "Disgusted", "Arrogant", "Custom"], ["None", "Happy", "Sad", "Angry", "Disgusted", "Arrogant", "Custom"],
@ -158,11 +148,17 @@ def main():
temperature temperature
], ],
outputs=[selected_voice, output_audio, usedSeed], outputs=[selected_voice, output_audio, usedSeed],
allow_flagging=False allow_flagging='never'
) )
interface.queue().launch(share=args.share) interface.queue().launch(share=args.share)
if __name__ == "__main__": if __name__ == "__main__":
tts = TextToSpeech() parser = argparse.ArgumentParser()
parser.add_argument("--share", action='store_true', help="Lets Gradio return a public URL to use anywhere")
parser.add_argument("--low-vram", action='store_true', help="Disables some optimizations that increase VRAM usage")
args = parser.parse_args()
tts = TextToSpeech(minor_optimizations=not args.low_vram)
main() main()
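
The app.py change above replaces two near-identical `tts_with_preset` calls with a single settings dictionary that a preset then overrides. The sketch below isolates that merge so it can be run without TorToiSe installed. The preset values are copied from the diff; the function name `build_settings` and the `ValueError` on an unknown preset are assumptions added for illustration and are not part of the commit.

```python
def build_settings(preset, samples, iterations, temperature):
    """Merge user-facing values with a named preset (hypothetical helper)."""
    presets = {
        "ultra_fast": {"num_autoregressive_samples": 16, "diffusion_iterations": 30, "cond_free": False},
        "fast": {"num_autoregressive_samples": 96, "diffusion_iterations": 80},
        "standard": {"num_autoregressive_samples": 256, "diffusion_iterations": 200},
        "high_quality": {"num_autoregressive_samples": 256, "diffusion_iterations": 400},
        # "none" passes the caller's own slider values straight through.
        "none": {"num_autoregressive_samples": samples, "diffusion_iterations": iterations},
    }
    if preset not in presets:
        # Assumption: fail loudly instead of raising a bare KeyError.
        raise ValueError(f"unknown preset: {preset!r}")

    settings = {"temperature": temperature}
    # Keys from the preset win over anything already in settings.
    settings.update(presets[preset])
    return settings


if __name__ == "__main__":
    print(build_settings("fast", samples=32, iterations=50, temperature=0.2))
    print(build_settings("none", samples=32, iterations=50, temperature=0.2))
```

Because every preset, including `none`, resolves to the same keyword arguments, one call site can serve all cases.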


@ -11,4 +11,6 @@ librosa
torchaudio torchaudio
threadpoolctl threadpoolctl
appdirs appdirs
numpy
numba
gradio gradio

3
start.bat Executable file
View File

@ -0,0 +1,3 @@
call .\tortoise-venv\Scripts\activate.bat
py .\app.py
deactivate


@ -206,7 +206,7 @@ class TextToSpeech:
Main entry point into Tortoise. Main entry point into Tortoise.
""" """
def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, device=None): def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, device=None, minor_optimizations=True):
""" """
Constructor Constructor
:param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing
@ -218,6 +218,7 @@ class TextToSpeech:
Default is true. Default is true.
:param device: Device to use when running the model. If omitted, the device will be automatically chosen. :param device: Device to use when running the model. If omitted, the device will be automatically chosen.
""" """
self.minor_optimizations = minor_optimizations
self.models_dir = models_dir self.models_dir = models_dir
self.autoregressive_batch_size = pick_best_batch_size_for_gpu() if autoregressive_batch_size is None else autoregressive_batch_size self.autoregressive_batch_size = pick_best_batch_size_for_gpu() if autoregressive_batch_size is None else autoregressive_batch_size
self.enable_redaction = enable_redaction self.enable_redaction = enable_redaction
@ -243,6 +244,7 @@ class TextToSpeech:
layer_drop=0, unconditioned_percentage=0).cpu().eval() layer_drop=0, unconditioned_percentage=0).cpu().eval()
self.diffusion.load_state_dict(torch.load(get_model_path('diffusion_decoder.pth', models_dir))) self.diffusion.load_state_dict(torch.load(get_model_path('diffusion_decoder.pth', models_dir)))
self.clvp = CLVP(dim_text=768, dim_speech=768, dim_latent=768, num_text_tokens=256, text_enc_depth=20, self.clvp = CLVP(dim_text=768, dim_speech=768, dim_latent=768, num_text_tokens=256, text_enc_depth=20,
text_seq_len=350, text_heads=12, text_seq_len=350, text_heads=12,
num_speech_tokens=8192, speech_enc_depth=20, speech_heads=12, speech_seq_len=430, num_speech_tokens=8192, speech_enc_depth=20, speech_heads=12, speech_seq_len=430,
@ -258,11 +260,20 @@ class TextToSpeech:
self.rlg_auto = None self.rlg_auto = None
self.rlg_diffusion = None self.rlg_diffusion = None
if self.minor_optimizations:
self.autoregressive = self.autoregressive.to(self.device)
self.diffusion = self.diffusion.to(self.device)
self.clvp = self.clvp.to(self.device)
self.vocoder = self.vocoder.to(self.device)
def load_cvvp(self): def load_cvvp(self):
"""Load CVVP model.""" """Load CVVP model."""
self.cvvp = CVVP(model_dim=512, transformer_heads=8, dropout=0, mel_codes=8192, conditioning_enc_depth=8, cond_mask_percentage=0, self.cvvp = CVVP(model_dim=512, transformer_heads=8, dropout=0, mel_codes=8192, conditioning_enc_depth=8, cond_mask_percentage=0,
speech_enc_depth=8, speech_mask_percentage=0, latent_multiplier=1).cpu().eval() speech_enc_depth=8, speech_mask_percentage=0, latent_multiplier=1).cpu().eval()
self.cvvp.load_state_dict(torch.load(get_model_path('cvvp.pth', self.models_dir))) self.cvvp.load_state_dict(torch.load(get_model_path('cvvp.pth', self.models_dir)))
if self.minor_optimizations:
self.cvvp = self.cvvp.to(self.device)
def get_conditioning_latents(self, voice_samples, return_mels=False): def get_conditioning_latents(self, voice_samples, return_mels=False):
""" """
@ -279,11 +290,9 @@ class TextToSpeech:
voice_samples = [voice_samples] voice_samples = [voice_samples]
for vs in voice_samples: for vs in voice_samples:
auto_conds.append(format_conditioning(vs, device=self.device)) auto_conds.append(format_conditioning(vs, device=self.device))
auto_conds = torch.stack(auto_conds, dim=1)
self.autoregressive = self.autoregressive.to(self.device)
auto_latent = self.autoregressive.get_conditioning(auto_conds)
self.autoregressive = self.autoregressive.cpu()
auto_conds = torch.stack(auto_conds, dim=1)
diffusion_conds = [] diffusion_conds = []
for sample in voice_samples: for sample in voice_samples:
# The diffuser operates at a sample rate of 24000 (except for the latent inputs) # The diffuser operates at a sample rate of 24000 (except for the latent inputs)
@ -293,9 +302,18 @@ class TextToSpeech:
diffusion_conds.append(cond_mel) diffusion_conds.append(cond_mel)
diffusion_conds = torch.stack(diffusion_conds, dim=1) diffusion_conds = torch.stack(diffusion_conds, dim=1)
self.diffusion = self.diffusion.to(self.device)
diffusion_latent = self.diffusion.get_conditioning(diffusion_conds) if self.minor_optimizations:
self.diffusion = self.diffusion.cpu() auto_latent = self.autoregressive.get_conditioning(auto_conds)
diffusion_latent = self.diffusion.get_conditioning(diffusion_conds)
else:
self.autoregressive = self.autoregressive.to(self.device)
auto_latent = self.autoregressive.get_conditioning(auto_conds)
self.autoregressive = self.autoregressive.cpu()
self.diffusion = self.diffusion.to(self.device)
diffusion_latent = self.diffusion.get_conditioning(diffusion_conds)
self.diffusion = self.diffusion.cpu()
if return_mels: if return_mels:
return auto_latent, diffusion_latent, auto_conds, diffusion_conds return auto_latent, diffusion_latent, auto_conds, diffusion_conds
@ -413,7 +431,9 @@ class TextToSpeech:
num_batches = num_autoregressive_samples // self.autoregressive_batch_size num_batches = num_autoregressive_samples // self.autoregressive_batch_size
stop_mel_token = self.autoregressive.stop_mel_token stop_mel_token = self.autoregressive.stop_mel_token
calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output" calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
self.autoregressive = self.autoregressive.to(self.device)
if not self.minor_optimizations:
self.autoregressive = self.autoregressive.to(self.device)
for b in tqdm_override(range(num_batches), verbose=verbose, progress=progress, desc="Generating autoregressive samples"): for b in tqdm_override(range(num_batches), verbose=verbose, progress=progress, desc="Generating autoregressive samples"):
codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens, codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
@ -428,14 +448,18 @@ class TextToSpeech:
padding_needed = max_mel_tokens - codes.shape[1] padding_needed = max_mel_tokens - codes.shape[1]
codes = F.pad(codes, (0, padding_needed), value=stop_mel_token) codes = F.pad(codes, (0, padding_needed), value=stop_mel_token)
samples.append(codes) samples.append(codes)
self.autoregressive = self.autoregressive.cpu()
clip_results = [] clip_results = []
self.clvp = self.clvp.to(self.device)
if not self.minor_optimizations:
self.autoregressive = self.autoregressive.cpu()
self.clvp = self.clvp.to(self.device)
if cvvp_amount > 0: if cvvp_amount > 0:
if self.cvvp is None: if self.cvvp is None:
self.load_cvvp() self.load_cvvp()
self.cvvp = self.cvvp.to(self.device) if not self.minor_optimizations:
self.cvvp = self.cvvp.to(self.device)
desc="Computing best candidates" desc="Computing best candidates"
if verbose: if verbose:
@ -463,25 +487,34 @@ class TextToSpeech:
clip_results = torch.cat(clip_results, dim=0) clip_results = torch.cat(clip_results, dim=0)
samples = torch.cat(samples, dim=0) samples = torch.cat(samples, dim=0)
best_results = samples[torch.topk(clip_results, k=k).indices] best_results = samples[torch.topk(clip_results, k=k).indices]
self.clvp = self.clvp.cpu()
if self.cvvp is not None:
self.cvvp = self.cvvp.cpu() if not self.minor_optimizations:
self.clvp = self.clvp.cpu()
if self.cvvp is not None:
self.cvvp = self.cvvp.cpu()
del samples del samples
# The diffusion model actually wants the last hidden layer from the autoregressive model as conditioning # The diffusion model actually wants the last hidden layer from the autoregressive model as conditioning
# inputs. Re-produce those for the top results. This could be made more efficient by storing all of these # inputs. Re-produce those for the top results. This could be made more efficient by storing all of these
# results, but will increase memory usage. # results, but will increase memory usage.
self.autoregressive = self.autoregressive.to(self.device) if not self.minor_optimizations:
self.autoregressive = self.autoregressive.to(self.device)
best_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1), best_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results, torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results,
torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device), torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
return_latent=True, clip_inputs=False) return_latent=True, clip_inputs=False)
self.autoregressive = self.autoregressive.cpu()
if not self.minor_optimizations:
self.autoregressive = self.autoregressive.cpu()
self.diffusion = self.diffusion.to(self.device)
self.vocoder = self.vocoder.to(self.device)
del auto_conditioning del auto_conditioning
wav_candidates = [] wav_candidates = []
self.diffusion = self.diffusion.to(self.device)
self.vocoder = self.vocoder.to(self.device)
for b in range(best_results.shape[0]): for b in range(best_results.shape[0]):
codes = best_results[b].unsqueeze(0) codes = best_results[b].unsqueeze(0)
latents = best_latents[b].unsqueeze(0) latents = best_latents[b].unsqueeze(0)
@ -501,8 +534,10 @@ class TextToSpeech:
temperature=diffusion_temperature, verbose=verbose, progress=progress, desc="Transforming autoregressive outputs into audio..") temperature=diffusion_temperature, verbose=verbose, progress=progress, desc="Transforming autoregressive outputs into audio..")
wav = self.vocoder.inference(mel) wav = self.vocoder.inference(mel)
wav_candidates.append(wav.cpu()) wav_candidates.append(wav.cpu())
self.diffusion = self.diffusion.cpu()
self.vocoder = self.vocoder.cpu() if not self.minor_optimizations:
self.diffusion = self.diffusion.cpu()
self.vocoder = self.vocoder.cpu()
def potentially_redact(clip, text): def potentially_redact(clip, text):
if self.enable_redaction: if self.enable_redaction:
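
The `minor_optimizations` changes above all follow one pattern: with the flag on, each model is moved to the device once and stays there; with `--low-vram`, each model is moved to the device just before use and back to the CPU afterwards. The sketch below illustrates that trade-off without requiring PyTorch. `FakeModel`, `on_device`, `run_stages`, and `parse_args` are hypothetical stand-ins, and the transfer counter exists only to make the difference visible.

```python
import argparse
from contextlib import contextmanager


class FakeModel:
    """Stand-in for a torch module; it only tracks which device it is on."""

    def __init__(self, name):
        self.name = name
        self.device = "cpu"
        self.transfers = 0

    def to(self, device):
        if device != self.device:
            self.device = device
            self.transfers += 1
        return self

    def cpu(self):
        return self.to("cpu")


@contextmanager
def on_device(model, device, resident):
    """Yield the model on `device`; offload afterwards unless it is resident."""
    if not resident:
        model.to(device)
    try:
        yield model
    finally:
        if not resident:
            model.cpu()


def run_stages(models, device, resident, passes=1):
    """Use every model `passes` times and return the total transfer count."""
    if resident:
        for model in models:
            model.to(device)  # one-time upload, as in the constructor change
    for _ in range(passes):
        for model in models:
            with on_device(model, device, resident):
                pass  # inference would happen here
    return sum(model.transfers for model in models)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--low-vram", action="store_true")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args([])  # pass ["--low-vram"] to see the other mode
    names = ["autoregressive", "clvp", "diffusion", "vocoder"]
    models = [FakeModel(name) for name in names]
    total = run_stages(models, "cuda", resident=not args.low_vram, passes=3)
    print(f"device transfers: {total}")
```

Keeping models resident trades VRAM for fewer host-to-device copies, which is why the flag that disables it is named `--low-vram`.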

32
tortoise/utils/audio.py Normal file → Executable file

@ -97,20 +97,34 @@ def get_voices(extra_voice_dirs=[]):
return voices return voices
def load_voice(voice, extra_voice_dirs=[]): def load_voice(voice, extra_voice_dirs=[], load_latents=True):
if voice == 'random': if voice == 'random':
return None, None return None, None
voices = get_voices(extra_voice_dirs) voices = get_voices(extra_voice_dirs)
paths = voices[voice] paths = voices[voice]
if len(paths) == 1 and paths[0].endswith('.pth'):
return None, torch.load(paths[0]) mtime = 0
else: voices = []
conds = [] latent = None
for cond_path in paths: for file in paths:
c = load_audio(cond_path, 22050) if file[-4:] == ".pth":
conds.append(c) latent = file
return conds, None else:
voices.append(file)
mtime = max(mtime, os.path.getmtime(file))
if load_latents and latent is not None:
if os.path.getmtime(latent) > mtime:
print(f"Reading from latent: {latent}")
return None, torch.load(latent)
print(f"Latent file out of date: {latent}")
conds = []
for cond_path in voices:
c = load_audio(cond_path, 22050)
conds.append(c)
return conds, None
def load_voices(voices, extra_voice_dirs=[]): def load_voices(voices, extra_voice_dirs=[]):