diff --git a/README.md b/README.md
index ab3ce80..1ea6b8c 100755
--- a/README.md
+++ b/README.md
@@ -161,11 +161,14 @@ However, keep in mind how you combine/separate your clips; depending on the mode
 For safety, try to keep your clips around the same length, or increase your `Voice Latents Max Chunk Size` if the console warns that the best fit size exceeds it.
 
-If you're looking to trim your clips, in my opinion, ~~Audacity~~ Tenacity works good enough, as you can easily output your clips into the proper format (22050 Hz sampling rate).
+If you're looking to trim your clips, in my opinion, ~~Audacity~~ Tenacity works well enough, as you can also easily output your clips as WAVs.
 
 Power users with FFMPEG already installed can simply use the provided conversion script in `.\convert\`.
 
-After preparing your clips as WAV files at a sample rate of 22050 Hz, open up the `tortoise-tts` folder you're working in, navigate to the `voices` folder, create a new folder in whatever name you want, then dump your clips into that folder. While you're in the `voice` folder, you can take a look at the other provided voices.
+After preparing your clips as WAV files, you can use the web UI's import feature under `Utilities`, or:
+* navigate to the `voices` folder
+* create a new folder with whatever name you want
+* dump your clips into that folder.
 
 **!**NOTE**!**: Before 2023.02.10, voices used to be stored under `.\tortoise\voices\`, but they have been moved up one folder. Compatibility is maintained with the old voice folder, but the new location takes priority.
@@ -187,8 +190,8 @@ You'll be presented with a bunch of options in the default `Generate` tab, but d
 * `Microphone Source`: Use your own voice from a line-in source.
 * `Reload Voice List`: refreshes the voice list. ***Click this*** after adding or removing a voice.
 * `(Re)Compute Voice Latents`: regenerates a voice's cached latents.
-* `Experimental Compute Latents Mode`: this mode will adjust the behavior for computing voice latents. leave this checked if you're unsure
-  - I've left my comments on either modes in `./tortoise/api.py`, if you're curious
+* `Experimental Compute Latents Mode`: adjusts how voice latents are computed. Leave this checked if you're unsure, as it tends to improve voice replication.
+  - if you're curious, feel free to play around with it by regenerating latents with and without it.
 
 Below is a list of generation settings:
 * `Candidates`: number of outputs to generate, starting from the best candidate. Depending on your iteration steps, generating the final sound files could be cheap, but they only offer alternatives to the samples generated to pull from (in other words, the later candidates perform worse), so don't feel compelled to generate a ton of candidates.
@@ -205,18 +208,17 @@ Below are a list of generation settings:
 * `Diffusion Sampler`: sampler method during the diffusion pass. Currently, only `P` and `DDIM` are added, but neither seems to offer any substantial difference in my short tests. `P` refers to the default, vanilla sampling method in `diffusion.py`. To reiterate, this is ***only*** useful for the diffusion decoding path, after the autoregressive outputs are generated.
-
-Below are an explanation of experimental flags. Messing with these might impact performance, as these are exposed only if you know what you are doing.
-* `Half-Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, due to it making computations slower.
-* `Conditional Free`: a quality boosting improvement at the cost of some performance. Enabled by default, as I think the penaly is negligible in the end.
-* `CVVP Weight`: governs how much weight the CVVP model should influence candidates. The original documentation mentions this is deprecated as it does not really influence things, but you're still free to play around with it.
-  Currently, setting requires regenerating your voice latents, as I forgot to have it return some extra data that weighing against the CVVP model uses. Oops.
-  Setting this to 1 leads to bad behavior.
-* `Top P`: P value used in nucleus sampling; lower values mean the decoder produces more "likely" (aka boring) outputs.
-* `Diffusion Temperature`: the variance of the noise fed into the diffusion model; values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared.
-* `Length Penalty`: a length penalty applied to the autoregressive decoder; higher settings causes the model to produce more terse outputs.
-* `Repetition Penalty`: a penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc.
-* `Conditioning-Free K`: determintes balancing the conditioning free signal with the conditioning-present signal.
+* `Show Experimental Settings`: reveals a list of additional parameters you can play around with. These are hidden by default, as I really need to play around with them some more (the remarks below are mostly from the official documentation):
+  - `Half-Precision`: (attempts to) hint to PyTorch to auto-cast to float16 (half precision) for compute. Disabled by default, due to it making computations slower.
+  - `Conditional Free`: a quality-boosting improvement at the cost of some performance. Enabled by default, as I think the penalty is negligible in the end.
+  - `CVVP Weight`: governs how much the CVVP model should influence candidates. The original documentation mentions this is deprecated, as it does not really influence things, but you're still free to play around with it.
+    Currently, setting this requires regenerating your voice latents, as I forgot to have it return some extra data that weighing against the CVVP model uses. Oops.
+    Setting this to 1 leads to bad behavior.
+  - `Top P`: P value used in nucleus sampling; lower values mean the decoder produces more "likely" (aka boring) outputs. See the sketch below this list for what that means in practice.
+  - `Diffusion Temperature`: the variance of the noise fed into the diffusion model; values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared.
+  - `Length Penalty`: a length penalty applied to the autoregressive decoder; higher settings cause the model to produce more terse outputs.
+  - `Repetition Penalty`: a penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc.
+  - `Conditioning-Free K`: determines how to balance the conditioning-free signal against the conditioning-present signal.
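For the curious, `Top P` above boils down to nucleus sampling. The sketch below illustrates the general technique only; it is not the fork's actual decoder code, and the `top_p` default here is arbitrary:

```python
import torch

def nucleus_sample(logits, top_p=0.8):
    # sort candidate tokens from most to least likely
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    # keep the smallest set of tokens whose cumulative probability covers top_p
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = (cumulative - sorted_probs) < top_p  # the top token is always kept
    nucleus = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    nucleus = nucleus / nucleus.sum()
    # lower top_p -> smaller nucleus -> more "likely" (boring) choices
    return sorted_idx[torch.multinomial(nucleus, 1)]

print(nucleus_sample(torch.tensor([2.0, 1.0, 0.5, -1.0]), top_p=0.8))
```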
 
 After you fill everything out, click `Run`, and wait for your output in the output window. The sampled voice is also returned, but if you're using multiple files, it'll return the first file, rather than a combined file.
@@ -242,11 +244,13 @@ To reuse a voice file's settings, click `Copy Settings`.
 
 In this tab, you can find some helper utilities that might be of assistance.
 
-For now, an analog to the PNG info found in Voldy's Stable Diffusion Web UI resides here. With it, you can upload an audio file generated with this web UI to view the settings used to generate that output. Additionally, the voice latents used to generate the uploaded audio clip can be extracted.
-
-If you want to reuse its generation settings, simply click `Copy Settings`.
-
-To import a voice, click `Import Voice`. Remember to click `Refresh Voice List` in the `Generate` panel afterwards, if it's a new voice.
+This serves two purposes:
+* as a voice importer for normal WAVs:
+  - simply drag in an audio file you want to add as a voice, specify the voice name to save it under, then click `Import Voice`.
+  - if enabled and available, this will also attempt to clean up the sample by running it through `voicefixer` (if for some reason you need that).
+* as an analog to the PNG info tab in Voldy's Stable Diffusion Web UI, for viewing generation metadata from a sample generated with my fork:
+  - simply drag in a sound file generated through this fork, and it'll automatically grab the metadata and the voice latents used (if they were exported).
+  - to use that file's voice latents, simply click `Import Voice`, and they'll be saved to the voice folder specified (or the original voice, if none is specified).
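For reference, the metadata mentioned above is stored as JSON in the output WAV's `lyrics` tag, with the voice latents base64-encoded inside it, so it can also be inspected outside the web UI. A minimal sketch, where the result path is just a placeholder:

```python
import base64
import json
import music_tag

# read the generation settings this fork embeds in an output WAV's 'lyrics' tag
metadata = music_tag.load_file("./results/myvoice/myvoice_00001.wav")  # placeholder path
info = json.loads(str(metadata["lyrics"]))
print(info["text"], info["time"])

# if the voice latents were embedded too, write them back out as a .pth file
if "latents" in info:
    with open("cond_latents.pth", "wb") as f:
        f.write(base64.b64decode(info["latents"]))
```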
 
 ### Settings
 
diff --git a/tortoise/utils/audio.py b/tortoise/utils/audio.py
index 7f3c8a9..6de0858 100755
--- a/tortoise/utils/audio.py
+++ b/tortoise/utils/audio.py
@@ -9,38 +9,18 @@ from scipy.io.wavfile import read
 from tortoise.utils.stft import STFT
 
-
-if 'TORTOISE_VOICES_DIR' not in os.environ:
-    voice_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../voices')
-
-    if not os.path.exists(voice_dir):
-        voice_dir = os.path.dirname('./voices/')
-
-    os.environ['TORTOISE_VOICES_DIR'] = voice_dir
-
-BUILTIN_VOICES_DIR = os.environ.get('TORTOISE_VOICES_DIR')
-
-os.makedirs(BUILTIN_VOICES_DIR, exist_ok=True)
-
 def get_voice_dir():
-    return BUILTIN_VOICES_DIR
+    target = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../voices')
+    if not os.path.exists(target):
+        target = os.path.dirname('./voices/')
 
-def load_wav_to_torch(full_path):
-    sampling_rate, data = read(full_path)
-    if data.dtype == np.int32:
-        norm_fix = 2 ** 31
-    elif data.dtype == np.int16:
-        norm_fix = 2 ** 15
-    elif data.dtype == np.float16 or data.dtype == np.float32:
-        norm_fix = 1.
-    else:
-        raise NotImplemented(f"Provided data dtype not supported: {data.dtype}")
-    return (torch.FloatTensor(data.astype(np.float32)) / norm_fix, sampling_rate)
+    os.makedirs(target, exist_ok=True)
+    return target
 
 def load_audio(audiopath, sampling_rate):
     if audiopath[-4:] == '.wav':
-        audio, lsr = load_wav_to_torch(audiopath)
+        audio, lsr = torchaudio.load(audiopath)
     elif audiopath[-4:] == '.mp3':
         audio, lsr = librosa.load(audiopath, sr=sampling_rate)
         audio = torch.FloatTensor(audio)
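A note on the `torchaudio.load` swap in the hunk above: with default arguments it already returns `float32` samples normalized to `[-1, 1]`, which is what the removed `load_wav_to_torch` computed by hand from the integer PCM dtypes. A quick way to convince yourself, with a placeholder clip path:

```python
import torchaudio

# torchaudio.load defaults to normalize=True: integer PCM comes back as
# float32 in [-1, 1], matching load_wav_to_torch's manual norm_fix division
audio, sample_rate = torchaudio.load("./voices/myvoice/clip.wav")  # placeholder path
print(audio.dtype, audio.abs().max().item(), sample_rate)
```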
@@ -98,7 +78,7 @@ def dynamic_range_decompression(x, C=1):
 
 def get_voices(extra_voice_dirs=[]):
-    dirs = [BUILTIN_VOICES_DIR] + extra_voice_dirs
+    dirs = [get_voice_dir()] + extra_voice_dirs
     voices = {}
     for d in dirs:
         subs = os.listdir(d)
@@ -135,11 +115,11 @@ def load_voice(voice, extra_voice_dirs=[], load_latents=True, sample_rate=22050,
             return None, torch.load(latent, map_location=device)
         print(f"Latent file out of date: {latent}")
 
-    conds = []
-    for cond_path in voices:
-        c = load_audio(cond_path, sample_rate)
-        conds.append(c)
-    return conds, None
+    samples = []
+    for path in voices:
+        c = load_audio(path, sample_rate)
+        samples.append(c)
+    return samples, None
 
 def load_voices(voices, extra_voice_dirs=[]):
diff --git a/webui.py b/webui.py
index 8f9a84c..10619ec 100755
--- a/webui.py
+++ b/webui.py
@@ -71,7 +71,7 @@ def generate(
     voice_samples, conditioning_latents = load_voice(voice)
 
     if voice_samples is not None:
-        sample_voice = voice_samples[0].squeeze().cpu()
+        sample_voice = torch.cat(voice_samples, dim=-1).squeeze().cpu()
 
         conditioning_latents = tts.get_conditioning_latents(voice_samples, return_mels=not args.latents_lean_and_mean, progress=progress, max_chunk_size=args.cond_latent_max_chunk_size)
         if len(conditioning_latents) == 4:
@@ -81,7 +81,11 @@ def generate(
             torch.save(conditioning_latents, f'{get_voice_dir()}/{voice}/cond_latents.pth')
         voice_samples = None
     else:
-        sample_voice = None
+        if conditioning_latents is not None:
+            sample_voice, _ = load_voice(voice, load_latents=False)
+            sample_voice = torch.cat(sample_voice, dim=-1).squeeze().cpu()
+        else:
+            sample_voice = None
 
     if seed == 0:
         seed = None
@@ -151,9 +155,13 @@ def generate(
             if file[-5:] == ".json":
                 idx = idx + 1
 
-    # reserve, if for whatever reason you manage to concurrently generate
-    with open(f'{outdir}/input_{idx}.json', 'w', encoding="utf-8") as f:
-        f.write(" ")
+    # zero-pad the index to (at least) three digits so filenames sort lexicographically
+    idx = f"{idx:03d}"
 
     def get_name(line=0, candidate=0, combined=False):
         name = f"{idx}"
@@ -206,7 +214,6 @@ def generate(
             audio_cache[k]['audio'] = audio
             torchaudio.save(f'{outdir}/{voice}_{k}.wav', audio, args.output_sample_rate)
 
-    output_voice = None
     output_voices = []
     for candidate in range(candidates):
         if len(texts) > 1:
@@ -224,15 +231,12 @@ def generate(
                 audio_cache[name] = {
                     'audio': audio,
                     'text': text,
-                    'time': time.time()-full_start_time
+                    'time': time.time()-full_start_time,
+                    'output': True
                 }
-
-                output_voices.append(f'{outdir}/{voice}_{name}.wav')
-                if output_voice is None:
-                    output_voice = f'{outdir}/{voice}_{name}.wav'
             else:
                 name = get_name(candidate=candidate)
-                output_voices.append(f'{outdir}/{voice}_{name}.wav')
+                audio_cache[name]['output'] = True
 
     info = {
         'text': text,
@@ -257,13 +261,15 @@ def generate(
         'experimentals': experimental_checkboxes,
         'time': time.time()-full_start_time,
     }
-
-    with open(f'{outdir}/input_{idx}.json', 'w', encoding="utf-8") as f:
-        f.write(json.dumps(info, indent='\t') )
-
-    if voice is not None and conditioning_latents is not None:
-        with open(f'{get_voice_dir()}/{voice}/cond_latents.pth', 'rb') as f:
-            info['latents'] = base64.b64encode(f.read()).decode("ascii")
+    # collect the cache entries flagged for output, and write each one's settings JSON beside it
+    for name in audio_cache:
+        if 'output' not in audio_cache[name]:
+            continue
+
+        output_voices.append(f'{outdir}/{voice}_{name}.wav')
+        with open(f'{outdir}/{voice}_{name}.json', 'w', encoding="utf-8") as f:
+            f.write(json.dumps(info, indent='\t'))
 
     if args.voice_fixer and voicefixer:
         # we could do this on the pieces before they get stitched up anyway, to save some compute
@@ -276,6 +282,10 @@ def generate(
             #mode=mode,
         )
 
+    if voice is not None and conditioning_latents is not None:
+        with open(f'{get_voice_dir()}/{voice}/cond_latents.pth', 'rb') as f:
+            info['latents'] = base64.b64encode(f.read()).decode("ascii")
+
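The added block above is what makes latents recoverable later from the `Utilities` tab: the raw bytes of `cond_latents.pth` travel inside the metadata JSON as base64 text. The round trip is nothing more exotic than the following sketch, with placeholder paths:

```python
import base64

# encode: what generate() stores in info['latents']
with open("cond_latents.pth", "rb") as f:  # placeholder path
    encoded = base64.b64encode(f.read()).decode("ascii")

# decode: what the importer writes back out to the voice folder
with open("cond_latents_restored.pth", "wb") as f:
    f.write(base64.b64decode(encoded))
```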
     if args.embed_output_metadata:
         for path in progress.tqdm(audio_cache, desc="Embedding metadata..."):
             info['text'] = audio_cache[path]['text']
@@ -284,13 +294,12 @@ def generate(
             metadata = music_tag.load_file(f"{outdir}/{voice}_{path}.wav")
             metadata['lyrics'] = json.dumps(info)
             metadata.save()
-
-    #if output_voice is not None:
-    #    output_voice = (args.output_sample_rate, output_voice.numpy())
 
     if sample_voice is not None:
         sample_voice = (tts.input_sample_rate, sample_voice.numpy())
 
+    print(info['time'])
+    print(output_voices)
     print(f"Generation took {info['time']} seconds, saved to '{output_voices[0]}'\n")
 
     info['seed'] = settings['use_deterministic_seed']
@@ -345,7 +354,7 @@ def update_presets(value):
     else:
         return (gr.update(), gr.update())
 
-def read_generate_settings(file, save_latents=True, save_as_temp=True):
+def read_generate_settings(file, read_latents=True):
     j = None
     latents = None
@@ -362,31 +371,78 @@ def read_generate_settings(file, save_latents=True, save_as_temp=True):
                 j = json.load(f)
 
     if j is None:
-        raise gr.Error("No metadata found in audio file to read")
-
-    if 'latents' in j and save_latents:
-        latents = base64.b64decode(j['latents'])
-        del j['latents']
+        # not fatal here: plain WAVs imported as voices won't carry metadata
+        print("No metadata found in audio file to read")
+    else:
+        if 'latents' in j:
+            if read_latents:
+                latents = base64.b64decode(j['latents'])
+            del j['latents']
 
-    if latents and save_latents:
-        outdir=f'{get_voice_dir()}/{".temp" if save_as_temp else j["voice"]}/'
-        os.makedirs(outdir, exist_ok=True)
-        with open(f'{outdir}/cond_latents.pth', 'wb') as f:
-            f.write(latents)
-        latents = f'{outdir}/cond_latents.pth'
-
-    if "time" in j:
-        j["time"] = "{:.3f}".format(j["time"])
+        if "time" in j:
+            j["time"] = "{:.3f}".format(j["time"])
 
     return (
         j,
-        latents
+        latents,
     )
 
-def save_latents(file):
-    read_generate_settings(file, save_latents=True, save_as_temp=False)
+
+def import_voice(file, saveAs=None):
+    j, latents = read_generate_settings(file, read_latents=True)
+
+    if j is not None and saveAs is None:
+        saveAs = j['voice']
+    if saveAs is None or saveAs == "":
+        raise gr.Error("Specify a voice name")
+
+    outdir = f'{get_voice_dir()}/{saveAs}/'
+    os.makedirs(outdir, exist_ok=True)
+    if latents:
+        # importing from a generated sample: write its embedded latents into the voice folder
+        with open(f'{outdir}/cond_latents.pth', 'wb') as f:
+            f.write(latents)
+        latents = f'{outdir}/cond_latents.pth'
+        print(f"Imported latents to {latents}")
+    else:
+        # importing a plain WAV as a new voice sample
+        filename = file.name
+        if filename[-4:] != ".wav":
+            raise gr.Error("Please convert to a WAV first")
+
+        path = f"{outdir}/{os.path.basename(filename)}"
+        waveform, sampling_rate = torchaudio.load(filename)
+
+        if args.voice_fixer:
+            # resample to the best bandwidth, since voicefixer will do it anyway through librosa
+            if sampling_rate != 44100:
+                print(f"Resampling imported voice sample: {path}")
+                resampler = torchaudio.transforms.Resample(
+                    sampling_rate,
+                    44100,
+                    lowpass_filter_width=16,
+                    rolloff=0.85,
+                    resampling_method="kaiser_window",
+                    beta=8.555504641634386,
+                )
+                waveform = resampler(waveform)
+                sampling_rate = 44100
+
+            torchaudio.save(path, waveform, sampling_rate)
+
+            print(f"Running 'voicefixer' on voice sample: {path}")
+            voicefixer.restore(
+                input=path,
+                output=path,
+                cuda=get_device_name() == "cuda" and args.voice_fixer_use_cuda,
+                #mode=mode,
+            )
+        else:
+            torchaudio.save(path, waveform, sampling_rate)
+
+        print(f"Imported voice to {path}")
+
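If you prefer to prepare clips by hand instead of going through `import_voice` above, the same `torchaudio` calls work standalone. A minimal sketch — the paths are placeholders, and 22050 Hz here just matches `load_voice`'s default sample rate rather than the importer's 44100 Hz `voicefixer` target:

```python
import torchaudio

# load a clip and resample it before dropping it into a voice folder
waveform, sample_rate = torchaudio.load("./input/clip.wav")  # placeholder path
if sample_rate != 22050:
    waveform = torchaudio.transforms.Resample(sample_rate, 22050)(waveform)
torchaudio.save("./voices/myvoice/clip.wav", waveform, 22050)
```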
 
 def import_generate_settings(file="./config/generate.json"):
-    settings, _ = read_generate_settings(file, save_latents=False)
+    settings, _ = read_generate_settings(file, read_latents=False)
     if settings is None:
         return None
@@ -688,6 +744,7 @@ def setup_gradio():
                     )
 
                     show_experimental_settings = gr.Checkbox(label="Show Experimental Settings")
+                    reset_generation_settings_button = gr.Button(value="Reset to Default")
                 with gr.Column(visible=False) as col:
                     experimental_column = col
@@ -756,7 +813,7 @@ def setup_gradio():
                         if file[-4:] != ".wav":
                             continue
 
-                        metadata, _ = read_generate_settings(f"{outdir}/{file}", save_latents=False)
+                        metadata, _ = read_generate_settings(f"{outdir}/{file}", read_latents=False)
                         if metadata is None:
                             continue
@@ -797,23 +854,45 @@ def setup_gradio():
                 with gr.Column():
                     audio_in = gr.File(type="file", label="Audio Input", file_types=["audio"])
                     copy_button = gr.Button(value="Copy Settings")
-                    import_voice = gr.Button(value="Import Voice")
+                    import_voice_name = gr.Textbox(label="Voice Name")
+                    import_voice_button = gr.Button(value="Import Voice")
                 with gr.Column():
                     metadata_out = gr.JSON(label="Audio Metadata")
                     latents_out = gr.File(type="binary", label="Voice Latents")
 
+            def read_generate_settings_proxy(file, saveAs='.temp'):
+                j, latents = read_generate_settings(file)
+
+                if latents:
+                    # stash the recovered latents in a temp voice folder so they can be offered for download
+                    outdir = f'{get_voice_dir()}/{saveAs}/'
+                    os.makedirs(outdir, exist_ok=True)
+                    with open(f'{outdir}/cond_latents.pth', 'wb') as f:
+                        f.write(latents)
+
+                    latents = f'{outdir}/cond_latents.pth'
+
+                return (
+                    j,
+                    gr.update(value=latents, visible=latents is not None),
+                    None if j is None else j['voice']
+                )
+
             audio_in.upload(
-                fn=read_generate_settings,
+                fn=read_generate_settings_proxy,
                 inputs=audio_in,
                 outputs=[
                     metadata_out,
-                    latents_out
+                    latents_out,
+                    import_voice_name
                 ]
             )
 
-            import_voice.click(
-                fn=save_latents,
-                inputs=audio_in,
+            import_voice_button.click(
+                fn=import_voice,
+                inputs=[
+                    audio_in,
+                    import_voice_name,
+                ]
             )
         with gr.Tab("Settings"):
             with gr.Row():
@@ -956,6 +1035,17 @@ def setup_gradio():
             outputs=input_settings
         )
 
+        def reset_generation_settings():
+            # wipe the saved config; import_generate_settings() then falls back to defaults
+            with open('./config/generate.json', 'w', encoding="utf-8") as f:
+                f.write(json.dumps({}, indent='\t'))
+            return import_generate_settings()
+
+        reset_generation_settings_button.click(
+            fn=reset_generation_settings,
+            inputs=None,
+            outputs=input_settings
+        )
+
         def history_copy_settings( voice, file ):
             settings = import_generate_settings( f"./results/{voice}/{file}" )
             return settings
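Finally, for anyone unfamiliar with the Gradio idiom used throughout `setup_gradio()` — declare components inside a `Blocks` context, then wire them with `.click(fn, inputs, outputs)` — here is a self-contained miniature of the pattern; all component names and the echo function are illustrative, not part of the fork:

```python
import gradio as gr

# miniature of the declare-then-wire pattern used throughout setup_gradio()
with gr.Blocks() as demo:
    voice_name = gr.Textbox(label="Voice Name")  # illustrative stand-ins
    metadata = gr.JSON(label="Audio Metadata")
    button = gr.Button(value="Echo")

    # clicking the button routes the textbox value through fn into the JSON box
    button.click(fn=lambda name: {"voice": name}, inputs=voice_name, outputs=metadata)

demo.launch()
```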