James Betker
58019a2ce3
audio diffusion fid updates
2022-03-03 21:53:32 -07:00
James Betker
998c53ad4f
w2v_matcher mods
2022-03-03 21:52:51 -07:00
James Betker
9029e4f20c
Add a base-wrapper
2022-03-03 21:52:28 -07:00
James Betker
6873ad6660
Support functionality
2022-03-03 21:52:16 -07:00
James Betker
6af5d129ce
Add experimental gradient boosting into tts7
2022-03-03 21:51:40 -07:00
James Betker
7ea84f1ac3
asdf
2022-03-03 13:43:44 -07:00
James Betker
3cd6c7f428
Get rid of unused codes in vq
2022-03-03 13:41:38 -07:00
James Betker
619da9ea28
Get rid of discretization loss
2022-03-03 13:36:25 -07:00
James Betker
beb7c8a39d
asdf
2022-03-01 21:41:31 -07:00
James Betker
70fa780edb
Add mechanism to export grad norms
2022-03-01 20:19:52 -07:00
James Betker
d9f8f92840
Codified fp16
2022-03-01 15:46:04 -07:00
James Betker
45ab444c04
Rework minicoder to always checkpoint
2022-03-01 14:09:18 -07:00
James Betker
db0c3340ac
Implement guidance-free diffusion in eval
And a few other fixes
2022-03-01 11:49:36 -07:00
James Betker
2134f06516
Implement conditioning-free diffusion at the eval level
2022-02-27 15:11:42 -07:00
James Betker
436fe24822
Add conditioning-free guidance
2022-02-27 15:00:06 -07:00
James Betker
ac920798bb
misc
2022-02-27 14:49:11 -07:00
James Betker
dbc74e96b2
w2v_matcher
2022-02-27 14:48:23 -07:00
James Betker
42879d7296
w2v_wrapper ramping dropout mode
this is an experimental feature that needs some testing
2022-02-27 14:47:51 -07:00
James Betker
c375287db9
Re-instate autocasting
2022-02-25 11:06:18 -07:00
James Betker
34ee32a90e
get rid of autocasting in tts7
2022-02-24 21:53:51 -07:00
James Betker
ea500ad42a
Use clustered masking in udtts7
2022-02-24 07:57:26 -07:00
James Betker
7201b4500c
default text_to_sequence cleaners
2022-02-21 19:14:22 -07:00
James Betker
ba7f54c162
w2v: new inference function
2022-02-21 19:13:03 -07:00
James Betker
38802a96c8
remove timesteps from cond calculation
2022-02-21 12:32:21 -07:00
James Betker
668876799d
unet_diffusion_tts7
2022-02-20 15:22:38 -07:00
James Betker
0872e17e60
unified_voice mods
2022-02-19 20:37:35 -07:00
James Betker
7b12799370
Reformat mel_text_clip for use in eval
2022-02-19 20:37:26 -07:00
James Betker
baf7b65566
Attempt to make w2v play with DDP AND checkpointing
2022-02-18 18:47:11 -07:00
James Betker
f3776f1992
reset ctc loss from "mean" to "sum"
2022-02-17 22:00:58 -07:00
James Betker
2b20da679c
make spec_augment a parameter
2022-02-17 20:22:05 -07:00
James Betker
e1d71e1bd5
w2v_wrapper: get rid of ctc attention mask
2022-02-15 20:54:40 -07:00
James Betker
79e8f36d30
Convert CLIP models into new folder
2022-02-15 20:53:07 -07:00
James Betker
2bdb515068
A few mods to make wav2vec2 trainable with DDP on DLAS
2022-02-15 06:28:54 -07:00
James Betker
52b61b9f77
Update scripts and attempt to figure out how UnifiedVoice could be used to produce CTC codes
2022-02-13 20:48:06 -07:00
James Betker
a4f1641eea
Add & refine WER evaluator for w2v
2022-02-13 20:47:29 -07:00
James Betker
29534180b2
w2v fine tuner
2022-02-12 20:00:59 -07:00
James Betker
3252972057
ctc_code_gen mods
2022-02-12 19:59:54 -07:00
James Betker
302ac8652d
Undo mask during training
2022-02-11 09:35:12 -07:00
James Betker
618a20412a
new rev of ctc_code_gen with surrogate LM loss
2022-02-10 23:09:57 -07:00
James Betker
820a29f81e
ctc code gen mods
2022-02-10 09:44:01 -07:00
James Betker
ac9417b956
ctc_code_gen: mask out all padding tokens
2022-02-09 17:26:30 -07:00
James Betker
ddb77ef502
ctc_code_gen: use a mean() on the ConditioningEncoder
2022-02-09 14:26:44 -07:00
James Betker
9e9ae328f2
mild updates
2022-02-08 23:51:17 -07:00
James Betker
ff35d13b99
Use non-uniform noise in diffusion_tts6
2022-02-08 07:27:41 -07:00
James Betker
34fbb78671
Straight CtcCodeGenerator as an encoder
2022-02-07 15:46:46 -07:00
James Betker
65a546c4d7
Fix for tts6
2022-02-05 16:00:14 -07:00
James Betker
5ae816bead
ctc gen checkin
2022-02-05 15:59:53 -07:00
James Betker
bb3d1ab03d
More cleanup
2022-02-04 11:06:17 -07:00
James Betker
5cc342de66
Clean up
2022-02-04 11:00:42 -07:00
James Betker
8fb147e8ab
add an autoregressive ctc code generator
2022-02-04 11:00:15 -07:00
James Betker
7f4fc55344
Update SR model
2022-02-03 21:42:53 -07:00
James Betker
bc506d4bcd
Mods to unet_diffusion_tts6 to support super resolution mode
2022-02-03 19:59:39 -07:00
James Betker
4249681c4b
Mods to support an autoregressive CTC code generator
2022-02-03 19:58:54 -07:00
James Betker
8132766d38
tts6
2022-01-31 20:15:06 -07:00
James Betker
fbea6e8eac
Adjustments to diffusion networks
2022-01-30 16:14:06 -07:00
James Betker
e58dab14c3
new diffusion updates from testing
2022-01-29 11:01:01 -07:00
James Betker
935a4e853e
get rid of nil tokens in <2>
2022-01-27 22:45:57 -07:00
James Betker
a77d376ad2
rename unet diffusion tts and add 3
2022-01-27 19:56:24 -07:00
James Betker
8c255811ad
more fixes
2022-01-25 17:57:16 -07:00
James Betker
0f3ca28e39
Allow diffusion model to be trained with masking tokens
2022-01-25 14:26:21 -07:00
James Betker
d18aec793a
Revert "(re) attempt diffusion checkpointing logic"
This reverts commit b22eec8fe3.
2022-01-22 09:14:50 -07:00
James Betker
b22eec8fe3
(re) attempt diffusion checkpointing logic
2022-01-22 08:34:40 -07:00
James Betker
8f48848f91
misc
2022-01-22 08:23:29 -07:00
James Betker
851070075a
text<->cond clip
I need that universal clip..
2022-01-22 08:23:14 -07:00
James Betker
8ada52ccdc
Update LR layers to checkpoint better
2022-01-22 08:22:57 -07:00
James Betker
8e2439f50d
Decrease resolution requirements to 2048
2022-01-20 11:27:49 -07:00
James Betker
4af8525dc3
Adjust diffusion vocoder to allow training individual levels
2022-01-19 13:37:59 -07:00
James Betker
ac13bfefe8
use_diffuse_tts
2022-01-19 00:35:24 -07:00
James Betker
bcd8cc51e1
Enable collated data for diffusion purposes
2022-01-19 00:35:08 -07:00
James Betker
dc9cd8c206
Update use_gpt_tts to be usable with unified_voice2
2022-01-18 21:14:17 -07:00
James Betker
7b4544b83a
Add an experimental unet_diffusion_tts to perform experiments on
2022-01-18 08:38:24 -07:00
James Betker
37e4e737b5
a few fixes
2022-01-16 15:17:17 -07:00
James Betker
9100e7fa9b
Add a diffusion network that takes aligned text instead of MELs
2022-01-15 17:28:02 -07:00
James Betker
009a1e8404
Add a new diffusion_vocoder that should be trainable faster
This new one has a "cheating" top layer, that does not feed down into the unet encoder,
but does consume the outputs of the unet. This cheater only operates on half of the input,
while the rest of the unet operates on the full input. This limits the dimensionality of this last
layer, on the assumption that these last layers consume by far the most computation and memory,
but do not require the full input context.
Losses are only computed on half of the aggregate input.
2022-01-11 17:26:07 -07:00
James Betker
91f28580e2
fix unified_voice
2022-01-10 16:17:31 -07:00
James Betker
136744dc1d
Fixes
2022-01-10 14:32:04 -07:00
James Betker
ee3dfac2ae
unified_voice2: decouple positional embeddings and token embeddings from underlying gpt model
2022-01-10 08:14:41 -07:00
James Betker
f503d8d96b
Partially implement performers in transformer_builders
2022-01-09 22:35:03 -07:00
James Betker
ec456b6733
Revert unified_voice back to beginning
I'll be doing my work within unified_voice2
2022-01-09 22:34:30 -07:00
James Betker
432073c5ca
Make performer code functional
2022-01-09 22:32:50 -07:00
James Betker
f474a7ac65
unified_voice2
2022-01-09 22:32:34 -07:00
James Betker
c075fe72e2
import performer repo
2022-01-09 22:10:07 -07:00
James Betker
7de3874f15
Make dalle transformer checkpointable
2022-01-09 19:14:35 -07:00
James Betker
70b17da193
Alter unified_voice to use extensible transformer (still WIP)
2022-01-08 22:18:25 -07:00
James Betker
15d9517e26
Allow bi-directional clipping
2022-01-08 22:18:04 -07:00
James Betker
8bade38180
Add generic CLIP model based off of x_clip
2022-01-08 19:08:01 -07:00
James Betker
438dd9ed33
fix text-voice-clip bug
2022-01-08 08:55:00 -07:00
James Betker
34774f9948
unified_voice: begin decoupling from HF GPT
I'd like to try some different (newer) transformer variants. The way to get
there is softly decoupling the transformer portion of this architecture
from GPT. This actually should be fairly easy.
2022-01-07 22:51:24 -07:00
James Betker
68090ac3e9
Finish up the text->voice clip model
2022-01-07 22:28:45 -07:00
James Betker
65ffe38fce
misc
2022-01-06 22:16:17 -07:00
James Betker
e7a705fe6e
Make gpt_asr_hf2 more efficient at inference
2022-01-06 10:27:10 -07:00
James Betker
525addffab
Unified: automatically clip inputs according to specified max length to improve inference time
2022-01-06 10:13:45 -07:00
James Betker
61cd351b71
update unified
2022-01-06 09:48:11 -07:00
James Betker
10fd1110be
Fix (?) use_gpt_tts for unified_voice
2022-01-05 20:09:31 -07:00
James Betker
3c4301f085
Remove dvae_arch_playground
2022-01-05 17:06:45 -07:00
James Betker
a63a17e48f
Remove deepspeech models
2022-01-05 17:05:13 -07:00
James Betker
c584ba05ee
unified_voice improvements
- Rename max_symbols_per_phrase to max_text_tokens
- Remove max_total_tokens (no longer necessary)
- Fix integration with MelEncoder
2022-01-05 17:03:53 -07:00
James Betker
38aba6f88d
Another dumdum fix
2022-01-04 15:18:25 -07:00
James Betker
963c6072bb
Add mel_encoder and solo embeddings to unified_voice
2022-01-04 15:15:58 -07:00
James Betker
2165124f19
Add GPT documentation
2022-01-01 21:00:07 -07:00
James Betker
2635412291
doh
2022-01-01 14:29:59 -07:00
James Betker
d4a6298658
more debugging
2022-01-01 14:25:27 -07:00
James Betker
d8111e0477
misc
2022-01-01 14:05:33 -07:00
James Betker
dc535b5358
better bounds
2022-01-01 14:05:22 -07:00
James Betker
fe9ea4e01a
auto-fix text_inputs too big
2022-01-01 13:25:47 -07:00
James Betker
bbacffb790
dataset improvements and fix to unified_voice_Bilevel
2022-01-01 00:16:30 -07:00
James Betker
eda753e776
Allow conditioning shuffling to be disabled
2021-12-31 23:32:08 -07:00
James Betker
9aa06542cd
Further reduce the complexity of the MEL encoder in GptAsrHf
2021-12-30 09:10:40 -07:00
James Betker
5ae7e0d9b0
Fix gapping bug in voice2voice clip
2021-12-29 14:44:46 -07:00
James Betker
b12f47b36d
Add some noise to voice_voice_clip
2021-12-29 13:56:30 -07:00
James Betker
b24a51f0aa
Check in speech2speech CLIP inference tool
2021-12-29 00:19:44 -07:00
James Betker
c1bef01dfa
GptAsrHf2 checkin
2021-12-28 20:48:38 -07:00
James Betker
07c2b9907c
Add voice2voice clip model
2021-12-28 16:18:12 -07:00
James Betker
a9ee5b624f
Simplify and conform gpt_asr_hf2
2021-12-28 11:54:33 -07:00
James Betker
a5b4bee719
Improve asr_eval
2021-12-28 11:45:15 -07:00
James Betker
312f631c5b
gpt_asr_hf2: remove dual positional embeddings
2021-12-28 10:57:45 -07:00
James Betker
a12042ea99
Allow multi-embeddings to be disabled
2021-12-28 09:00:53 -07:00
James Betker
a698d3f525
unified_voice: introduce paired embeddings
2021-12-26 15:33:05 -07:00
James Betker
6996dfd9d5
asr_hf2: add independent position embedders
2021-12-26 15:17:24 -07:00
James Betker
5b5cbc057c
Work checkpoint for gpt asr hf2
2021-12-26 10:29:12 -07:00
James Betker
cd89e6b42e
Initialize our embeddings the same way GPT-2 initializes theirs.
2021-12-26 00:20:30 -07:00
James Betker
8d01f7685c
Get rid of absolute positional embeddings in unifiedvoice
2021-12-26 00:10:24 -07:00
James Betker
6700f8851d
moar verbosity
2021-12-25 23:23:21 -07:00
James Betker
8acf3b3097
Better dimensional asserting
2021-12-25 23:18:25 -07:00
James Betker
e959541494
Add position embeddings back into unified_voice
I think this may be the solution behind the day's problems.
2021-12-25 23:10:56 -07:00
James Betker
ab9cafa572
Make tokenization configs more configurable
2021-12-25 12:17:50 -07:00
James Betker
52410fd9d9
256-bpe tokenizer
2021-12-25 08:52:08 -07:00
James Betker
8e26400ce2
Add inference for unified gpt
2021-12-24 13:27:06 -07:00
James Betker
8b19c37409
UnifiedGptVoice!
2021-12-23 15:20:26 -07:00
James Betker
e55d949855
GrandConjoinedDataset
2021-12-23 14:32:33 -07:00
James Betker
c737632eae
Train and use a bespoke tokenizer
2021-12-22 15:06:14 -07:00
James Betker
66bc60aeff
Re-add start_text_token
2021-12-22 14:10:35 -07:00
James Betker
a9629f7022
Try out using the GPT tokenizer rather than nv_tacotron
This results in a significant compression of the text domain; I'm curious what the
effect on speech quality will be.
2021-12-22 14:03:18 -07:00
James Betker
7ae7d423af
VoiceCLIP model
2021-12-22 13:44:11 -07:00
James Betker
09f7f3e615
Remove obsolete lucidrains DALLE stuff, re-create it in a dedicated folder
2021-12-22 13:44:02 -07:00
James Betker
a42b94ab72
gpt_tts_hf inference fixes
2021-12-22 13:22:15 -07:00
James Betker
48e3ee9a5b
Shuffle conditioning inputs along the positional axis to reduce fitting on prosody and other positional information
The mels should still retain some short-range positional information the model can use
for tone and frequencies, for example.
2021-12-20 19:05:56 -07:00
James Betker
53858b2055
Fix gpt_tts_hf inference
2021-12-20 17:45:26 -07:00
James Betker
712d746e9b
gpt_tts: format conditioning inputs more for contextual voice clues and less for prosody
also support single conditional inputs
2021-12-19 17:42:29 -07:00
James Betker
c813befd53
Remove dedicated positioning embeddings
2021-12-19 09:01:31 -07:00
James Betker
b4ddcd7111
More inference improvements
2021-12-19 09:01:19 -07:00
James Betker
f9c45d70f0
Fix mel terminator
2021-12-18 17:18:06 -07:00
James Betker
937045cb63
Fixes
2021-12-18 16:45:38 -07:00
James Betker
9b9f7ea61b
GptTtsHf: Make the input/target placement easier to reason about
2021-12-17 10:24:14 -07:00
James Betker
2fb4213a3e
More lossy fixes
2021-12-17 10:01:42 -07:00
James Betker
9e8a9bf6ca
Various fixes to gpt_tts_hf
2021-12-16 23:28:44 -07:00
James Betker
62c8ed9a29
move speech utils
2021-12-16 20:47:37 -07:00
James Betker
4f8c4d130c
gpt_tts_hf: pad mel tokens with an <end_of_sequence> token.
2021-12-12 20:04:50 -07:00
James Betker
8917c02a4d
gpt_tts_hf inference first pass
2021-12-12 19:51:44 -07:00
James Betker
5a664aa56e
misc
2021-12-11 08:17:26 -07:00