James Betker
935a4e853e
get rid of nil tokens in <2>
2022-01-27 22:45:57 -07:00
James Betker
a77d376ad2
rename unet diffusion tts and add 3
2022-01-27 19:56:24 -07:00
James Betker
8c255811ad
more fixes
2022-01-25 17:57:16 -07:00
James Betker
0f3ca28e39
Allow diffusion model to be trained with masking tokens
2022-01-25 14:26:21 -07:00
James Betker
d18aec793a
Revert "(re) attempt diffusion checkpointing logic"
This reverts commit b22eec8fe3.
2022-01-22 09:14:50 -07:00
James Betker
b22eec8fe3
(re) attempt diffusion checkpointing logic
2022-01-22 08:34:40 -07:00
James Betker
8f48848f91
misc
2022-01-22 08:23:29 -07:00
James Betker
851070075a
text<->cond clip
I need that universal clip..
2022-01-22 08:23:14 -07:00
James Betker
8e2439f50d
Decrease resolution requirements to 2048
2022-01-20 11:27:49 -07:00
James Betker
4af8525dc3
Adjust diffusion vocoder to allow training individual levels
2022-01-19 13:37:59 -07:00
James Betker
ac13bfefe8
use_diffuse_tts
2022-01-19 00:35:24 -07:00
James Betker
bcd8cc51e1
Enable collated data for diffusion purposes
2022-01-19 00:35:08 -07:00
James Betker
dc9cd8c206
Update use_gpt_tts to be usable with unified_voice2
2022-01-18 21:14:17 -07:00
James Betker
7b4544b83a
Add an experimental unet_diffusion_tts to perform experiments on
2022-01-18 08:38:24 -07:00
James Betker
37e4e737b5
a few fixes
2022-01-16 15:17:17 -07:00
James Betker
9100e7fa9b
Add a diffusion network that takes aligned text instead of MELs
2022-01-15 17:28:02 -07:00
James Betker
009a1e8404
Add a new diffusion_vocoder that should be trainable faster
This new one has a "cheating" top layer, that does not feed down into the unet encoder,
but does consume the outputs of the unet. This cheater only operates on half of the input,
while the rest of the unet operates on the full input. This limits the dimensionality of this last
layer, on the assumption that these last layers consume by far the most computation and memory,
but do not require the full input context.
Losses are only computed on half of the aggregate input.
2022-01-11 17:26:07 -07:00
James Betker
91f28580e2
fix unified_voice
2022-01-10 16:17:31 -07:00
James Betker
136744dc1d
Fixes
2022-01-10 14:32:04 -07:00
James Betker
ee3dfac2ae
unified_voice2: decouple positional embeddings and token embeddings from underlying gpt model
2022-01-10 08:14:41 -07:00
James Betker
f503d8d96b
Partially implement performers in transformer_builders
2022-01-09 22:35:03 -07:00
James Betker
ec456b6733
Revert unified_voice back to beginning
I'll be doing my work within unified_voice2
2022-01-09 22:34:30 -07:00
James Betker
f474a7ac65
unified_voice2
2022-01-09 22:32:34 -07:00
James Betker
70b17da193
Alter unified_voice to use extensible transformer (still WIP)
2022-01-08 22:18:25 -07:00
James Betker
15d9517e26
Allow bi-directional clipping
2022-01-08 22:18:04 -07:00
James Betker
438dd9ed33
fix text-voice-clip bug
2022-01-08 08:55:00 -07:00
James Betker
34774f9948
unified_voice: begin decoupling from HF GPT
I'd like to try some different (newer) transformer variants. The way to get
there is softly decoupling the transformer portion of this architecture
from GPT. This actually should be fairly easy.
2022-01-07 22:51:24 -07:00
James Betker
68090ac3e9
Finish up the text->voice clip model
2022-01-07 22:28:45 -07:00
James Betker
65ffe38fce
misc
2022-01-06 22:16:17 -07:00
James Betker
e7a705fe6e
Make gpt_asr_hf2 more efficient at inference
2022-01-06 10:27:10 -07:00
James Betker
525addffab
Unified: automatically clip inputs according to specified max length to improve inference time
2022-01-06 10:13:45 -07:00
James Betker
61cd351b71
update unified
2022-01-06 09:48:11 -07:00
James Betker
10fd1110be
Fix (?) use_gpt_tts for unified_voice
2022-01-05 20:09:31 -07:00
James Betker
3c4301f085
Remove dvae_arch_playground
2022-01-05 17:06:45 -07:00
James Betker
c584ba05ee
unified_voice improvements
- Rename max_symbols_per_phrase to max_text_tokens
- Remove max_total_tokens (no longer necessary)
- Fix integration with MelEncoder
2022-01-05 17:03:53 -07:00
James Betker
38aba6f88d
Another dumdum fix
2022-01-04 15:18:25 -07:00
James Betker
963c6072bb
Add mel_encoder and solo embeddings to unified_voice
2022-01-04 15:15:58 -07:00
James Betker
2165124f19
Add GPT documentation
2022-01-01 21:00:07 -07:00
James Betker
2635412291
doh
2022-01-01 14:29:59 -07:00
James Betker
d4a6298658
more debugging
2022-01-01 14:25:27 -07:00
James Betker
d8111e0477
misc
2022-01-01 14:05:33 -07:00
James Betker
dc535b5358
better bounds
2022-01-01 14:05:22 -07:00
James Betker
fe9ea4e01a
auto-fix text_inputs too big
2022-01-01 13:25:47 -07:00
James Betker
bbacffb790
dataset improvements and fix to unified_voice_Bilevel
2022-01-01 00:16:30 -07:00
James Betker
eda753e776
Allow conditioning shuffling to be disabled
2021-12-31 23:32:08 -07:00
James Betker
9aa06542cd
Further reduce the complexity of the MEL encoder in GptAsrHf
2021-12-30 09:10:40 -07:00
James Betker
5ae7e0d9b0
Fix gapping bug in voice2voice clip
2021-12-29 14:44:46 -07:00
James Betker
b12f47b36d
Add some noise to voice_voice_clip
2021-12-29 13:56:30 -07:00
James Betker
b24a51f0aa
Check in speech2speech CLIP inference tool
2021-12-29 00:19:44 -07:00
James Betker
c1bef01dfa
GptAsrHf2 checkin
2021-12-28 20:48:38 -07:00
James Betker
07c2b9907c
Add voice2voice clip model
2021-12-28 16:18:12 -07:00
James Betker
a9ee5b624f
Simplify and conform gpt_asr_hf2
2021-12-28 11:54:33 -07:00
James Betker
a5b4bee719
Improve asr_eval
2021-12-28 11:45:15 -07:00
James Betker
312f631c5b
gpt_asr_hf2: remove dual positional embeddings
2021-12-28 10:57:45 -07:00
James Betker
a12042ea99
Allow multi-embeddings to be disabled
2021-12-28 09:00:53 -07:00
James Betker
a698d3f525
unified_voice: introduce paired embeddings
2021-12-26 15:33:05 -07:00
James Betker
6996dfd9d5
asr_hf2: add independent position embedders
2021-12-26 15:17:24 -07:00
James Betker
5b5cbc057c
Work checkpoint for gpt asr hf2
2021-12-26 10:29:12 -07:00
James Betker
cd89e6b42e
Initialize our embeddings the same way GPT-2 initializes theirs.
2021-12-26 00:20:30 -07:00
James Betker
8d01f7685c
Get rid of absolute positional embeddings in unifiedvoice
2021-12-26 00:10:24 -07:00
James Betker
6700f8851d
moar verbosity
2021-12-25 23:23:21 -07:00
James Betker
8acf3b3097
Better dimensional asserting
2021-12-25 23:18:25 -07:00
James Betker
e959541494
Add position embeddings back into unified_voice
I think this may be the solution to the day's problems.
2021-12-25 23:10:56 -07:00
James Betker
ab9cafa572
Make tokenization configs more configurable
2021-12-25 12:17:50 -07:00
James Betker
52410fd9d9
256-bpe tokenizer
2021-12-25 08:52:08 -07:00
James Betker
8e26400ce2
Add inference for unified gpt
2021-12-24 13:27:06 -07:00
James Betker
8b19c37409
UnifiedGptVoice!
2021-12-23 15:20:26 -07:00
James Betker
e55d949855
GrandConjoinedDataset
2021-12-23 14:32:33 -07:00
James Betker
c737632eae
Train and use a bespoke tokenizer
2021-12-22 15:06:14 -07:00
James Betker
66bc60aeff
Re-add start_text_token
2021-12-22 14:10:35 -07:00
James Betker
a9629f7022
Try out using the GPT tokenizer rather than nv_tacotron
This results in a significant compression of the text domain; I'm curious what the
effect on speech quality will be.
2021-12-22 14:03:18 -07:00
James Betker
7ae7d423af
VoiceCLIP model
2021-12-22 13:44:11 -07:00
James Betker
09f7f3e615
Remove obsolete lucidrains DALLE stuff, re-create it in a dedicated folder
2021-12-22 13:44:02 -07:00
James Betker
a42b94ab72
gpt_tts_hf inference fixes
2021-12-22 13:22:15 -07:00
James Betker
48e3ee9a5b
Shuffle conditioning inputs along the positional axis to reduce fitting on prosody and other positional information
The mels should still retain some short-range positional information the model can use
for tone and frequencies, for example.
2021-12-20 19:05:56 -07:00
James Betker
53858b2055
Fix gpt_tts_hf inference
2021-12-20 17:45:26 -07:00
James Betker
712d746e9b
gpt_tts: format conditioning inputs more for contextual voice clues and less for prosody
also support single conditional inputs
2021-12-19 17:42:29 -07:00
James Betker
c813befd53
Remove dedicated positioning embeddings
2021-12-19 09:01:31 -07:00
James Betker
b4ddcd7111
More inference improvements
2021-12-19 09:01:19 -07:00
James Betker
f9c45d70f0
Fix mel terminator
2021-12-18 17:18:06 -07:00
James Betker
937045cb63
Fixes
2021-12-18 16:45:38 -07:00
James Betker
9b9f7ea61b
GptTtsHf: Make the input/target placement easier to reason about
2021-12-17 10:24:14 -07:00
James Betker
2fb4213a3e
More lossy fixes
2021-12-17 10:01:42 -07:00
James Betker
9e8a9bf6ca
Various fixes to gpt_tts_hf
2021-12-16 23:28:44 -07:00
James Betker
62c8ed9a29
move speech utils
2021-12-16 20:47:37 -07:00
James Betker
4f8c4d130c
gpt_tts_hf: pad mel tokens with an <end_of_sequence> token.
2021-12-12 20:04:50 -07:00
James Betker
8917c02a4d
gpt_tts_hf inference first pass
2021-12-12 19:51:44 -07:00
James Betker
5a664aa56e
misc
2021-12-11 08:17:26 -07:00
James Betker
6ccff3f49f
Record codes more often
2021-12-07 09:22:45 -07:00
James Betker
d0b2f931bf
Add feature to diffusion vocoder where the spectrogram conditioning layers can be re-trained apart from the rest of the model
2021-12-07 09:22:30 -07:00
James Betker
662920bde3
Log codes when simply fetching codebook_indices
2021-12-06 09:21:43 -07:00
James Betker
380a5d5475
gdi..
2021-12-03 08:53:09 -07:00
James Betker
101a01f744
Fix dvae codes issue
2021-12-02 23:28:36 -07:00
James Betker
07b0124712
GptTtsHf!
2021-12-02 21:48:42 -07:00
James Betker
85542ec547
One last fix for gpt_asr_hf2
2021-12-02 21:19:28 -07:00
James Betker
04454ee63a
Add evaluation logic for gpt_asr_hf2
2021-12-02 21:04:36 -07:00
James Betker
5956eb757c
ffffff
2021-11-24 00:19:47 -07:00
James Betker
f1ed0588e3
another fix
2021-11-24 00:11:21 -07:00
James Betker
7a3c4a4fc6
Fix lr quantizer decode
2021-11-24 00:01:26 -07:00
James Betker
3f6ecfe0db
q fix
2021-11-23 23:50:27 -07:00