|
95da4e9405
|
made muon actually work by actually utilizing param groups (thanks APOLLO for reminding me this is the sane way to handle this split)
|
2025-02-26 10:39:13 -06:00 |
|
|
de27115bb7
|
there's something wrong with it on my 4xV100 rig......
|
2025-02-25 15:14:08 -06:00 |
|
|
db181f8e88
|
only do auto=equal for nemo as it's an FSQ
|
2025-02-24 21:07:44 -06:00 |
|
|
a5a04c39ef
|
when the
|
2025-02-24 21:03:23 -06:00 |
|
|
918e0dbac1
|
small slop cleanup
|
2025-02-24 19:03:53 -06:00 |
|
|
3330b5bb00
|
maybe fix NaNs being thrown for immature models at fp16 for training evals
|
2025-02-24 18:25:54 -06:00 |
|
|
0f39f4d7a1
|
lol
|
2025-02-24 17:51:35 -06:00 |
|
|
33d5a7109a
|
it's a miracle I was able to get a semblance of audio with the naive AudioEncoder (now it interleaves properly)
|
2025-02-24 14:39:12 -06:00 |
|
|
6e7b269147
|
ugh
|
2025-02-24 13:54:21 -06:00 |
|
|
8f5a3997bd
|
another experimental flag
|
2025-02-24 13:50:41 -06:00 |
|
|
f593ee98fc
|
ugh
|
2025-02-23 21:20:36 -06:00 |
|
|
cbf6b84e27
|
fixed grad norm and loss scale not reporting for local trainer
|
2025-02-23 19:08:26 -06:00 |
|
|
b640fabab5
|
borrowed muon since it might work better under deepspeed and not require cruft (even though it really does not like the masked-NAR), also made the masked-NAR faux-causal since it might help out for cfg.model.version >= 7
|
2025-02-23 17:23:24 -06:00 |
|
|
d33ccd188a
|
ugh
|
2025-02-23 12:31:07 -06:00 |
|
|
8f3c3e01ee
|
oops
|
2025-02-23 12:09:56 -06:00 |
|
|
b39aaacd77
|
oops
|
2025-02-23 11:55:43 -06:00 |
|
|
3019c88799
|
separate mask token and stop token because this might cause issues
|
2025-02-23 11:36:32 -06:00 |
|
|
6634d07576
|
added muon optimizer through kludge hacks because it necessitates a second optimizer in tandem that seems to only sometimes work with deepspeed
|
2025-02-23 11:22:13 -06:00 |
|
|
67a6009555
|
(finally) added parallel AR for cfg.model.version >= 7 (nvidia/audio-codec-44khz is being a pain and it might require training purely AR first......)
|
2025-02-23 08:31:03 -06:00 |
|
|
15b3c20e19
|
also throw exception for zero'd out tensor during training (I am very paranoid now)
|
2025-02-22 14:09:41 -06:00 |
|
|
ab0abd2b12
|
fixes fixes fixes (a quarter of my recently processed audio returned zero'd tensors......)
|
2025-02-22 09:07:33 -06:00 |
|
|
50506e5ebc
|
oops
|
2025-02-20 20:55:58 -06:00 |
|
|
fc1ec2019d
|
added option to buffer process jobs across multiple speakers to maybe squeeze out some throughput speeds for vall_e.emb.process (in the event of lots of speakers with low file counts, such as Emilia)
|
2025-02-20 14:56:32 -06:00 |
|
|
ce1ca0124a
|
lol...
|
2025-02-20 13:40:36 -06:00 |
|
|
92139b6da9
|
additional cruft, added a note in documentation to be aware of NUMA node topology when running vall_e.emb.process with more than one process
|
2025-02-18 19:56:30 -06:00 |
|
|
596c2df11c
|
added arg to skip processing speakers with not enough utterances for whenever I get around to processing my subset of Emilia for nvidia/audio-codec-44khz (because Emilia has a ton of low-utterance speaker counts and right now my focus with the nemo model is on getting it to actually speak without much problems rather than feed it a gorillion speakers)
|
2025-02-18 10:49:21 -06:00 |
|
|
8331eee6fa
|
added arg to limit vall_e.emb.process batch size since there are some speaker groups in LibriLight/Speech/whatever that have 10K utterances and I'm getting impatient
|
2025-02-18 10:19:17 -06:00 |
|
|
8f86cf0e4e
|
possible logic optimization so I don't spend another 15 minutes simply iterating back to the point I was at in vall_e.emb.process
|
2025-02-16 11:34:05 -06:00 |
|
|
13c3a08853
|
nevermind thats slow
|
2025-02-14 16:35:17 -06:00 |
|
|
285e493b12
|
ugh..........
|
2025-02-14 16:24:34 -06:00 |
|
|
a65c8144f4
|
with the amount of tweaks I keep making I could have probably had the nvidia/audio-codec-44khz model realized already......
|
2025-02-13 18:38:40 -06:00 |
|
|
e3becec0e8
|
more better-er loss calc I suppose
|
2025-02-13 12:49:53 -06:00 |
|
|
e8f182b634
|
cleaned up loss calc code (it REALLY hates ignore_loss_for_inputs, but is fine with splitting with loss factors)
|
2025-02-13 09:35:27 -06:00 |
|
|
319ca09a4f
|
cleanup
|
2025-02-12 23:36:32 -06:00 |
|
|
b52c5c5d80
|
this seems to work in testing
|
2025-02-12 16:16:04 -06:00 |
|
|
e029a8804d
|
ironically none of this cruft gets the loss lower than the original way
|
2025-02-12 11:17:00 -06:00 |
|
|
4b31f5c808
|
this seems preferable
|
2025-02-12 00:36:50 -06:00 |
|
|
04fef5dad5
|
agony
|
2025-02-12 00:18:24 -06:00 |
|
|
e5916ea519
|
for my sanity it seems having extraneous tokens in the embedding/classifier has the loss/acc a little higher than it should
|
2025-02-11 14:47:35 -06:00 |
|
|
d4a6709fb4
|
stopgap cringe to get this training session working (it does not seem fruitful)
|
2025-02-11 13:45:09 -06:00 |
|
|
c0b46b82eb
|
tweaks
|
2025-02-10 21:48:29 -06:00 |
|
|
d6a679ca5c
|
tweaks
|
2025-02-10 20:53:08 -06:00 |
|
|
276a2342a4
|
tweaks to processing script
|
2025-02-10 19:18:13 -06:00 |
|
|
b3f9b76fd9
|
invalidate a path if loading via metadata and entry is not in hdf5 (to avoid reparsing my metadata since I'm using a partial copy of my dataset at the moment)
|
2025-02-10 14:43:15 -06:00 |
|
|
075ffef68a
|
ugh
|
2025-02-09 13:02:51 -06:00 |
|
|
953015748f
|
ugh
|
2025-02-07 20:49:28 -06:00 |
|
|
ed94b261dc
|
could have sworn i had 'vall_e.emb.process --dtype' working, also possible RAM optimization so I can stop locking up my server when firing four encoding processes
|
2025-02-07 18:52:19 -06:00 |
|
|
47eb498046
|
more tweaks
|
2025-02-06 23:26:26 -06:00 |
|
|
67a9401cce
|
oops
|
2025-02-06 15:14:14 -06:00 |
|
|
712ce4af5d
|
maybe fixed errors with DAC backend, added option to limit by duration in emb.process (because I only really need short utterances right now and I'm not ready to spend a week on processing everything again)
|
2025-02-06 12:37:18 -06:00 |
|
|
299cc88821
|
re-added amp encoding/decoding for audio, possible bad idea to ignore using amp instead if requested
|
2025-02-05 21:55:06 -06:00 |
|
|
7592befc53
|
updated vall_e.emb.process to allow for batched processing, some typo fixes (it's painfully slow on my 7900XTX...)
|
2025-02-05 21:13:20 -06:00 |
|
|
79c504c278
|
cleaned up encode/decode functions to make them a little more coherent, added option to batch encode/decode (would have been very nice in the past, but this should speed things up for me when i fall for the latest meme codec)
|
2025-02-05 20:54:31 -06:00 |
|
|
84174c1c1b
|
oops
|
2025-02-05 10:25:03 -06:00 |
|
|
bb2ebe1ca2
|
fixed issues that may arise from updating transformers with attention, added nvidia/audio-codec-44khz backend support (by gutting everything necessary because I do NOT want to install more dependencies)
|
2025-02-04 20:30:07 -06:00 |
|
|
0841f366e8
|
I should really just grab modelling_llama wholesale (fix for the adapted attention class)
|
2025-01-28 21:55:05 -06:00 |
|
|
e5f9da2221
|
oops
|
2025-01-21 11:59:24 -06:00 |
|
|
69c1d2991f
|
updated mixtral backend (need this for something else)
|
2025-01-20 21:50:56 -06:00 |
|
|
1a26f789a5
|
added option to playback audio directly, removed no-phonemize option since I swear it worked in testing but it doesn't actually work
|
2025-01-12 21:52:49 -06:00 |
|
|
9fa87c417a
|
added option to use raw text rather than the IPA phonemes (it requires a model trained on raw text)
|
2025-01-06 00:10:43 -06:00 |
|
|
3ab11bdc7b
|
oops
|
2025-01-05 23:53:17 -06:00 |
|
|
b445f4abb6
|
experimental
|
2025-01-05 19:05:00 -06:00 |
|
|
2e6a7625e4
|
experimental
|
2025-01-05 12:47:03 -06:00 |
|
|
31cfef59c4
|
when you do more training thinking the original model that can do NS/SR got deleted but it was actually a string not having its quotes in the right place.......
|
2024-12-27 18:16:57 -06:00 |
|
|
9b0d2ccbe1
|
|
2024-12-26 21:42:17 -06:00 |
|
|
59f56ad099
|
cleanup
|
2024-12-24 23:14:32 -06:00 |
|
|
82e8592f2a
|
working vall_e.cpp
|
2024-12-24 17:54:48 -06:00 |
|
|
497bdfc67b
|
more work (the wall is non-causal decoding......)
|
2024-12-22 20:11:31 -06:00 |
|
|
5f289db275
|
ugh
|
2024-12-22 16:15:24 -06:00 |
|
|
0d4329d2e3
|
sanity cleanup
|
2024-12-22 15:05:45 -06:00 |
|
|
353e478e68
|
agony
|
2024-12-21 22:52:10 -06:00 |
|
|
5788db849b
|
added extremely barebones vall_e.cpp so I can stop having to juggle this file around so much
|
2024-12-21 10:57:02 -06:00 |
|
|
91caf00212
|
ugh
|
2024-12-20 17:13:37 -06:00 |
|
|
d85273609e
|
corrected export.py's --hf
|
2024-12-20 15:17:13 -06:00 |
|
|
59bf6b8b33
|
exposed additional task (ns, sr, vc) (vc is experimental)
|
2024-12-20 11:15:29 -06:00 |
|
|
53230efd74
|
changed prompt_inject_noise to prompt_inject_noise_p so I can have another reason to do this post-training
|
2024-12-19 19:28:50 -06:00 |
|
|
e7e7f48043
|
livid
|
2024-12-19 19:25:27 -06:00 |
|
|
8838babcba
|
sanity checks (and I realized that the model actually had langs set to 4 in the yaml for KO/ZH so................)
|
2024-12-19 19:08:57 -06:00 |
|
|
7617b6485f
|
instead just compute a bunch of stuff on the transcriptions to store later in different names so I can just retrieve what I want, also added tongue twisters for nefarious reasons
|
2024-12-18 23:43:11 -06:00 |
|
|
4775edaa41
|
added text cleaning/normalization for wer purposes but it amounts to nothing desu
|
2024-12-18 19:58:53 -06:00 |
|
|
9090c34f10
|
cringe script to process seed-tts-eval's eval dataset into something i can easily use
|
2024-12-17 22:47:12 -06:00 |
|
|
ed152f78df
|
tweaks to prompt duration to allow me to divorce how i use it for training with how I'm using it for the demo page, and demo page tweaks to make my life easier
|
2024-12-17 19:33:04 -06:00 |
|
|
7129582303
|
actually do proper wer/cer calculation by un-normalizing the scores
|
2024-12-17 14:22:30 -06:00 |
|
|
c2c6d912ac
|
actually do speaker verification
|
2024-12-17 10:11:14 -06:00 |
|
|
c2e17e287b
|
really shoddy voice conversion implementation (it sort of works...)
|
2024-12-16 22:54:53 -06:00 |
|
|
8515038968
|
imagine my disappointment when the epoch finished just for it to throw an exception
|
2024-12-16 18:28:01 -06:00 |
|
|
4a65ac9eb7
|
oops
|
2024-12-15 17:21:51 -06:00 |
|
|
cd4a5f427c
|
KO/ZH model soon
|
2024-12-15 17:01:14 -06:00 |
|
|
4800e7179a
|
remove nan checks because they cause problems in distributed training because I'm not syncing between GPUs (and nan losses get ignored anyway with loss scaling)
|
2024-12-15 09:42:54 -06:00 |
|
|
2ba6b483dc
|
ugh
|
2024-12-14 22:43:51 -06:00 |
|
|
3dd31e74d1
|
finally figured out a clean way to handle "resuming" the tqdm bar
|
2024-12-14 18:44:43 -06:00 |
|
|
35389481ee
|
move lazy-stored ortho matrix to the grad device for apollo because agony
|
2024-12-13 23:22:26 -06:00 |
|
|
09804ecc16
|
APOLLO tweaks to make it work with deepspeed
|
2024-12-13 23:03:52 -06:00 |
|
|
64c67160a3
|
tweaks
|
2024-12-13 19:00:35 -06:00 |
|
|
0fbfb8bbe8
|
actually save the optimizer for the local engine backend because safetensors doesn't save it
|
2024-12-12 17:12:59 -06:00 |
|
|
f41251f648
|
more fixes for local engine backend
|
2024-12-12 14:38:42 -06:00 |
|
|
6b237ae5e3
|
tweaks for the local engine orchestrator (that I never caught since I always used the deepspeed backend)
|
2024-12-12 13:37:38 -06:00 |
|
|
9a62e3b824
|
APOLLO cringe (doesn't want to work with deepspeed)
|
2024-12-12 00:31:58 -06:00 |
|
|
cddf8ca814
|
sort batches to try and reduce number of padded tokens in batched inference (also commented out F5 samples getting added to the demo page because I would have to regenerate them)
|
2024-12-11 22:45:38 -06:00 |
|
|
20b87bfbd0
|
store metrics and only recalculate them if the output file is newer than the metrics file
|
2024-12-11 20:55:43 -06:00 |
|
|
0c69e798f7
|
template cleanup
|
2024-12-11 20:06:55 -06:00 |
|
|
7e54e897f7
|
also shifted to transformers' pipeline for transcribing
|
2024-12-11 19:57:53 -06:00 |
|
|
b81a98799b
|
uplifting transformers' WavLM stuff to do speaker verification instead
|
2024-12-11 19:30:05 -06:00 |
|
|
6468e5d124
|
lol
|
2024-12-11 19:10:32 -06:00 |
|
|
6f1ee0c6fa
|
Added CER, transcription/similarity model args in demo
|
2024-12-10 21:00:51 -06:00 |
|
|
8568a93dad
|
added WER/SIM-O metrics, added APOLLO but I need to test it
|
2024-12-10 20:13:21 -06:00 |
|
|
a6c745bafb
|
chinese (mandarin?) support added (I guess I don't need pinyin, but tone markers are handled), korean validated, vocab adjusted
|
2024-12-09 14:26:19 -06:00 |
|
|
3ef8894290
|
oops
|
2024-12-08 15:24:21 -06:00 |
|
|
1d460b9fe3
|
logic fixes, I feel like output is better? (also NAR can have a temperature, I imagine it couldn't because it was having a causal mask passed to it for the longest time before I caught it a month ago)
|
2024-12-08 14:52:47 -06:00 |
|
|
0c5a458b00
|
deduce language per line to allow for a cheap way to allow for cross-lingual switching, kinda
|
2024-12-07 22:57:29 -06:00 |
|
|
a032ff588f
|
doc update, added automatically deducing language from a given text, also checks if the input is already phonemized text to allow direct control without being cringe (procrastinating adding WER/SIM-O)
|
2024-12-07 22:34:25 -06:00 |
|
|
5d80a2d0d4
|
fixed NAR-len issues with non-english maybe (langs weren't being passed), added interface to inference in batches through tts.batched_inference (no support for rolling context/prefixes because there's no way to do that), demo page uses batched inferencing now
|
2024-12-07 19:21:05 -06:00 |
|
|
1f54bf5b40
|
revert sageattn back to optional dependency because it's not on windows, force resize_modules on by default because I broke something
|
2024-12-07 17:09:39 -06:00 |
|
|
218d0e29fd
|
ugh (batchmean actually expects batch=seq_len, and not the actual batch)
|
2024-12-07 12:39:01 -06:00 |
|
|
61ed662856
|
ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode)
|
2024-12-07 12:31:54 -06:00 |
|
|
f97e8b0c7f
|
ACTUALLY do KD-loss because of an oversight with masked_select outputting 1D tensors that get softmax'd in total
|
2024-12-07 09:52:51 -06:00 |
|
|
34a66e1052
|
agnostified KD
|
2024-12-06 23:53:46 -06:00 |
|
|
953d3eb030
|
ugh
|
2024-12-06 22:35:30 -06:00 |
|
|
42fafbaaca
|
actually fixed knowledge distillation because of errant -inf logits causing problems and needed to be filtered (and splitting text language / output audio language because it helps)
|
2024-12-06 21:55:20 -06:00 |
|
|
23d402bf01
|
added knowledge distillation in the trainer (sadly it is not agnostic because of the grave mistake of further processing the batch within the forward pass, so subsequent calls do not match......)
|
2024-12-05 23:05:52 -06:00 |
|
|
4e21df8092
|
oops
|
2024-12-04 21:24:22 -06:00 |
|
|
93d27be539
|
rolling context finally (use last N utterances as the prefix for the next gen), option to split input text prompt by sentences instead of lines (or no splitting)
|
2024-12-04 20:31:44 -06:00 |
|
|
9dff68c0c5
|
NAR-len tweaks (remasks a small amount of tokens per step, it seems to help with reducing the number of steps needed some of the time?, disable CFG for the first half to speed things up)
|
2024-12-04 09:30:29 -06:00 |
|
|
cf97560e70
|
minimum CFG of 3 for NAR-len because it seems the model will auto-default to NAR-len now
|
2024-12-03 19:40:05 -06:00 |
|
|
ca31da0a95
|
sageattn (forgot to bother with testing this the other day, seems fine)
|
2024-12-03 15:14:57 -06:00 |
|
|
31ab90d84a
|
cringe code to convert to LlamaForCausalLM-happy weights + tokenizer dict (still need to write logic to actually use these weights for proper inferencing)
|
2024-12-03 10:18:58 -06:00 |
|
|
84a05acb6d
|
touch ups in docs
|
2024-12-02 19:10:42 -06:00 |
|
|
dcaf38b359
|
fixed training tqdm being stubborn
|
2024-11-23 09:45:23 -06:00 |
|
|
41d7c30ea5
|
added much cleaner non-causal mask generation
|
2024-11-22 19:43:32 -06:00 |
|
|
c99a74e834
|
actually generate a causal mask because it seems sometimes it does not actually generate one because it makes assumptions
|
2024-11-22 18:30:24 -06:00 |
|
|
ccee5fc11c
|
that was actually all pointless since sdpa always had an attention mask fed to it and does not need is_causal to implicitly generate one
|
2024-11-22 16:51:50 -06:00 |
|
|
4aa685e749
|
what has science done
|
2024-11-22 16:45:40 -06:00 |
|
|
147219a5e0
|
huge oversight in the attention masking......... (i realized I have not been providing a non-causal mask to non-causal tasks)
|
2024-11-22 13:44:43 -06:00 |
|
|
24d888c47c
|
temporarily dropping support for xformers because it's breaking when using an attention mask (which I don't remember commenting out when being passed), default to not use wandb because it's being a pain when doing tests and not actual sessions
|
2024-11-22 11:29:12 -06:00 |
|
|
8aafae91fd
|
dont use timeembedding
|
2024-11-21 23:14:52 -06:00 |
|
|
2cef97e43f
|
cleanup
|
2024-11-21 23:08:43 -06:00 |
|
|
3fc0540f49
|
m
|
2024-11-21 15:07:46 -06:00 |
|
|
6845c447c9
|
added more harvard sentences to load from a text file
|
2024-11-21 13:18:11 -06:00 |
|
|
2a084544e8
|
moved duration padding for NAR-len to be a scalar instead (since it seems longer utterances need it much more so than shorter utterances)
|
2024-11-21 13:04:07 -06:00 |
|
|
6aee08f9c0
|
moved stuff in the web UI around (un-experimented the max NAR-len steps because it's kind of important to adjust this value for better sounding audio / quicker generated audio)
|
2024-11-20 20:37:33 -06:00 |
|
|
dfdba3f190
|
oops
|
2024-11-20 19:21:03 -06:00 |
|
|
cd6e9ba2f2
|
oops
|
2024-11-20 16:27:51 -06:00 |
|
|
1a73ac6a20
|
I cannot believe it's not actually called Wand DB (added wandb logging support since I think it would have been a much better way to look at my metrics)
|
2024-11-20 16:10:47 -06:00 |
|
|
67f7bad168
|
added mixed modality AR+NAR-len to generate a short prefix through the AR, then inference with said prefix through the NAR-len (need to experiment with it more to ensure that the masked off tokens are the only tokens getting updated)
|
2024-11-20 14:22:12 -06:00 |
|
|
db64e6cb59
|
dependency updates (gradio 5.x now works on my machine)
|
2024-11-20 12:33:01 -06:00 |
|
|
b1369e7824
|
better modality selection (pick AR+NAR by default for the ar+nar model, pick NAR-len by default for the nar-len model), lowered default CFG because it makes the AR+NAR output sped up (but can't be too low since it's required for the NAR-len)
|
2024-11-19 18:51:17 -06:00 |
|
|
190a917b3e
|
I did it.
|
2024-11-19 12:24:33 -06:00 |
|
|
0e621354e7
|
cleaned up classifier-free guidance logit processing (in order to try and cope with a bad nar-len model)
|
2024-11-19 10:30:05 -06:00 |
|
|
5ba80686e1
|
two weeks of agony concludes
|
2024-11-18 21:29:28 -06:00 |
|
|
2b29790173
|
oops
|
2024-11-18 14:12:26 -06:00 |
|