Commit Graph

760 Commits

Author SHA1 Message Date
mrq
15b3c20e19 also throw exception for zero'd out tensor during training (I am very paranoid now) 2025-02-22 14:09:41 -06:00
mrq
ab0abd2b12 fixes fixes fixes (a quarter of my recently processed audio returned zero'd tensors......) 2025-02-22 09:07:33 -06:00
mrq
50506e5ebc oops 2025-02-20 20:55:58 -06:00
mrq
fc1ec2019d added option to buffer process jobs across multiple speakers to maybe squeeze out some throughput speeds for vall_e.emb.process (in the event of lots of speakers with low file counts, such as Emilia) 2025-02-20 14:56:32 -06:00
mrq
ce1ca0124a lol... 2025-02-20 13:40:36 -06:00
mrq
92139b6da9 additional cruft, added a note in documentation to be aware of NUMA node topology when running vall_e.emb.process with more than one process 2025-02-18 19:56:30 -06:00
mrq
596c2df11c added arg to skip processing speakers with not enough utterances for whenever I get around to processing my subset of Emilia for nvidia/audio-codec-44khz (because Emilia has a ton of low-utterance speaker counts and right now my focus with the nemo model is on getting it to actually speak without many problems rather than feeding it a gorillion speakers) 2025-02-18 10:49:21 -06:00
mrq
8331eee6fa added arg to limit vall_e.emb.process batch size since there's some speaker groups in LibriLight/Speech/whatever that have 10K utterances and I'm growing impatient 2025-02-18 10:19:17 -06:00
mrq
8f86cf0e4e possible logic optimization so I don't spend another 15 minutes simply iterating back to the point I was at in vall_e.emb.process 2025-02-16 11:34:05 -06:00
mrq
13c3a08853 nevermind thats slow 2025-02-14 16:35:17 -06:00
mrq
285e493b12 ugh.......... 2025-02-14 16:24:34 -06:00
mrq
a65c8144f4 with the amount of tweaks I keep making I could have probably had the nvidia/audio-codec-44khz model realized already...... 2025-02-13 18:38:40 -06:00
mrq
e3becec0e8 more better-er loss calc I suppose 2025-02-13 12:49:53 -06:00
mrq
e8f182b634 cleaned up loss calc code (it REALLY hates ignore_loss_for_inputs, but is fine with splitting with loss factors) 2025-02-13 09:35:27 -06:00
mrq
319ca09a4f cleanup 2025-02-12 23:36:32 -06:00
mrq
b52c5c5d80 this seems to work in testing 2025-02-12 16:16:04 -06:00
mrq
e029a8804d ironically none of this cruft gets the loss lower than the original way 2025-02-12 11:17:00 -06:00
mrq
4b31f5c808 this seems preferable 2025-02-12 00:36:50 -06:00
mrq
04fef5dad5 agony 2025-02-12 00:18:24 -06:00
mrq
e5916ea519 for my sanity: it seems having extraneous tokens in the embedding/classifier keeps the loss/acc a little higher than it should be 2025-02-11 14:47:35 -06:00
mrq
d4a6709fb4 stopgap cringe to get this training session working (it does not seem fruitful) 2025-02-11 13:45:09 -06:00
mrq
c0b46b82eb tweaks 2025-02-10 21:48:29 -06:00
mrq
d6a679ca5c tweaks 2025-02-10 20:53:08 -06:00
mrq
276a2342a4 tweaks to processing script 2025-02-10 19:18:13 -06:00
mrq
b3f9b76fd9 invalidate a path if loading via metadata and entry is not in hdf5 (to avoid reparsing my metadata since I'm using a partial copy of my dataset at the moment) 2025-02-10 14:43:15 -06:00
mrq
075ffef68a ugh 2025-02-09 13:02:51 -06:00
mrq
953015748f ugh 2025-02-07 20:49:28 -06:00
mrq
ed94b261dc could have sworn I had 'vall_e.emb.process --dtype' working, also possible RAM optimization so I can stop locking up my server when firing four encoding processes 2025-02-07 18:52:19 -06:00
mrq
47eb498046 more tweaks 2025-02-06 23:26:26 -06:00
mrq
67a9401cce oops 2025-02-06 15:14:14 -06:00
mrq
712ce4af5d maybe fixed errors with DAC backend, added option to limit by duration in emb.process (because I only really need short utterances right now and I'm not ready to spend a week on processing everything again) 2025-02-06 12:37:18 -06:00
mrq
299cc88821 re-added amp encoding/decoding for audio, possible bad idea to ignore using amp instead if requested 2025-02-05 21:55:06 -06:00
mrq
7592befc53 updated vall_e.emb.process to allow for batched processing, some typo fixes (it's painfully slow on my 7900XTX...) 2025-02-05 21:13:20 -06:00
mrq
79c504c278 cleaned up encode/decode functions to make them a little more coherent, added option to batch encode/decode (would have been very nice in the past, but this should speed things up for me when I fall for the latest meme codec) 2025-02-05 20:54:31 -06:00
mrq
84174c1c1b oops 2025-02-05 10:25:03 -06:00
mrq
bb2ebe1ca2 fixed issues that may rise from updating transformers with attention, added nvidia/audio-codec-44khz backend support (by gutting everything necessary because I do NOT want to install more dependencies) 2025-02-04 20:30:07 -06:00
mrq
0841f366e8 I should really just grab modelling_llama wholesale (fix for the adapted attention class) 2025-01-28 21:55:05 -06:00
mrq
e5f9da2221 oops 2025-01-21 11:59:24 -06:00
mrq
69c1d2991f updated mixtral backend (need this for something else) 2025-01-20 21:50:56 -06:00
mrq
1a26f789a5 added option to playback audio directly, removed no-phonemize option since I swear it worked in testing but it doesn't actually work 2025-01-12 21:52:49 -06:00
mrq
9fa87c417a added option to use raw text rather than the IPA phonemes (it requires a model trained on raw text) 2025-01-06 00:10:43 -06:00
mrq
3ab11bdc7b oops 2025-01-05 23:53:17 -06:00
mrq
b445f4abb6 experimental 2025-01-05 19:05:00 -06:00
mrq
2e6a7625e4 experimental 2025-01-05 12:47:03 -06:00
mrq
31cfef59c4 when you do more training thinking the original model that could do NS/SR got deleted, but it was actually just a string not having its quotes in the right place....... 2024-12-27 18:16:57 -06:00
mrq
9b0d2ccbe1 2024-12-26 21:42:17 -06:00
mrq
59f56ad099 cleanup 2024-12-24 23:14:32 -06:00
mrq
82e8592f2a working vall_e.cpp 2024-12-24 17:54:48 -06:00
mrq
497bdfc67b more work (the wall is non-causal decoding......) 2024-12-22 20:11:31 -06:00
mrq
5f289db275 ugh 2024-12-22 16:15:24 -06:00
mrq
0d4329d2e3 sanity cleanup 2024-12-22 15:05:45 -06:00
mrq
353e478e68 agony 2024-12-21 22:52:10 -06:00
mrq
5788db849b added extremely barebones vall_e.cpp so I can stop having to juggle this file around so much 2024-12-21 10:57:02 -06:00
mrq
91caf00212 ugh 2024-12-20 17:13:37 -06:00
mrq
d85273609e corrected export.py's --hf 2024-12-20 15:17:13 -06:00
mrq
59bf6b8b33 exposed additional task (ns, sr, vc) (vc is experimental) 2024-12-20 11:15:29 -06:00
mrq
53230efd74 changed prompt_inject_noise to prompt_inject_noise_p so I can have another reason to do this post-training 2024-12-19 19:28:50 -06:00
mrq
e7e7f48043 livid 2024-12-19 19:25:27 -06:00
mrq
8838babcba sanity checks (and I realized that the model actually had langs set to 4 in the yaml for KO/ZH so................) 2024-12-19 19:08:57 -06:00
mrq
7617b6485f instead just compute a bunch of stuff on the transcriptions to store later in different names so I can just retrieve what I want, also added tongue twisters for nefarious reasons 2024-12-18 23:43:11 -06:00
mrq
4775edaa41 added text cleaning/normalization for wer purposes but it amounts to nothing desu 2024-12-18 19:58:53 -06:00
mrq
9090c34f10 cringe script to process seed-tts-eval's eval dataset into something i can easily use 2024-12-17 22:47:12 -06:00
mrq
ed152f78df tweaks to prompt duration to allow me to divorce how i use it for training with how I'm using it for the demo page, and demo page tweaks to make my life easier 2024-12-17 19:33:04 -06:00
mrq
7129582303 actually do proper wer/cer calculation by un-normalizing the scores 2024-12-17 14:22:30 -06:00
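The "un-normalizing" this commit mentions presumably refers to pooling raw edit counts across utterances instead of averaging per-utterance rates. A minimal sketch of that idea (the function names are hypothetical, not the repo's actual code):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over token lists.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def corpus_wer(pairs):
    """Pooled WER: un-normalize each utterance's score back into raw edit
    counts, then divide once by the total reference word count."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return edits / words

pairs = [("the cat sat", "the cat sat"), ("a b c d", "a x c d")]
# 0 + 1 edits over 3 + 4 reference words
assert abs(corpus_wer(pairs) - 1 / 7) < 1e-9
```

Averaging per-utterance WERs instead would weight a one-word utterance the same as a fifty-word one, which skews the aggregate score.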
mrq
c2c6d912ac actually do speaker verification 2024-12-17 10:11:14 -06:00
mrq
c2e17e287b really shoddy voice conversion implementation (it sort of works...) 2024-12-16 22:54:53 -06:00
mrq
8515038968 imagine my disappointment when the epoch finished just for it to throw an exception 2024-12-16 18:28:01 -06:00
mrq
4a65ac9eb7 oops 2024-12-15 17:21:51 -06:00
mrq
cd4a5f427c KO/ZH model soon 2024-12-15 17:01:14 -06:00
mrq
4800e7179a remove nan checks because it causes problems in distributed training because I'm not syncing between GPUs (and nan losses gets ignored anyways with loss scaling) 2024-12-15 09:42:54 -06:00
mrq
2ba6b483dc ugh 2024-12-14 22:43:51 -06:00
mrq
3dd31e74d1 finally figured out a clean way to handle "resuming" the tqdm bar 2024-12-14 18:44:43 -06:00
mrq
35389481ee move lazy-stored ortho matrix to the grad device for apollo because agony 2024-12-13 23:22:26 -06:00
mrq
09804ecc16 APOLLO tweaks to make it work with deepspeed 2024-12-13 23:03:52 -06:00
mrq
64c67160a3 tweaks 2024-12-13 19:00:35 -06:00
mrq
0fbfb8bbe8 actually save the optimizer for the local engine backend because safetensors doesn't save it 2024-12-12 17:12:59 -06:00
mrq
f41251f648 more fixes for local engine backend 2024-12-12 14:38:42 -06:00
mrq
6b237ae5e3 tweaks for the local engine orchestrator (that I never caught since I always used the deepspeed backend) 2024-12-12 13:37:38 -06:00
mrq
9a62e3b824 APOLLO cringe (doesn't want to work with deepspeed) 2024-12-12 00:31:58 -06:00
mrq
cddf8ca814 sort batches to try and reduce number of padded tokens in batched inference (also commented out F5 samples getting added to the demo page because I would have to regenerate them) 2024-12-11 22:45:38 -06:00
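The batch-sorting idea above can be sketched in a few lines: group sequences by length so each batch only pads up to its own longest member. This is a generic illustration with hypothetical helper names, not the repo's actual batching code:

```python
def sort_into_batches(items, batch_size):
    """Batch items after sorting by length, so padding within each batch
    is bounded by that batch's longest item (hypothetical helper)."""
    ordered = sorted(items, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padded_tokens(batches):
    # Total token count after padding every item to its batch's max length.
    return sum(max(map(len, b)) * len(b) for b in batches)

seqs = ["a" * n for n in (3, 50, 4, 48, 5, 47)]
unsorted = [seqs[i:i + 2] for i in range(0, len(seqs), 2)]
# Length-sorted batches pad far fewer tokens than arrival-order batches.
assert padded_tokens(sort_into_batches(seqs, 2)) < padded_tokens(unsorted)
```

The trade-off is that length-sorted batches are no longer in request order, so outputs have to be scattered back to their original indices after inference.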
mrq
20b87bfbd0 store metrics and only recalculate them if the output file is newer than the metrics file 2024-12-11 20:55:43 -06:00
mrq
0c69e798f7 template cleanup 2024-12-11 20:06:55 -06:00
mrq
7e54e897f7 also shifted to transformer's pipeline for transcribing 2024-12-11 19:57:53 -06:00
mrq
b81a98799b uplifting transformer's WavLM stuff to do speaker verification instead 2024-12-11 19:30:05 -06:00
mrq
6468e5d124 lol 2024-12-11 19:10:32 -06:00
mrq
6f1ee0c6fa Added CER, transcription/similarity model args in demo 2024-12-10 21:00:51 -06:00
mrq
8568a93dad added WER/SIM-O metrics, added APOLLO but I need to test it 2024-12-10 20:13:21 -06:00
mrq
a6c745bafb chinese (mandarin?) support added (I guess I don't need pinyin, but tone markers are handled), korean validated, vocab adjusted 2024-12-09 14:26:19 -06:00
mrq
3ef8894290 oops 2024-12-08 15:24:21 -06:00
mrq
1d460b9fe3 logic fixes, I feel like output is better? (also NAR can have a temperature, I imagine it couldn't because it was having a causal mask passed to it for the longest time before I caught it a month ago) 2024-12-08 14:52:47 -06:00
mrq
0c5a458b00 deduce language per line to allow for a cheap way to allow for cross-lingual switching, kinda 2024-12-07 22:57:29 -06:00
mrq
a032ff588f doc update, added automatically deducing language from a given text, also checks if the input is already phonemized text to allow direct control without being cringe (procrastinating adding WER/SIM-O) 2024-12-07 22:34:25 -06:00
mrq
5d80a2d0d4 fixed NAR-len issues with non-english maybe (langs weren't being passed), added interface to inference in batches through tts.batched_inference (no support for rolling context/prefixes because there's no way to do that), demo page uses batched inferencing now 2024-12-07 19:21:05 -06:00
mrq
1f54bf5b40 revert sageattn back to optional dependency because it's not on windows, force resize_modules on by default because I broke something 2024-12-07 17:09:39 -06:00
mrq
218d0e29fd ugh (batchmean actually expects batch=seq_len, and not the actual batch) 2024-12-07 12:39:01 -06:00
mrq
61ed662856 ACTUALLY actually fix KD-loss (the -inf in the logits was caused by cringecode) 2024-12-07 12:31:54 -06:00
mrq
f97e8b0c7f ACTUALLY do KD-loss because of an oversight with masked_select outputting 1D tensors that get softmax'd in total 2024-12-07 09:52:51 -06:00
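The shape pitfall this commit describes can be illustrated outside of torch: like `torch.masked_select`, numpy boolean indexing flattens its result, so a softmax over the selection normalizes across every kept logit at once instead of per position. A minimal numpy sketch of the bug and the fix (the repo's actual KD-loss code operates on torch tensors):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy logits: 4 positions x 5 vocab entries, last position masked out.
logits = np.arange(20, dtype=np.float64).reshape(4, 5)
mask = np.array([True, True, True, False])

# Pitfall: flattening to 1D before softmax yields ONE distribution
# over all 15 selected logits, not one distribution per position.
wrong = softmax(logits[mask].ravel())        # shape (15,), sums to 1 in total

# Fix: keep the position dimension so softmax runs over the vocab axis.
right = softmax(logits[mask], axis=-1)       # shape (3, 5), each row sums to 1
assert np.allclose(right.sum(axis=-1), 1.0)
```

With the flattened version, the KL term compares two meaningless global distributions, which is consistent with the "softmax'd in total" oversight noted here.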
mrq
34a66e1052 agnostified KD 2024-12-06 23:53:46 -06:00
mrq
953d3eb030 ugh 2024-12-06 22:35:30 -06:00
mrq
42fafbaaca actually fixed knowledge distillation because of errant -inf logits that caused problems and needed to be filtered (and splitting text language / output audio language because it helps) 2024-12-06 21:55:20 -06:00