84647f588a | more tweaks | 2024-09-18 16:43:57 -05:00
ebac1db16c | maybe final tweaks; I really needed to unify my JSON read/write, and orjson has proven fast enough for me to try and rely on it more | 2024-09-17 22:57:04 -05:00
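
A minimal sketch of what unified JSON helpers built on orjson could look like, with a stdlib fallback (the helper names and fallback behavior are assumptions, not the repo's actual code):

```python
from pathlib import Path

try:
    import orjson  # fast path

    def json_write(data, path):
        # orjson.dumps returns bytes, so write in binary mode
        Path(path).write_bytes(orjson.dumps(data))

    def json_read(path):
        return orjson.loads(Path(path).read_bytes())
except ImportError:
    import json  # stdlib fallback

    def json_write(data, path):
        Path(path).write_text(json.dumps(data))

    def json_read(path):
        return json.loads(Path(path).read_text())
```
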
a9fbe81f98 | oops | 2024-09-17 15:25:12 -05:00
c440c4fe7e | relegated similarity-data processing to vall_e.emb.similarity since it's easier; seems to work? | 2024-09-17 14:37:21 -05:00
56f25f7a9b | more stuff for similar-speaker prompt sampling (to-do: actually test if this works...) | 2024-09-16 23:10:29 -05:00
1c615a0f52 | helper script (vall_e.emb.similar) to figure out the best way to compute similarity scores for audio (I don't really know how to go about it) | 2024-09-10 16:34:23 -05:00
d059f6f56d | added helper script to process Emilia (amphion/Emilia-Dataset); clean up espeak phonemes for non-English transcriptions with English words (because for some reason espeak injects (en){word}(lang) markers, and it's annoying) | 2024-09-09 09:57:32 -05:00
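
A hedged sketch of stripping those language-switch markers; the exact flag format espeak emits varies, so the pattern below is an assumption:

```python
import re

# espeak-style language-switch flags look like "(en)word(ja)"; drop the
# parenthesized language tags while keeping the word itself (assumed format)
LANG_MARKER = re.compile(r"\([a-z]{2,3}(?:-[a-z]+)?\)")

def strip_lang_markers(phonemes: str) -> str:
    return LANG_MARKER.sub("", phonemes)

# strip_lang_markers("... (en)word(ja) ...") -> "... word ..."
```
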
31e8b7edb8 | tweaks and fixes for LoRA stuff | 2024-09-08 18:05:21 -05:00
fa93061b3e | more fixes; moved sampler state dict to a better place; eval works again | 2024-09-06 16:59:56 -05:00
341e19162b | fixes, again | 2024-09-06 11:41:41 -05:00
94cf81d38c | tweak | 2024-09-05 23:21:18 -05:00
54547b74d8 | experimental implementation of STT (need to actually test on a model; the test trainer seems to work) | 2024-09-05 20:43:20 -05:00
32287710a2 | moved prints to use logger; edited README (fused_attn doesn't seem stable for training) | 2024-08-29 13:27:16 -05:00
d636edd3a2 | added flash_attn LlamaAttention (including flash_attn==1.0.9) | 2024-08-18 20:51:14 -05:00
2a1794c084 | ughghghhhh | 2024-08-09 21:15:01 -05:00
c658a7b440 | make loss scaling opt-in rather than automatically determined (because it seems a DAC-based model really doesn't like loss scaling) | 2024-08-09 10:51:36 -05:00
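
One common way to make loss scaling opt-in is to gate PyTorch's AMP GradScaler behind a config flag; a sketch under that assumption (cfg.loss_scaling, model, batch, and optimizer are placeholders, not the repo's actual names):

```python
import torch

# scaler becomes a no-op passthrough when enabled=False
scaler = torch.cuda.amp.GradScaler(enabled=cfg.loss_scaling)

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(batch)

scaler.scale(loss).backward()  # only actually scales when enabled
scaler.step(optimizer)         # unscales (if needed), then steps
scaler.update()
optimizer.zero_grad()
```
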
d04f6911b4 | oops | 2024-08-08 19:38:55 -05:00
0aa59e6f3f | uncommented the block that writes the metadata on HDF5 creation | 2024-08-08 19:21:29 -05:00
79a6781c9e | fix vall_e.data --action=hdf5 actually transcribing, because past me completely forgot I had already tried to put the transcribe/process dataset scripts inside the module | 2024-08-08 07:51:42 -05:00
eac353cd0b | busy work and cleanup while I wait for 1TB of audio to quantize... again | 2024-08-06 20:23:33 -05:00
c09133d00f | added safetensors support (with metadata) and feed whatever goes through torch.load/torch.save into it | 2024-08-03 23:15:20 -05:00
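
A minimal sketch of routing checkpoints through safetensors with metadata (the wrapper names are assumptions; safetensors itself requires a flat {name: tensor} dict and string-valued metadata):

```python
from safetensors.torch import load_file, save_file

def save_checkpoint(state_dict, path, metadata=None):
    # safetensors metadata must be {str: str}
    metadata = {k: str(v) for k, v in (metadata or {}).items()}
    save_file(state_dict, path, metadata=metadata)

def load_checkpoint(path, device="cpu"):
    return load_file(path, device=device)
```
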
6a733eb2ed | changed torch.Tensor().to(device, dtype) to just torch.tensor(..., device=..., dtype=...), because it's been bothering my autism that I was creating tensors and then converting, rather than creating them with the right device/dtype in the first place; some 'optimization' to compile the model, but it doesn't seem to do anything useful | 2024-08-03 22:10:21 -05:00
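
The difference in a nutshell, as a small illustration (not the repo's actual call sites):

```python
import torch

data = [1, 2, 3]

# before: allocate with defaults, then copy/convert to the target device/dtype
x = torch.tensor(data).to(device="cuda", dtype=torch.int64)

# after: allocate directly with the right device/dtype, no intermediate tensor
y = torch.tensor(data, device="cuda", dtype=torch.int64)
```
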
97c5241bef | fixes; throw an exception when using a NAR-only model with non-unified position IDs, since for some reason it outputs garbage for the NAR | 2024-08-02 22:25:49 -05:00
4456d3172b | that's what I get for testing without HDF5 on my previous machine... | 2024-08-02 20:44:01 -05:00
ce8bb1e4f7 | sanity cleanups with weird off-by-one-ness; cleaned up and validated that vall_e.models.experimental works again | 2024-07-27 15:36:05 -05:00
682e4387dc | oops (fixed proms being erased from a config oversight) | 2024-07-25 12:39:57 -05:00
75b04686f8 | added prom-less training / inferencing, some other things | 2024-07-22 19:36:07 -05:00
491ae2a684 | some insanity for sanity checks (some phonemes from phonemizing Japanese are not in my tokenizer...) | 2024-07-22 00:30:40 -05:00
e19aa643a6 | cleaned up demo page creation; added option to pass in an RVQ level sampling distribution for training | 2024-07-21 19:12:03 -05:00
d87b492295 | added rudimentary demo page creator (currently just embeds base64 WAVs into the page; need to test not doing that) | 2024-07-19 20:49:40 -05:00
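
For reference, embedding a WAV as a base64 data URI looks roughly like this (a generic sketch, not the repo's actual demo-page code):

```python
import base64
from pathlib import Path

def embed_wav(path: str) -> str:
    # inline the audio as a data URI so the demo page needs no external files
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f'<audio controls src="data:audio/wav;base64,{b64}"></audio>'
```
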
28a674e0f1 | fixes... | 2024-07-18 23:25:32 -05:00
39f961abcd | test trainer (vall_e.models.ar_nar) tests some SpeechX features | 2024-07-18 18:46:45 -05:00
83a0954f85 | fixes for re-introducing SpeechX tasks (need to actually validate that these all do the right things) | 2024-07-18 17:16:32 -05:00
bccbb77a1a | added option to either naively concat codes to concatenate audio waveforms (prior behavior) or to decode => concat => encode instead (although this currently only happens for prom sampling when an utterance is too small) | 2024-07-18 16:48:41 -05:00
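
The two concatenation strategies, sketched with hypothetical decode/encode helpers (the real codec calls depend on whether Encodec or DAC is in use):

```python
import torch

def concat_codes_naive(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # prior behavior: stitch the quantized code sequences together directly
    return torch.cat([a, b], dim=-1)

def concat_via_audio(a, b, decode, encode):
    # decode => concat => encode: splice in waveform space so the codec
    # sees one continuous utterance (decode/encode are codec-specific)
    wav = torch.cat([decode(a), decode(b)], dim=-1)
    return encode(wav)
```
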
97e768601c | re-introducing SpeechX tasks (need to validate them all; everything works with base TTS anyway) | 2024-07-18 16:16:14 -05:00
3acc54df22 | allow loading a different model within the web UI (apparently I did not have the web UI in the documentation) | 2024-07-15 19:59:48 -05:00
312a8e3ead | add shuffle to samplers that can support it | 2024-06-30 11:36:46 -05:00
bc2a6fa756 | sanity cleanup: moved experimental features under their own thing | 2024-06-30 10:37:33 -05:00
793ccb16fb | ugh | 2024-06-29 22:14:35 -05:00
c4dd523b6f | change from chunk-slicing paths for the distributed dataloader to interleaving them instead | 2024-06-29 10:10:35 -05:00
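
The difference between the two partitioning schemes, as a generic sketch (rank and world_size come from the distributed setup):

```python
def chunk_slice(paths, rank, world_size):
    # old behavior: each rank gets one contiguous chunk of the path list
    per_rank = len(paths) // world_size
    return paths[rank * per_rank : (rank + 1) * per_rank]

def interleave(paths, rank, world_size):
    # new behavior: ranks stride through the list, so duration-sorted paths
    # stay balanced across ranks instead of rank 0 getting all the short ones
    return paths[rank::world_size]
```
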
dd40463803 | limit eval size, because the training batch size seems to be used for the eval dataloader somehow (bandaid) | 2024-06-29 09:11:28 -05:00
591d3ac848 | have the eval dataloader use the eval batch size for batchedordersampler | 2024-06-28 22:44:00 -05:00
83075c1505 | sort duration buckets to ensure that paths sorted-by-duration are actually sorted by duration (because I didn't know that Python dicts can have non-strings as keys); added batching samples based on total duration to ensure the best training throughput | 2024-06-28 22:28:54 -05:00
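
A sketch of batching by total duration rather than by a fixed sample count (names and the duration budget are illustrative):

```python
def batch_by_total_duration(paths, durations, max_seconds=64.0):
    # sort numerically (dict keys can be floats; sorting their string forms
    # would put "10.0" before "2.0"), then pack greedily up to the budget
    batches, batch, total = [], [], 0.0
    for path in sorted(paths, key=lambda p: durations[p]):
        if batch and total + durations[path] > max_seconds:
            batches.append(batch)
            batch, total = [], 0.0
        batch.append(path)
        total += durations[path]
    if batch:
        batches.append(batch)
    return batches
```
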
8fffb94964 | backport fix from tortoise_tts for the local trainer + loading state when training a LoRA | 2024-06-25 13:41:29 -05:00
19410a919e | ugh | 2024-06-15 12:29:03 -05:00
d343bde09b | residual_in_fp32=False for mamba arch backends, because it breaks the classifier (output projection / lm head / what-have-you) under AMP | 2024-06-15 12:08:03 -05:00
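
Where that flag lives, sketched against mamba_ssm's config (how the repo's backends actually wire it up is an assumption):

```python
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

config = MambaConfig(
    d_model=1024,
    n_layer=12,
    vocab_size=1024,
    residual_in_fp32=False,  # keep residuals in the compute dtype so the
                             # classifier doesn't see fp32 activations under AMP
)
model = MambaLMHeadModel(config)
```
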
31f71fa134 | sampler update (some brainworm meant there just never actually was a sampler for sample_type=path) | 2024-06-14 16:55:40 -05:00
b3b67f34ac | added option to sort paths by duration to better group equal-length sequences together (and there was maybe a logic error from creating the samplers and then interleave-reordering paths, desyncing them, maybe) | 2024-06-13 22:37:34 -05:00
cca542a4c0 | ugh | 2024-06-11 23:59:28 -05:00
65a8960305 | option to split the classifier per-level instead of sharing one (at this point I'm just scrambling to try and cope with training a DAC model; the NAR is being a pain) | 2024-06-11 22:28:59 -05:00
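
A sketch of what a per-level classifier split might look like (illustrative names, not the repo's actual modules):

```python
import torch.nn as nn

class PerLevelClassifier(nn.Module):
    """One output projection per RVQ level instead of a single shared head."""
    def __init__(self, d_model: int, vocab_size: int, n_levels: int = 8):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_levels)]
        )

    def forward(self, hidden, level: int):
        return self.heads[level](hidden)
```
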
234f9efc6e | ugh | 2024-06-09 11:39:43 -05:00
132a02c48b | sanity cleanup; back up the config YAML for each log file | 2024-06-09 11:22:52 -05:00
4ade2b60ee | ugh | 2024-06-06 21:57:11 -05:00
014e565c4b | tweaks | 2024-06-04 20:41:13 -05:00
6d5bd0156a | fixes | 2024-06-04 18:50:48 -05:00
ed3aeaf3a1 | copy-pasted from the test trainer to the actual trainer | 2024-06-04 18:40:30 -05:00
0aa01ba31a | forgot one crucial detail (you *need* the previous RVQ level to keep coherence between all RVQ levels) (experimental deinterleaved mode is a bit crusty, though) | 2024-06-04 18:30:30 -05:00
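
In the usual VALL-E-style setup, that dependency shows up as the NAR's input summing the embeddings of all previous levels; a simplified sketch (not the repo's exact AudioEmbedding):

```python
import torch

def nar_input(code_embs: list[torch.Tensor], level: int) -> torch.Tensor:
    # code_embs: one [T, d_model] embedding per RVQ level; the NAR
    # predicting level k conditions on the sum of levels 0..k-1
    return torch.stack(code_embs[:level], dim=0).sum(dim=0)
```
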
406ff7bbe1 | re-implemented config.model.interleave for the HF-compat experimental method | 2024-06-04 14:19:52 -05:00
c93d5863fd | fixes | 2024-06-04 00:07:00 -05:00
934672252b | feverish cleanup | 2024-06-03 21:28:49 -05:00
8cf176ab46 | ugh | 2024-06-01 10:46:42 -05:00
d0ebce6bac | ugh | 2024-06-01 10:30:13 -05:00
74df2f5332 | split the sampler dict by global_rank, and also handle splitting dataset paths by global_rank if sampler_type == path (because I do not trust DistributedSampler) (need to test) | 2024-06-01 09:29:49 -05:00
ddbacde0d1 | DAC just doesn't work well enough... | 2024-05-25 11:07:52 -05:00
e3ef89f5aa | 100x better for subtrain/eval to be split by group instead | 2024-05-19 16:40:14 -05:00
4bc7e5a6d1 | fix loading without needing an HDF5 dataset already prepped (and some other incidental speedups during dataloader prep) | 2024-05-18 07:14:26 -05:00
d88a5ca183 | ugh | 2024-05-16 07:25:33 -05:00
d9aabfa3ae | final tweaks, hopefully, again | 2024-05-15 23:04:19 -05:00
2437a86efa | ugh | 2024-05-12 13:02:15 -05:00
4f1593c8db | a bunch of shit to salvage my old Encodec-quantized audio, because DAC-encoded audio just does not want to converge | 2024-05-12 10:17:29 -05:00
14709ac67f | ughh | 2024-05-12 07:30:59 -05:00
3774fcbdee | ugh | 2024-05-11 22:58:38 -05:00
4d93a16ef7 | might just be better to explicitly define prompt duration ranges, especially under a "train small contexts, then increase" training paradigm | 2024-05-11 09:50:54 -05:00
0d5d545a40 | crammed in DAdaptation (doesn't seem worth it) and ScheduleFree (forgot I wanted to try it weeks ago; seems promising), optimization wrapper cleanup, test trainer changes, etc. | 2024-05-09 20:28:20 -05:00
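
ScheduleFree's one operational quirk is that the optimizer must be toggled between train and eval modes; a sketch assuming the schedulefree package (model is a placeholder):

```python
import schedulefree

# AdamWScheduleFree replaces the LR schedule entirely; the explicit mode
# switches are required around training and evaluation/checkpointing
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

optimizer.train()  # before training steps
# ... forward / backward / optimizer.step() ...
optimizer.eval()   # before evaluating or saving weights
```
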
c6e0f905b5 | final tweaks (again) before training restarts | 2024-05-08 02:11:38 -05:00
33b7f81b94 | small cleanups | 2024-05-04 22:37:22 -05:00
ffa200eec7 | added option to specify frames per second for the given audio representation (Encodec is 75Hz; DAC is 41Hz at 24kHz sources) | 2024-05-04 12:05:41 -05:00
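
Why the frame rate matters: it converts between seconds of audio and token-sequence length. A quick illustration using the values above:

```python
FPS = {"encodec": 75, "dac": 41}  # frames per second per codec

def frames_for(seconds: float, codec: str) -> int:
    return int(seconds * FPS[codec])

# a 3-second prompt is 225 Encodec frames but only 123 DAC frames
assert frames_for(3, "encodec") == 225
assert frames_for(3, "dac") == 123
```
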
b5d1456a09 | backwards compat for my shitty old weights (was testing whether disabling AudioEmbedding summing magically made things better (it did not)) | 2024-04-29 22:14:01 -05:00
6a11bc9cb6 | update tokenizer because, for some reason, it had the wrong order for the special tokens, to where eos = unk | 2024-04-29 09:09:26 -05:00
57810e4ba4 | metadata-only path (might drop HDF5, since it's giving file sizes twice as large as my actual unpacked dataset) | 2024-04-28 23:03:09 -05:00
caad7ee3c9 | final tweaks, hopefully | 2024-04-28 22:28:29 -05:00
ffc334cf58 | added dataset transcription helper script (now I don't ever have to touch ai-voice-cloning) (to-do: unify scripts into the module) | 2024-04-21 17:43:20 -05:00
071fb97777 | dataset preparation script updates; caved and am using the HF tokenizer now | 2024-04-21 14:49:18 -05:00
8214aa23d7 | converting over to a different intermediary dataset format | 2024-04-18 21:24:06 -05:00
4f5c9e518a | actually use the passed-through sample rate from encode for DAC, because it does its own resampling, I guess | 2024-04-18 13:32:41 -05:00
545162195b | deprecate the sole AR/NAR models by only keeping the AR+NAR (the beauty of no one using this is that I can break compat as much as I want); add a tone token for when I classify my dataset with tone/emotion in the future; some other things | 2024-04-15 19:54:32 -05:00
9c198eb75a | added torchscale XMOE integration (because Mixtral 8x7B seems very promising and I want to see if it works) | 2023-12-20 18:45:58 -06:00
0aa2a3cc07 | evaluation/validation passes the language ID during training (oops) | 2023-10-29 12:00:40 -05:00
9a6040383e | make validation samplers ignore the sampler type | 2023-10-22 09:01:47 -05:00
3195026dba | fixed issue with the 'add another target audio to artificially create longer sequences' path for HDF5 just duplicating the utterance initially sampled | 2023-10-18 20:38:33 -05:00
09cda7d3f9 | added sampling by speaker group name (might be better to de-emphasize the LibriVox/Audiobooks that are in large numbers and emphasize the smaller pools); log cleanup | 2023-10-16 19:30:38 -05:00
65f500083d | tweaks to try and get deepspeed quantized inferencing, validating bitsandbytes and deepspeed quantization; nothing seems to work | 2023-10-12 22:21:43 -05:00
8740cdefc6 | added initial support for languages (still testing; marked as model version 3); added experimental 'context extend by limiting the resp context' (untested) | 2023-10-11 20:38:40 -05:00
6045cbce94 | added experimental option to append utterances for the training target (emphasis on experimental) | 2023-10-11 17:32:45 -05:00
b4405c98ea | remove double spaces in the text phonemes (might have caused problems...) | 2023-10-10 19:18:24 -05:00
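
That fix is presumably a one-liner along these lines (assumed, since the patch itself isn't shown here):

```python
import re

def collapse_spaces(phonemes: str) -> str:
    # collapse runs of whitespace into a single space
    return re.sub(r"\s{2,}", " ", phonemes).strip()
```
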
87db03dd93 | trim the input prompt to 3 seconds when training NAR tasks (marked as experimental; the paper mentions doing so, but I don't know how much this would harm the retention heads) | 2023-10-09 22:03:58 -05:00
893a610fad | cleanup; use the deepspeed inferencing pathway if requested | 2023-10-09 15:24:04 -05:00
27483e56f0 | disabled preparing of SpeechX tasks; added dynamic temperature testing (to-do: test it; credited in the function) | 2023-10-09 13:01:40 -05:00
82f02ae9b1 | oops | 2023-10-06 09:26:52 -05:00
d12877ee09 | added option to set the probability of selecting the AR during training under a monolithic AR+NAR; added some more to-dos while I have them in mind | 2023-10-02 16:52:42 -05:00
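
A sketch of how such a probability might gate AR vs. NAR training steps in a monolithic model (p_ar is a hypothetical config knob):

```python
import random

def sample_training_level(p_ar: float = 0.25, n_levels: int = 8) -> int:
    # with probability p_ar, train the AR (RVQ level 0);
    # otherwise train the NAR on a random level in 1..n_levels-1
    if random.random() < p_ar:
        return 0
    return random.randint(1, n_levels - 1)
```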