You were right: at around loss 3.0 I'm getting human-like sounds (and this is just on 30 hours of audio...). I was able to add some lines to emit the metrics separately. It looks like the AR loss is…
Another thing that would be fairly useful for the ar+nar class:
Right now, you can only see the combined loss and accuracy. One thing that may be useful to adjust over time is the p_ar_level…
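For what it's worth, here's a minimal sketch of the kind of thing I mean, assuming the trainer can pick the AR vs. NAR task per step and log each loss under its own tag (forward_ar, forward_nar, and p_ar_level are hypothetical names here, not the repo's actual attributes):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

# Hypothetical sketch: log AR and NAR metrics separately instead of only
# the combined values, and anneal the chance of sampling the AR task.
def training_step(model, batch, step, total_steps, writer: SummaryWriter):
    # Linearly anneal the (hypothetical) probability of training the AR
    # level from 0.9 down to 0.5 over the run.
    p_ar_level = 0.9 - 0.4 * (step / total_steps)

    if torch.rand(1).item() < p_ar_level:
        loss = model.forward_ar(batch)   # hypothetical AR-only pass
        writer.add_scalar("loss/ar", loss.item(), step)
    else:
        loss = model.forward_nar(batch)  # hypothetical NAR-only pass
        writer.add_scalar("loss/nar", loss.item(), step)

    # Keep the combined curve too, for comparison with existing logs.
    writer.add_scalar("loss/combined", loss.item(), step)
    return loss
```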
Cool, that's useful for debugging purposes anyway. I do see in some of your earlier posts how quality versus loss/acc can sometimes be inconsistent.
Another question: I'm using the…
Although it's kind of hard to say exactly when these milestones occurred. I'll have to assume an average sample would be 64 text tokens + 75 * 6 audio tokens = 514 tokens per sample (75 EnCodec frames per second times the 6 RVQ levels),…
So, I'm trying to overfit on just 3 speakers to ensure I have things set up correctly. I'd like to query exactly the same data from the training set to verify everything is going fine.
Right…
Another question: how are you plotting your loss curves, etc.? I was going to write some code for it, but it looks like you were producing them somehow. Maybe I missed it in the repo.
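In case it's useful, here's roughly what I was going to write; it assumes a plain-text log where each line contains something like it=1234 loss=3.01 (that format is my guess, not necessarily what the trainer actually emits):

```python
import re
import matplotlib.pyplot as plt

# Hypothetical sketch: scrape (step, loss) pairs out of a training log
# and plot them. The "it=... loss=..." line format is an assumption.
pattern = re.compile(r"it=(\d+).*?loss=([\d.]+)")

steps, losses = [], []
with open("training.log") as f:  # example path
    for line in f:
        m = pattern.search(line)
        if m:
            steps.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(steps, losses)
plt.xlabel("iteration")
plt.ylabel("loss")
plt.savefig("loss_curve.png")
```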
Playing around with EnCodec encoding + Vocos decoding. As good as Vocos is, it still gives some minor audio artifacts for higher-pitched voices. This puts an upper bound on the quality of the model,…
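For reference, this is roughly the round trip I'm testing, using the published encodec and vocos APIs (the input path and the 6 kbps bandwidth are just my test settings):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from vocos import Vocos

# Encode with EnCodec at 24 kHz, 6 kbps (8 RVQ codebooks).
encodec = EncodecModel.encodec_model_24khz()
encodec.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("sample.wav")  # example input
wav = convert_audio(wav, sr, encodec.sample_rate, encodec.channels)

with torch.no_grad():
    frames = encodec.encode(wav.unsqueeze(0))
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # (1, n_q, T)

# Decode the same codes with Vocos instead of EnCodec's own decoder.
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
features = vocos.codes_to_features(codes.squeeze(0))
audio = vocos.decode(features, bandwidth_id=torch.tensor([2]))  # index 2 -> 6 kbps

torchaudio.save("reconstructed.wav", audio, 24000)
```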
It looks like the original VALL-E model used ~140B parameters.
Where'd you get that number from? The papers (VALL-E, VALL-E X, SpeechX) don't mention a parameter count anywhere.
…
It looks like the original VALL-E model used ~140B parameters. That can't fit into a 4070, can it? So are you using a smaller model size? Does size: "full" correspond to the original paper's model…
Thanks, I'll look into that.
And what about model size? How do you control that currently? I didn't see any params for it in config.yaml.
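(In case anyone else is looking: I'd expect it to live under the model definition in the YAML, something like the sketch below, but the exact keys are a guess on my part; check the repo's example config rather than trusting this.)

```yaml
# Hypothetical sketch of a model entry; key names other than size are
# assumptions, not necessarily what this repo's config.yaml uses.
models:
  - name: "ar+nar"
    size: "full"   # presumably maps to a preset width/depth
```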
I'm looking to make use of multiple GPUs, but for all the scripts in the repo, it looks like my PyTorch DataParallel settings, etc. are being overridden by whatever DeepSpeed is setting up. Struggling…
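From what I understand of DeepSpeed in general (a generic sketch, not this repo's training code), it owns the distributed setup itself, so DataParallel doesn't apply; the model gets wrapped via deepspeed.initialize and you launch through its launcher instead:

```python
import deepspeed
import torch.nn as nn

# Generic sketch: DeepSpeed replaces DataParallel/DistributedDataParallel.
# It wraps the model itself and places it on the GPU for each local rank.
model = nn.Linear(1024, 1024)  # stand-in for the actual model

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # your DeepSpeed JSON config
)

# Launched with, e.g.:  deepspeed --num_gpus=2 train.py
```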
Streaming is very valuable but yeah it is surprisingly tough for most things.
Looks like you’re moving forward with RetNet, right? Why is that, when the “vanilla” (no recurrent steps)…
For sure, having an already-prepared dataset is very helpful. I had tried the dataset preparation script from the readme, but there were errors unpickling the audio files that I…
@mrq Appreciate the response, and I totally get it. Thanks for letting me know, and good luck with all the work you’re doing here.
Hey @mrq , I sent you an email to mrq@ecker.tech reaching out about some things. Let me know if you’ve seen it and are able to respond there, thanks!
Does the RetNet approach seem better / more data-efficient, or is it better to use the original vall-e implementation?
Also, I am using the phonemizer, but it keeps coming up with None values…
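For context, this is roughly how I'm calling it, using the phonemizer package's documented API (the sample strings and the empty-result check are just my own debugging):

```python
from phonemizer import phonemize

texts = ["Hello world.", "This is a test."]  # sample inputs

# espeak backend; strip=True drops trailing separators.
phones = phonemize(texts, language="en-us", backend="espeak", strip=True)

# My check for the None/empty results I'm seeing.
for text, ph in zip(texts, phones):
    if not ph:
        print(f"empty phonemization for: {text!r}")
    else:
        print(text, "->", ph)
```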
This repo's web UI handles it fine with the Train > Prepare Dataset tab (or whatever I ended up calling it again). It'll handle the entire stack from transcribing with Whisper (or preferably,…
Trying to get proper transcriptions right now for this repo.
I just used the openai-whisper package with the "tiny" model. Do you think that's sufficient? I see you're using whisperX…
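Concretely, this is all I'm doing right now with the openai-whisper package (the file path is an example; "tiny" is the smallest/fastest checkpoint, and "base"/"small"/"medium"/"large" trade speed for accuracy):

```python
import whisper

# Load the smallest checkpoint; swap "tiny" for a larger model if the
# transcription quality isn't good enough for dataset prep.
model = whisper.load_model("tiny")

result = model.transcribe("clips/sample.wav")  # example path
print(result["text"])

# Per-segment timestamps, useful for slicing audio into dataset clips.
for seg in result["segments"]:
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f}: {seg["text"]}')
```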
Just wanted to say, I love what you're doing and your detailed updates. I wish I could do something similar, but my day job gets in the way. How are you able to juggle this with work…