touch ups in docs
commit 84a05acb6d (parent dcaf38b359)
@@ -42,10 +42,12 @@ However, at this point and time, the implementation is rather divorced from VALL
 * [ ] audio streaming
   - this *technically* can work without any additional architecture changes, just clever tricks with sampling-then-decoding-to-audio.
   - something similar to HiFiGAN (or the one for TorToiSe) trained on the last hidden states of the AR *might* also enable an alternate way for streaming.
-* [ ] speed up inferencing
+* [ ] speed up inferencing for the AR
   - KV caching both yields broken output and quadratically slow output, unless I'm doing something grossly wrong.
-  - A pure HF model is the only way to fix this, but converting the model to one is a bit of a chore.
 * [x] provide a pure NAR model that foregoes most of the inferencing slowdowns a regular AR+NAR model will provide.
+* [ ] HF-ify the model
+  - this might be easily possible by subjugating the tokenizer to handle all the embeddings / classifiers
+  - this will pave the way to use the model under an easy marriage of `llama.cpp` and `encodec.cpp`
 * [ ] replace the phonemizer with something that doesn't depend on espeak
 * [ ] train the model to handle text => phoneme (without a hit to the rest of the model)
 * [ ] ...and phonemes => text
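On the audio-streaming item above: since it hinges on sampling-then-decoding tricks rather than architecture changes, here is a minimal sketch of one way that could look. It assumes hypothetical helpers `sample_next_frame` (one AR step returning a single EnCodec frame, or `None` at the stop token) and `decode_codes_to_audio` (EnCodec decode of a codes tensor), plus 24 kHz EnCodec's 320 samples per frame; none of these names exist in this repo.

```python
import torch

SAMPLES_PER_FRAME = 320  # assumption: 24 kHz EnCodec, 75 frames/sec, 320 samples per frame

def stream_tts(sample_next_frame, decode_codes_to_audio,
               chunk_frames=75, overlap_frames=5, max_frames=1500):
    """Yield audio chunks as codes are sampled, instead of decoding the whole utterance at the end."""
    codes = []    # every sampled EnCodec frame (all codebooks) so far
    emitted = 0   # number of frames already decoded and yielded

    for _ in range(max_frames):
        frame = sample_next_frame(codes)   # hypothetical: one AR sampling step
        if frame is None:                  # stop token reached
            break
        codes.append(frame)

        # once a chunk of new frames exists, decode it, re-decoding a small overlap
        # into already-emitted audio to soften chunk-boundary artifacts
        if len(codes) - emitted >= chunk_frames:
            start = max(0, emitted - overlap_frames)
            audio = decode_codes_to_audio(torch.stack(codes[start:], dim=-1))
            yield audio[..., (emitted - start) * SAMPLES_PER_FRAME:]
            emitted = len(codes)

    if emitted < len(codes):               # flush whatever remains after the stop token
        start = max(0, emitted - overlap_frames)
        audio = decode_codes_to_audio(torch.stack(codes[start:], dim=-1))
        yield audio[..., (emitted - start) * SAMPLES_PER_FRAME:]
```

The overlap here only hides seams; a crossfade over the overlapping samples would be the natural refinement.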
@@ -62,9 +64,26 @@ However, at this point and time, the implementation is rather divorced from VALL
 * mixing multiple speakers through summing input prompt embeddings
   * I do not expect this to work, but you never know...
 
+## "Postmortem"
+
+For the most part, the model is complete. With the `NAR-len` being crammed on, I'm satisfied with the performance-to-quality tradeoff.
+
+However, while this solution boasts being lightweight, there are some caveats for its given size:
+* it's at capacity on what it *can* do without additional tasks to augment it further
+  * post-fixing it with additional layers glued on doesn't seem to offer very much improvement (12 => 16 layers)
+* wrangling it is a bit of a chore, as some voices work fine under the `AR` but not the `NAR-len`, and vice-versa
+  * some voices outright refuse to work without LoRA training
+  * some sampler settings work on some voices, but others need some tweaking
+* for short durations it excels, but despite training on longer durations, stability is less guaranteed
+* subjugating an existing LLM architecture is a bit of a pain, as I would *love* to make full use of LLaMA niceties
+  * `hf`-ifying it is possible, but it'd be a chore to set up the tokenizer properly
+* it still seems like the phase of the moon matters with how it wants to cooperate
+  * some eval tests seem fine; other times, issues like word errors will crop up
+
+
 ## Notices and Citations
 
-Unless otherwise credited/noted in this repo or within the designated Python file, this repository is [licensed](LICENSE) under AGPLv3.
+Unless otherwise credited/noted in this repo or within the designated Python file, this repository is [licensed](/LICENSE) under AGPLv3.
 
 - [EnCodec](https://github.com/facebookresearch/encodec) is licensed under CC-BY-NC 4.0. If you use the code to generate audio quantization or perform decoding, it is important to adhere to the terms of their license.
 
@@ -216,6 +216,15 @@ def main():
 
        comparison_kwargs["disabled"]["amp"] = current_amp
        comparison_kwargs["enabled"]["amp"] = other_amp
+   elif args.comparison == "modality":
+       comparison_kwargs["suffix"] = "modality"
+       comparison_kwargs["titles"] = [f"AR+NAR", f"NAR-len"]
+
+       comparison_kwargs["disabled"]["modality"] = "ar+nar"
+       comparison_kwargs["disabled"]["cfg_strength"] = 0.0
+
+       comparison_kwargs["enabled"]["modality"] = "nar-len"
+       comparison_kwargs["enabled"]["cfg_strength"] = 3.0
    elif args.comparison == "cfg-strength":
        current_cfg_strength = 3.0
        other_cfg_strength = 0.0
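On the new "modality" comparison branch above: each comparison builds a "disabled" and an "enabled" set of overrides plus a pair of titles. A rough sketch of how such a pair could be consumed to render both sides of a comparison; the `generate` callable and the kwargs merging are illustrative assumptions, not the demo script's actual plumbing.

```python
def run_comparison(generate, base_kwargs, comparison_kwargs):
    """Run the same request twice, once per comparison side, and label each output."""
    results = {}
    for side, title in zip(("disabled", "enabled"), comparison_kwargs["titles"]):
        kwargs = dict(base_kwargs)                 # shared settings
        kwargs.update(comparison_kwargs[side])     # per-side overrides, e.g. modality / cfg_strength
        results[title] = generate(**kwargs)        # e.g. "AR+NAR" vs "NAR-len"
    return results
```

Keeping both sides as plain dict overrides means any sampler or model flag can be A/B'd the same way `cfg_strength` and `modality` are here.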
@@ -7,8 +7,13 @@ from transformers.models.mamba2.modeling_mamba2 import Mamba2Model
 from transformers.models.mamba2.configuration_mamba2 import Mamba2Config
 """
 
+"""
 from mamba2_torch.modeling.configuration_mamba2 import Mamba2Config
 from mamba2_torch.modeling.modeling_mamba2 import Mamba2Model
+"""
 
+from fla.models.mamba2.configuration_mamba2 import Mamba2Config
+from fla.models.mamba2.modeling_mamba2 import Mamba2Model
+
 """
 # https://github.com/state-spaces/mamba
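The hunk above keeps three interchangeable sources for `Mamba2Config` / `Mamba2Model` (stock HF `transformers`, the `mamba2_torch` port, and `fla`), toggled by commenting blocks out with triple-quoted strings. Purely as an illustration of the same idea, not how this file actually selects a backend, the choice could instead fall back on whichever package is installed:

```python
try:  # flash-linear-attention implementation, if available
    from fla.models.mamba2.configuration_mamba2 import Mamba2Config
    from fla.models.mamba2.modeling_mamba2 import Mamba2Model
except ImportError:
    try:  # standalone mamba2_torch port
        from mamba2_torch.modeling.configuration_mamba2 import Mamba2Config
        from mamba2_torch.modeling.modeling_mamba2 import Mamba2Model
    except ImportError:  # stock HF transformers implementation
        from transformers.models.mamba2.configuration_mamba2 import Mamba2Config
        from transformers.models.mamba2.modeling_mamba2 import Mamba2Model
```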
@@ -851,8 +851,8 @@ class Base(nn.Module):
            aux_loss = torch.sum(torch.stack([ t for t in _["l_aux"] if t is not None])) * 0.001
        elif self.arch_type in ["mamba","mamba2"]:
            kwargs = dict(
-               #attention_mask=m,
                inputs_embeds=x,
+               attention_mask=m,
                #cache_params=state,
                use_cache=False, # not self.training,
                #position_ids=position_ids,
@@ -864,8 +864,6 @@
            output = self.model(**kwargs)
            x = output["last_hidden_state"]
 
-           # to-do: figure out why KV caching doesn't work
-           #if not self.training:
            if state is not None:
                state = output["cache_params"]
 
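The mamba branch above feeds the summed embeddings straight into the backbone and reads the hidden states and SSM cache back out. Below is a small, self-contained sketch of that call pattern against the HF `transformers` `Mamba2Model` named in the imports; the config dimensions are placeholders, not this model's real sizes.

```python
import torch
from transformers.models.mamba2.configuration_mamba2 import Mamba2Config
from transformers.models.mamba2.modeling_mamba2 import Mamba2Model

# placeholder sizes, chosen so num_heads * head_dim == expand * hidden_size
config = Mamba2Config(hidden_size=256, num_hidden_layers=2, expand=2, head_dim=64, num_heads=8)
backbone = Mamba2Model(config)

batch, seq_len = 2, 16
x = torch.randn(batch, seq_len, config.hidden_size)   # stand-in for the summed input embeddings
m = torch.ones(batch, seq_len, dtype=torch.long)      # padding mask: 1 = keep, 0 = pad

output = backbone(
    inputs_embeds=x,    # bypass the backbone's own token embedding, as the branch above does
    attention_mask=m,
    use_cache=True,     # request cache_params so decoding could resume from this state
    return_dict=True,
)
hidden = output.last_hidden_state   # (batch, seq_len, hidden_size), fed onward to the classifier heads
state = output.cache_params         # conv + SSM state, i.e. what the `state` variable above captures
```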