'borrowed' a sampling scheduler for NAR-len's RVQ level 0 (better than before, but still not good enough)
This commit is contained in:
parent
e108c54daf
commit
d17f0ebc7c
@@ -41,7 +41,6 @@ One problem exhibited from a NAR is producing artifacts ("crust") in the final w
* `token_dropout_error`: This will randomly nudge a small percentage of tokens from the prior RVQ level to simulate wrong tokens being predicted.
* `token_dropout_rate`: This will randomly mask off tokens from the prior RVQ level with a mask token, to try and have the model not-strongly-rely on the given input.
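
As a rough illustration, a minimal sketch of how these two knobs could act on the prior level's tokens; the helper name, rates, and nudge range here are assumptions for demonstration, not the actual implementation:

```python
import torch

def corrupt_prior_level( tokens, mask_token, dropout_error=0.05, dropout_rate=0.05 ):
    tokens = tokens.clone()
    # token_dropout_error: nudge a small percentage of tokens to simulate wrong predictions
    errors = torch.rand_like( tokens, dtype=torch.float ) < dropout_error
    nudge = torch.randint_like( tokens, -1, 2 ) # -1, 0, or +1
    tokens[errors] = ( tokens[errors] + nudge[errors] ).clamp( min=0 )
    # token_dropout_rate: mask off a small percentage of tokens with the mask token
    dropped = torch.rand_like( tokens, dtype=torch.float ) < dropout_rate
    tokens[dropped] = mask_token
    return tokens
```
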
### Pure NAR
The pure NAR (`nar-len`) model is a model type that infers audio tokens purely non-autoregressively. Despite being called a pure NAR, duration is still inferred by autoregressively decoding for its length (as the AR+NAR model shows that you can mix both types).
@@ -50,10 +49,13 @@ However, having a pure NAR is challenging, as you need to both explicitly provid
* The former problem is easily "solved" by training a `len` inferencing task, where the given input predicts the requested duration for a given utterance autoregressively.
* The latter however proves to be challenging, as generating tokens from nothing in one step is not possible.
* diffusion solves this, but requires additional steps at best and a separate model at worst, just for one RVQ level.
* however, it's possible to have a paradigm similar to diffusion, where instead of iterating upon random noise, masked tokens are iterated upon per step, and each step keeps only the most confident tokens (see the sketch below).
* incidentally, [this paper](https://arxiv.org/abs/2406.05478) demonstrates this in the use of a NAR transformer for image generation
* the normal NAR (RVQ level 1+) does not face this problem, as it's already given a sufficient initial sequence of tokens to work with, and thus only requires one step.
The implemented solution follows a similar paradigm to diffusion, but with masking instead of noise.
* incidentally, [this paper](https://arxiv.org/abs/2406.05478) demonstrates this in the use of a NAR transformer for image generation
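
A minimal sketch of that mask-and-remask loop; the function names, signature, and schedule here are assumptions for illustration (the borrowed `SampleScheduler` further down in this commit is the actual attempt):

```python
import math
import torch

def demask( forward, seq_len, mask_token, steps=25, device="cuda" ):
    # start from a fully-masked sequence instead of random noise
    ids = torch.full( (seq_len,), mask_token, dtype=torch.long, device=device )
    for step in range( steps ):
        logits = forward( ids ) # assumed to return (seq_len, n_tokens) logits
        confidence, sampled = logits.softmax( dim=-1 ).max( dim=-1 )
        # only still-masked positions may change; decided tokens are kept
        masked = ids == mask_token
        ids = torch.where( masked, sampled, ids )
        confidence = confidence.masked_fill( ~masked, float("inf") )
        # cosine schedule: fewer and fewer tokens get re-masked each step
        n_remask = int( seq_len * math.cos( math.pi / 2 * (step + 1) / steps ) )
        if n_remask <= 0:
            break
        # re-mask the least confident positions to retry them next step
        ids[ confidence.topk( n_remask, largest=False ).indices ] = mask_token
    return ids
```
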
To-do: fill out this more when it works.
## Embeddings
The "magic" of subjugating a transformer for audio use lies within the ensemble of the embeddings. This is necessary as each piece of a sequence is fundamentally different, but a HF-compatible model can geta way with treating each sequence as separate ranges within a total token sequence.
@@ -99,7 +101,8 @@ However, the `resp` requires some extra care, as the model needs to both causally
* The first embedding level pertains to RVQ level 0 for the AR.
* The remaining embedding levels map to RVQ level 0 + n for the NAR.
* In other words, embedding level 1 => RVQ level 0, embedding level 2 => RVQ level 1, etc...
* I believe this is because the model needs to "know" whether to predict the next token in the sequence, or the token in the same position of the next RVQ level.
* I believe this is because the model needs to "know" whether to predict ~~the next token in the sequence, or the token in the same position of the next RVQ level~~ which tokens of a given embedding.
* In other words, the AR's RVQ level 0 embedding predicts itself, while the NAR's embeddings predict the next level's embeddings.
* Unfortunately, providing a token for the current/target RVQ level within the input sequence doesn't seem to help? I don't remember if I experimented with this or not, but testing of a "sane" `resp` embedding proved to be unfruitful.
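
A sketch of the mapping described above (the names and sizes are illustrative, not the actual fields):

```python
import torch.nn as nn

n_embedding_levels = 8
resp_embs = nn.ModuleList([ nn.Embedding( 1025, 1024 ) for _ in range( n_embedding_levels ) ])

def embed_resp( tokens, rvq_level, is_ar ):
    # embedding level 0 => RVQ level 0 for the AR (predicts its own next token)
    # embedding level n + 1 => RVQ level n for the NAR (predicts level n + 1's tokens)
    emb_level = 0 if is_ar else rvq_level + 1
    return resp_embs[emb_level]( tokens )
```
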
The `prom` and `resp` are split since, in theory, it helps the model know better what audio to source from, and what audio is part of the output sequence. In theory.
@@ -391,7 +391,8 @@ class AR_NAR(Base):
if sampled.entropy:
    metrics.append( sampled.entropy )
elif sampled.scores:
    metrics.append( [ { "p": p[0], "exited_layer": output.exited_layer } for p in sampled.scores ] )
    #metrics.append( [ { "p": p[0], "exited_layer": output.exited_layer } for p in sampled.scores ] )
    metrics.append( [ { "p": p[0] } for p in sampled.scores ] )

if mirostat is not None:
    mirostat = sampled.scores
@@ -1317,9 +1317,12 @@ class Base(nn.Module):
task_type = "tts"
|
||||
|
||||
dropout_mask = None
|
||||
#
|
||||
"""
|
||||
for name, input in batch:
|
||||
if name == "dropout_mask":
|
||||
dropout_mask = input
|
||||
"""
|
||||
|
||||
for name, input in batch:
|
||||
if name == "task":
@@ -1778,9 +1781,15 @@ class Base(nn.Module):
res = [ Categorical(logits=logit).sample() for logit in logits ]

# calculate token probabilities
scores = [
    [ F.softmax(logit[-1, :], dim=0)[token].item() for token in tokens ]
    for logit, tokens in zip(logits, res)
]
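# with the "len" capability (the pure NAR), every position is sampled in
# parallel, so each token is scored at its own position; otherwise only the
# final position's distribution was sampled from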
if "len" in self.capabilities:
    scores = [
        [ F.softmax(logit[i, :], dim=0)[token].item() for i, token in enumerate(tokens) ]
        for logit, tokens in zip(logits, res)
    ]
else:
    scores = [
        [ F.softmax(logit[-1, :], dim=0)[token].item() for token in tokens ]
        for logit, tokens in zip(logits, res)
    ]

return Sampled(res, scores, entropy)
@@ -6,21 +6,22 @@ It *does* have to inference the initial length in an autoregressive-ish manner
Initial experiments show this only really "works" for a few brief seconds before going to silence. I imagine I need to read more papers or just need to train longer.
"""

from .base import Base, list_to_tensor, Categorical
from ..config import cfg

import torch
from torch.nn.utils.rnn import pad_sequence

import random
import math
import numpy as np
import logging
import torch
from torch.nn.utils.rnn import pad_sequence

from einops import rearrange
from torch import Tensor
from tqdm import trange

from .base import Base, list_to_tensor, Categorical, _dropout_mask
from ..config import cfg
from ..emb.qnt import trim, repeat_extend_audio

import logging
from ..samplers import SampleScheduler

def clamp(n, lo, hi):
    return max(lo, min(n, hi))

@@ -211,23 +212,91 @@ class NAR(Base):
|
    if len_list is not None:
        # is NAR
        sampling_layer_skip_variables = {} if sampling_layer_skip else None

        if max_levels == 0:
            max_levels = self.n_resp_levels

        # fill with mock tokens
        #prev_list = [ torch.tensor([ self.stop_token for _ in range(resp_len) ], device=device, dtype=torch.int16) for resp_len in len_list ]
        #prev_list = [ repeat_extend_audio( prom, resp_len ) for resp_len, prom in zip(len_list, proms_list) ]
        #prev_list = [ None for resp_len in len_list ] # this breaks the position ID calc

        max_levels = self.n_max_levels - 1

        if sampling_layer_skip:
            if sampling_layer_skip_entropy_threshold >= 0:
                sampling_layer_skip_variables["entropy_threshold"] = sampling_layer_skip_entropy_threshold
            if sampling_layer_skip_varentropy_threshold >= 0:
                sampling_layer_skip_variables["varentropy_threshold"] = sampling_layer_skip_varentropy_threshold
            if sampling_layer_skip_exit_layer >= 0:
                sampling_layer_skip_variables["max_layer"] = sampling_layer_skip_exit_layer

        # initial condition
        len_list = [ min(l, 500) for l in len_list ]
        metrics = []

        mask_token = torch.tensor([self.stop_token], dtype=torch.int16, device=device)
        prev_list = [ torch.concat([ mask_token for _ in range( resp_len ) ]) for resp_len in len_list ]

        # to-do: special "scheduling" to inference RVQ-level 0
        # special "scheduling" to inference RVQ-level 0
        level = 0
        if cfg.lora is not None:
            enable_lora( self, cfg.lora.active_level( level ) if use_lora is None else use_lora )

        # to-do: figure out why this fails when I copy some things from ar_nar
        for n in trange( max_levels, desc="NAR", disable=disable_tqdm ):
            level = 0 if n == 0 else prev_list[0].shape[-1]
        _super = super()
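        # adapter for the borrowed scheduler: wraps its (1, seq_len) id tensor into
        # the model's list-of-tensors interface, runs one forward pass at this level,
        # and returns the trailing seq_len logits plus the sampled ids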
        def forward_lambda( ids, step, temperature ):
            quant_levels = [ level for _ in range(batch_size) ]
            prev_list = [ ids[0] ]
            seq_len = ids.shape[-1]

            inputs = _super.inputs(
                text_list=text_list,
                proms_list=proms_list,
                resps_list=prev_list,
                lang_list=lang_list,
                tone_list=tone_list,
                quant_levels=quant_levels,
            )

            output = _super.forward(
                inputs=inputs,
                quant_levels=quant_levels,

                layer_skip_variables=sampling_layer_skip_variables,
            )
            logits = output.logits

            sampled = _super.sample(
                logits=logits,
                prev_list=prev_list,
                quant_levels=quant_levels,

                temperature=temperature,
                min_temperature=sampling_min_temperature,
                top_p=sampling_top_p,
                top_k=sampling_top_k,
                min_p=sampling_min_p,
                repetition_penalty=sampling_repetition_penalty,
                repetition_penalty_decay=sampling_repetition_penalty_decay,
                length_penalty=sampling_length_penalty,
                #beam_width=sampling_beam_width,
                #mirostat=mirostat,
            )

            ids = sampled[0]

            return logits[0][-seq_len:].unsqueeze(0), ids[0].unsqueeze(0)

        scheduler = SampleScheduler(
            device=device,
            mask_token=self.stop_token,
            max_steps=5,
            forward_lambda=forward_lambda,
            sampling_temperature=sampling_temperature,
        )
        prev_list = [ scheduler.sample( seq_len=len_list[0] ) ]
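        # the scheduler yields a 1D tensor of RVQ level 0 codes; the remaining
        # levels are then inferred with the usual one-step-per-level NAR below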
        # expand if given a raw 1D tensor
        for i, resp in enumerate(prev_list):
            if resp.dim() == 1:
                prev_list[i] = resp.unsqueeze(-1)

        for n in trange( max_levels, desc="NAR", disable=disable_tqdm ):
            level = prev_list[0].shape[-1]
            if level >= max_levels + 1: # min(max_levels + 1, self.n_resp_levels): # commented out to experiment with exceeding trained levels
                break

@@ -249,7 +318,7 @@ class NAR(Base):
                inputs=inputs,
                quant_levels=quant_levels,

                # layer_skip_variables=sampling_layer_skip_variables,
                layer_skip_variables=sampling_layer_skip_variables,
            )
            logits, state = output.logits, output.state

@@ -258,24 +327,20 @@ class NAR(Base):
                prev_list=prev_list,
                quant_levels=quant_levels,

                #temperature=sampling_temperature,
                temperature=1.0 if n == 0 else sampling_temperature,
                min_temperature=sampling_min_temperature,
                top_p=sampling_top_p,
                top_k=sampling_top_k,
                min_p=sampling_min_p,
                repetition_penalty=sampling_repetition_penalty,
                repetition_penalty_decay=sampling_repetition_penalty_decay,
                temperature=0.0, # sampling_temperature,
                #min_temperature=sampling_min_temperature,
                #top_p=sampling_top_p,
                #top_k=sampling_top_k,
                #min_p=sampling_min_p,
                #repetition_penalty=sampling_repetition_penalty,
                #repetition_penalty_decay=sampling_repetition_penalty_decay,
                #length_penalty=sampling_length_penalty,
                #beam_width=sampling_beam_width,
                #mirostat=mirostat,
            )
            resps_list = sampled[0]

            if n == 0:
                prev_list = [ r.unsqueeze(-1).to(device) for r in resps_list ]
            else:
                prev_list = [ torch.cat([rs, r.unsqueeze(-1).to(device)], dim=-1) for rs, r in zip(prev_list, resps_list) ]
            resps_list = sampled[0]
            prev_list = [ torch.cat([rs, r.unsqueeze(-1).to(device=device)], dim=-1) for rs, r in zip(prev_list, resps_list) ]

        return prev_list
@@ -520,4 +520,63 @@ def sample_entropix(
metrics["min_p"] = min_p
"""
return res, metrics
return res, metrics
#
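# gumbel noise makes the confidence ranking stochastic early on; as the
# temperature anneals to zero, selection collapses to a plain argmax over
# the model's confidences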
def add_gumbel_noise(t, temperature, device):
    return (t + torch.Tensor(temperature * np.random.gumbel(size=t.shape)).to(device))

# derived from https://github.com/LeapLabTHU/ImprovedNAT/blob/main/libs/nat_misc.py#L39
# this iteratively demasks a fully-masked sequence, re-sampling the least confident tokens each step
class SampleScheduler:
    def __init__(
        self,
        forward_lambda = None,
        mask_token = -1,
        max_steps = 25,
        device = "cuda",
        sampling_temperature=1.0,
    ):
        self.forward_lambda = forward_lambda
        self.max_steps = max_steps
        self.mask_token = mask_token
        self.device = device

        # cosine schedule for the fraction of tokens left masked at each step
        self.ratios = (np.cos(np.linspace(0, math.pi / 2, self.max_steps + 1)))[1:-1]
        # linearly annealed temperature for the gumbel-noised confidences
        self.annealed_temperatures = (1 - np.linspace(0, 1, self.max_steps + 1))[:-2]
        self.sampling_temperatures = [sampling_temperature for _ in range(self.max_steps)]

    def sample( self, seq_len ):
        # start from a fully-masked sequence
        ids = torch.full((1, seq_len), self.mask_token, dtype=torch.long, device=self.device)

        for step in range( self.max_steps ):
            mask_ratio = self.ratios[step] if step + 1 < self.max_steps else 0
            annealed_temperature = self.annealed_temperatures[step] if step + 1 < self.max_steps else 0
            sampling_temperature = self.sampling_temperatures[step] if step + 1 < self.max_steps else 1.0

            logits, sampled_ids = self.forward_lambda( ids, step=step, temperature=sampling_temperature )

            if step + 1 == self.max_steps:
                break

            # create next input sequence
            mask = (ids == self.mask_token)
            # how many tokens to keep masked next step, clamped to [1, still-masked - 1]
            mask_len = torch.Tensor([np.floor(seq_len * mask_ratio)]).to(self.device)
            mask_len = torch.maximum(
                torch.Tensor([1]).to(self.device),
                torch.minimum( torch.sum(mask, dim=-1, keepdim=True) - 1, mask_len )
            )[0].squeeze()

            # score each sampled token by its log-probability; positions already
            # decided in a previous step are pinned to +inf so they stay kept
            logits = torch.log_softmax(logits, dim=-1)
            sampled_logits = torch.squeeze(torch.gather(logits, dim=-1, index=torch.unsqueeze(sampled_ids, -1)), -1)
            sampled_ids = torch.where(mask, sampled_ids, ids)
            sampled_logits = torch.where(mask, sampled_logits, +np.inf).float()

            confidence = add_gumbel_noise(sampled_logits, annealed_temperature, self.device)
            sorted_confidence, _ = torch.sort(confidence, dim=-1)
            cut_off = sorted_confidence[:, mask_len.long() - 1:mask_len.long()]
            masking = (confidence <= cut_off)

            ids = torch.where(masking, self.mask_token, sampled_ids)

        return sampled_ids[0]