cringe fix because I guess I moved which logit gets trained for len duration (I should probably rethink this)

2025-03-25 21:33:01 -05:00
commit 476d87d4aa
parent a1eb96e6c1
2 changed files with 4 additions and 1 deletions
--- a/docs/models_v2.md
+++ b/docs/models_v2.md
@@ -145,6 +145,9 @@ These settings should be avoided:

 To be evaluated thoroughly.
 * The smaller model seems to have hit its capacity limit, while the larger model is slowly improving (although objective metrics are not noted).
+* The model seems pretty quick, even for the large model.
+* The smaller model seems small enough for CPU-only inferencing
+	* Despite its poor zero-shot performance, it could be perfectly fine for finetuning.

 At a glance, compared to the prior model setup, this implementation allows for the model to better represent speech as it's able to see the entire signal and account for it in its latent space, rather than only specific levels of it.

--- a/vall_e/models/base_v2.py
+++ b/vall_e/models/base_v2.py
@@ -765,7 +765,7 @@ class Base_V2(nn.Module):

 			# needed, cringe
 			if task_type == "len":
-				batch[-1] = torch.cat( [ batch[-1], self.sep[None] ] )
+				batch[-1] = torch.cat( [ batch[-1], self.sep[None], self.sep[None] ] )

 			x_list.append( _join( batch, self.sep ) )
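
The effect of the `base_v2.py` change on the `len` input sequence can be sketched roughly as below. This is only an illustration under stated assumptions: it uses plain lists of token labels instead of embedding tensors, `join_segments` is a hypothetical stand-in for `_join` (assumed to place the separator between segments), and `build_len_input` and the token labels are made up for the example. It only shows that the sequence now ends in two separators rather than one, which shifts the final position by one.

```python
# Simplified stand-in: plain lists of token labels instead of embedding tensors.
SEP = "<sep>"

def join_segments(segments, sep):
    """Hypothetical stand-in for `_join`: concatenate segments with `sep` between them."""
    joined = []
    for i, segment in enumerate(segments):
        if i > 0:
            joined.append(sep)
        joined.extend(segment)
    return joined

def build_len_input(segments, n_trailing_sep):
    """Append `n_trailing_sep` separators to the last segment, then join everything."""
    segments = [list(s) for s in segments]  # copy so the caller's lists are untouched
    segments[-1] = segments[-1] + [SEP] * n_trailing_sep
    return join_segments(segments, SEP)

# Hypothetical input: a text segment followed by a prompt segment.
segments = [["t0", "t1"], ["p0", "p1", "p2"]]

before = build_len_input(segments, 1)  # behaviour prior to this commit
after = build_len_input(segments, 2)   # behaviour after this commit

print(before)
print(after)
```

If the trained `len` logit moved by one position, as the commit message suggests, the extra separator would give that position a slot to occupy; that reading is an inference from the commit message, not something the diff confirms.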