# ai-voice-cloning/models/.template.yaml

name: ${voice}
model: extensibletrainer
scale: 1
gpu_ids: [0] # Redundant: the way you launch the training script sets this
start_step: 0
checkpointing_enabled: true 
fp16: ${half_p}
bitsandbytes: ${bitsandbytes}
gpus: ${gpus}
wandb: false 
use_tb_logger: true

datasets:
  train:
    name: training
    n_workers: ${workers}
    batch_size: ${batch_size}
    mode: paired_voice_audio
    path: ${dataset_path}
    fetcher_mode: ['lj']
    phase: train
    max_wav_length: 255995
    max_text_length: 200
    sample_rate: 22050
    load_conditioning: True
    num_conditioning_candidates: 2
    conditioning_length: 44000
    use_bpe_tokenizer: True
    tokenizer_vocab: ./models/tortoise/bpe_lowercase_asr_256.json
    load_aligned_codes: False
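    # Worked arithmetic for the length limits above, assuming both lengths are
    # counted in audio samples at sample_rate (an assumption, not a new setting):
    #   max_wav_length:      255995 samples / 22050 Hz ~= 11.61 s per training clip
    #   conditioning_length:  44000 samples / 22050 Hz ~=  2.00 s per conditioning clip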
  val: # I really do not care about validation right now
    name: validation
    n_workers: ${workers}
    batch_size: ${validation_batch_size}
    mode: paired_voice_audio
    path: ${validation_path}
    fetcher_mode: ['lj']
    phase: val
    max_wav_length: 255995
    max_text_length: 200
    sample_rate: 22050
    load_conditioning: True
    num_conditioning_candidates: 2
    conditioning_length: 44000
    use_bpe_tokenizer: True
    tokenizer_vocab: ./models/tortoise/bpe_lowercase_asr_256.json
    load_aligned_codes: False

steps:        
  gpt_train:
    training: gpt
    loss_log_buffer: 500

    # Generally follows the recipe from the DALLE paper.
    optimizer: ${optimizer} # this should be adamw_zero if you're using distributed training
    optimizer_params:
      lr: !!float ${learning_rate} # originally: 1e-4
      weight_decay: !!float 1e-2
      beta1: 0.9
      beta2: 0.96
    clip_grad_eps: 4

    injectors:
      paired_to_mel:
        type: torch_mel_spectrogram
        mel_norm_file: ./models/tortoise/clips_mel_norms.pth
        in: wav
        out: paired_mel
      paired_cond_to_mel:
        type: for_each
        subtype: torch_mel_spectrogram
        mel_norm_file: ./models/tortoise/clips_mel_norms.pth
        in: conditioning
        out: paired_conditioning_mel
      to_codes:
        type: discrete_token
        in: paired_mel
        out: paired_mel_codes
        dvae_config: "./models/tortoise/train_diffusion_vocoder_22k_level.yml"
      paired_fwd_text:
        type: generator
        generator: gpt
        in: [paired_conditioning_mel, padded_text, text_lengths, paired_mel_codes, wav_lengths]
        out: [loss_text_ce, loss_mel_ce, logits]      
    losses:
      text_ce:
        type: direct
        weight: ${text_ce_lr_weight}
        key: loss_text_ce
      mel_ce:
        type: direct
        weight: 1
        key: loss_mel_ce

networks:
  gpt:
    type: generator 
    which_model_G: unified_voice2 # none of the unified_voice*.py files actually match the tortoise inference code... 4 and 3 have "alignment_head" (wtf is that?), 2 lacks the types=1 parameter.
    kwargs:
      layers: 30 # originally: 8
      model_dim: 1024 # originally: 512
      heads: 16 # originally: 8
      max_text_tokens: 402 # originally: 120
      max_mel_tokens: 604 # originally: 250
      max_conditioning_inputs: 2 # originally: 1
      mel_length_compression: 1024
      number_text_tokens: 256 # supposed to be 255 for newer unified_voice files 
      number_mel_codes: 8194
      start_mel_token: 8192
      stop_mel_token: 8193
      start_text_token: 255
      train_solo_embeddings: False # missing in uv3/4
      use_mel_codes_as_input: True # ditto
      checkpointing: True
      #types: 1 # this is MISSING, but in my analysis 1 is equivalent to not having it.
      #only_alignment_head: False  # uv3/4
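      # Rough sanity check of the token limits above, assuming mel_length_compression
      # is the number of audio samples per mel code (an assumption, not confirmed here):
      #   max_wav_length 255995 samples / 1024 ~= 250 mel codes (matches the original max_mel_tokens)
      #   max_mel_tokens 604 * 1024 samples / 22050 Hz ~= 28.05 s of audio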

path:
  strict_load: true
  ${source_model} 
  ${resume_state}
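  # Illustrative sketch only: the two placeholders above are expected to expand
  # to whole `key: value` lines. The key names and paths below are assumptions
  # for illustration, not values taken from this template:
  #   pretrain_model_gpt: './models/tortoise/autoregressive.pth'  # hypothetical path
  #   resume_state: './training/<voice>/<step>.state'             # hypothetical path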

train:
  niter: ${iterations}
  warmup_iter: -1
  mega_batch_factor: ${gradient_accumulation_size}
  val_freq: ${validation_rate}

  ema_enabled: false # I really don't think EMA matters

  ${learning_rate_scheme}
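  # Illustrative sketch only: ${learning_rate_scheme} is expected to expand to one
  # or more `key: value` lines at this indentation. The key names and values below
  # are assumptions, shown for a step-decay schedule:
  #   default_lr_scheme: MultiStepLR
  #   gen_lr_steps: [500, 1000, 1500]  # hypothetical milestones, in iterations
  #   lr_gamma: 0.5                    # hypothetical decay factor per milestone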

eval:
  pure: ${validation_enabled}
  output_state: gen

logger: 
  print_freq: ${print_rate}
  save_checkpoint_freq: ${save_rate}
  visuals: [gen, mel]
  visual_debug_rate: ${print_rate}
  is_mel_spectrogram: true