Can't get the model training started #416

I have an NVIDIA GPU and Windows 10. Here's the full console output from the launch:

C:\Users\imint\Desktop\voice\ai-voice-cloning>call .\venv\Scripts\activate.bat
Whisper detected
Traceback (most recent call last):
  File "C:\Users\imint\Desktop\voice\ai-voice-cloning\src\utils.py", line 85, in <module>
    from vall_e.emb.qnt import encode as valle_quantize
ModuleNotFoundError: No module named 'vall_e'

Traceback (most recent call last):
  File "C:\Users\imint\Desktop\voice\ai-voice-cloning\src\utils.py", line 105, in <module>
    import bark
ModuleNotFoundError: No module named 'bark'

Running on local URL: 

To create a public link, set `share=True` in `launch()`.
Loading TorToiSe... (AR: ./models/tortoise/autoregressive.pth, diffusion: ./models/tortoise/diffusion_decoder.pth, vocoder: bigvgan_24khz_100band)
Hardware acceleration found: cuda
use_deepspeed api_debug False
Loading tokenizer JSON: ./modules/tortoise-tts/tortoise/data/tokenizer.json
Loaded tokenizer
Loading autoregressive model: ./models/tortoise/autoregressive.pth
Loaded autoregressive model
Loaded diffusion model
Loading vocoder model: bigvgan_24khz_100band
Loading vocoder model: bigvgan_24khz_100band.pth
Removing weight norm...
Loaded vocoder model
Loaded TTS, ready for generation.
Unloaded TTS
Loading specialized model for language: en
Loading Whisper model: base.en
Loading Whisper model: base.en
Loaded Whisper model
C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\torchaudio\functional\functional.py:1458: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
  warnings.warn(
Text length too long (200 < 5343), using segments: voice1.wav
Audio not segmented, segmenting: voice1.wav
Sliced segments: 1 => 78.
Unloaded Whisper
Spawning process:  train.bat ./training/manvoice1/train.yaml
[Training] [2023-10-15T14:44:40.026622]
[Training] [2023-10-15T14:44:40.030622] (venv) C:\Users\imint\Desktop\voice\ai-voice-cloning>call .\venv\Scripts\activate.bat
[Training] [2023-10-15T14:44:42.947291] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-10-15T14:44:46.213041] 23-10-15 14:44:46.213 - INFO:   name: manvoice1
[Training] [2023-10-15T14:44:46.216042]   model: extensibletrainer
[Training] [2023-10-15T14:44:46.219043]   scale: 1
[Training] [2023-10-15T14:44:46.223044]   gpu_ids: [0]
[Training] [2023-10-15T14:44:46.226044]   start_step: 0
[Training] [2023-10-15T14:44:46.229045]   checkpointing_enabled: True
[Training] [2023-10-15T14:44:46.232045]   fp16: False
[Training] [2023-10-15T14:44:46.235047]   bitsandbytes: True
[Training] [2023-10-15T14:44:46.239047]   gpus: 1
[Training] [2023-10-15T14:44:46.242048]   datasets:[
[Training] [2023-10-15T14:44:46.245049]     train:[
[Training] [2023-10-15T14:44:46.248049]       name: training
[Training] [2023-10-15T14:44:46.251051]       n_workers: 2
[Training] [2023-10-15T14:44:46.254051]       batch_size: 78
[Training] [2023-10-15T14:44:46.257052]       mode: paired_voice_audio
[Training] [2023-10-15T14:44:46.260052]       path: ./training/manvoice1/train.txt
[Training] [2023-10-15T14:44:46.263053]       fetcher_mode: ['lj']
[Training] [2023-10-15T14:44:46.266053]       phase: train
[Training] [2023-10-15T14:44:46.269054]       max_wav_length: 255995
[Training] [2023-10-15T14:44:46.272055]       max_text_length: 200
[Training] [2023-10-15T14:44:46.275056]       sample_rate: 22050
[Training] [2023-10-15T14:44:46.277056]       load_conditioning: True
[Training] [2023-10-15T14:44:46.281057]       num_conditioning_candidates: 2
[Training] [2023-10-15T14:44:46.284058]       conditioning_length: 44000
[Training] [2023-10-15T14:44:46.287058]       use_bpe_tokenizer: True
[Training] [2023-10-15T14:44:46.291060]       tokenizer_vocab: ./modules/tortoise-tts/tortoise/data/tokenizer.json
[Training] [2023-10-15T14:44:46.294060]       load_aligned_codes: False
[Training] [2023-10-15T14:44:46.297060]       data_type: img
[Training] [2023-10-15T14:44:46.300061]     ]
[Training] [2023-10-15T14:44:46.303062]     val:[
[Training] [2023-10-15T14:44:46.307063]       name: validation
[Training] [2023-10-15T14:44:46.310064]       n_workers: 2
[Training] [2023-10-15T14:44:46.313064]       batch_size: 0
[Training] [2023-10-15T14:44:46.316065]       mode: paired_voice_audio
[Training] [2023-10-15T14:44:46.319065]       path: ./training/manvoice1/validation.txt
[Training] [2023-10-15T14:44:46.322066]       fetcher_mode: ['lj']
[Training] [2023-10-15T14:44:46.325067]       phase: val
[Training] [2023-10-15T14:44:46.328068]       max_wav_length: 255995
[Training] [2023-10-15T14:44:46.331069]       max_text_length: 200
[Training] [2023-10-15T14:44:46.334069]       sample_rate: 22050
[Training] [2023-10-15T14:44:46.338070]       load_conditioning: True
[Training] [2023-10-15T14:44:46.341071]       num_conditioning_candidates: 2
[Training] [2023-10-15T14:44:46.344072]       conditioning_length: 44000
[Training] [2023-10-15T14:44:46.347073]       use_bpe_tokenizer: True
[Training] [2023-10-15T14:44:46.351073]       tokenizer_vocab: ./modules/tortoise-tts/tortoise/data/tokenizer.json
[Training] [2023-10-15T14:44:46.353074]       load_aligned_codes: False
[Training] [2023-10-15T14:44:46.356074]       data_type: img
[Training] [2023-10-15T14:44:46.360075]     ]
[Training] [2023-10-15T14:44:46.363077]   ]
[Training] [2023-10-15T14:44:46.366076]   steps:[
[Training] [2023-10-15T14:44:46.369077]     gpt_train:[
[Training] [2023-10-15T14:44:46.372078]       training: gpt
[Training] [2023-10-15T14:44:46.375079]       loss_log_buffer: 500
[Training] [2023-10-15T14:44:46.378079]       optimizer: adamw
[Training] [2023-10-15T14:44:46.381080]       optimizer_params:[
[Training] [2023-10-15T14:44:46.384080]         lr: 1e-05
[Training] [2023-10-15T14:44:46.387081]         weight_decay: 0.01
[Training] [2023-10-15T14:44:46.390083]         beta1: 0.9
[Training] [2023-10-15T14:44:46.393082]         beta2: 0.96
[Training] [2023-10-15T14:44:46.396083]       ]
[Training] [2023-10-15T14:44:46.399084]       clip_grad_eps: 4
[Training] [2023-10-15T14:44:46.402084]       injectors:[
[Training] [2023-10-15T14:44:46.405085]         paired_to_mel:[
[Training] [2023-10-15T14:44:46.408086]           type: torch_mel_spectrogram
[Training] [2023-10-15T14:44:46.411087]           mel_norm_file: ./modules/tortoise-tts/tortoise/data/mel_norms.pth
[Training] [2023-10-15T14:44:46.414087]           in: wav
[Training] [2023-10-15T14:44:46.416087]           out: paired_mel
[Training] [2023-10-15T14:44:46.420089]         ]
[Training] [2023-10-15T14:44:46.422089]         paired_cond_to_mel:[
[Training] [2023-10-15T14:44:46.425090]           type: for_each
[Training] [2023-10-15T14:44:46.428091]           subtype: torch_mel_spectrogram
[Training] [2023-10-15T14:44:46.432092]           mel_norm_file: ./modules/tortoise-tts/tortoise/data/mel_norms.pth
[Training] [2023-10-15T14:44:46.435092]           in: conditioning
[Training] [2023-10-15T14:44:46.437092]           out: paired_conditioning_mel
[Training] [2023-10-15T14:44:46.441094]         ]
[Training] [2023-10-15T14:44:46.444095]         to_codes:[
[Training] [2023-10-15T14:44:46.448095]           type: discrete_token
[Training] [2023-10-15T14:44:46.450096]           in: paired_mel
[Training] [2023-10-15T14:44:46.453097]           out: paired_mel_codes
[Training] [2023-10-15T14:44:46.456097]           dvae_config: ./models/tortoise/train_diffusion_vocoder_22k_level.yml
[Training] [2023-10-15T14:44:46.459098]         ]
[Training] [2023-10-15T14:44:46.462098]         paired_fwd_text:[
[Training] [2023-10-15T14:44:46.465099]           type: generator
[Training] [2023-10-15T14:44:46.468100]           generator: gpt
[Training] [2023-10-15T14:44:46.471100]           in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
[Training] [2023-10-15T14:44:46.474102]           out: ['loss_text_ce', 'loss_mel_ce', 'logits']
[Training] [2023-10-15T14:44:46.477101]         ]
[Training] [2023-10-15T14:44:46.480103]       ]
[Training] [2023-10-15T14:44:46.483103]       losses:[
[Training] [2023-10-15T14:44:46.486104]         text_ce:[
[Training] [2023-10-15T14:44:46.489104]           type: direct
[Training] [2023-10-15T14:44:46.492105]           weight: 0.01
[Training] [2023-10-15T14:44:46.495106]           key: loss_text_ce
[Training] [2023-10-15T14:44:46.497106]         ]
[Training] [2023-10-15T14:44:46.500107]         mel_ce:[
[Training] [2023-10-15T14:44:46.503107]           type: direct
[Training] [2023-10-15T14:44:46.506108]           weight: 1
[Training] [2023-10-15T14:44:46.510110]           key: loss_mel_ce
[Training] [2023-10-15T14:44:46.513110]         ]
[Training] [2023-10-15T14:44:46.517111]       ]
[Training] [2023-10-15T14:44:46.519111]     ]
[Training] [2023-10-15T14:44:46.522112]   ]
[Training] [2023-10-15T14:44:46.525113]   networks:[
[Training] [2023-10-15T14:44:46.528114]     gpt:[
[Training] [2023-10-15T14:44:46.531114]       type: generator
[Training] [2023-10-15T14:44:46.534115]       which_model_G: unified_voice2
[Training] [2023-10-15T14:44:46.537116]       kwargs:[
[Training] [2023-10-15T14:44:46.540116]         layers: 30
[Training] [2023-10-15T14:44:46.543117]         model_dim: 1024
[Training] [2023-10-15T14:44:46.546118]         heads: 16
[Training] [2023-10-15T14:44:46.548118]         max_text_tokens: 402
[Training] [2023-10-15T14:44:46.551119]         max_mel_tokens: 604
[Training] [2023-10-15T14:44:46.554120]         max_conditioning_inputs: 2
[Training] [2023-10-15T14:44:46.557120]         mel_length_compression: 1024
[Training] [2023-10-15T14:44:46.560121]         number_text_tokens: 256
[Training] [2023-10-15T14:44:46.563122]         number_mel_codes: 8194
[Training] [2023-10-15T14:44:46.566122]         start_mel_token: 8192
[Training] [2023-10-15T14:44:46.569123]         stop_mel_token: 8193
[Training] [2023-10-15T14:44:46.572124]         start_text_token: 255
[Training] [2023-10-15T14:44:46.574125]         train_solo_embeddings: False
[Training] [2023-10-15T14:44:46.577125]         use_mel_codes_as_input: True
[Training] [2023-10-15T14:44:46.580126]         checkpointing: True
[Training] [2023-10-15T14:44:46.583127]         tortoise_compat: True
[Training] [2023-10-15T14:44:46.586128]       ]
[Training] [2023-10-15T14:44:46.589128]     ]
[Training] [2023-10-15T14:44:46.592129]   ]
[Training] [2023-10-15T14:44:46.595130]   path:[
[Training] [2023-10-15T14:44:46.598130]     strict_load: True
[Training] [2023-10-15T14:44:46.601130]     pretrain_model_gpt: ./models/tortoise/autoregressive.pth
[Training] [2023-10-15T14:44:46.604131]     root: ./
[Training] [2023-10-15T14:44:46.607132]     experiments_root: ./training\manvoice1\finetune
[Training] [2023-10-15T14:44:46.610133]     models: ./training\manvoice1\finetune\models
[Training] [2023-10-15T14:44:46.613134]     training_state: ./training\manvoice1\finetune\training_state
[Training] [2023-10-15T14:44:46.616134]     log: ./training\manvoice1\finetune
[Training] [2023-10-15T14:44:46.619135]     val_images: ./training\manvoice1\finetune\val_images
[Training] [2023-10-15T14:44:46.621135]   ]
[Training] [2023-10-15T14:44:46.624136]   train:[
[Training] [2023-10-15T14:44:46.627136]     niter: 300
[Training] [2023-10-15T14:44:46.630137]     warmup_iter: -1
[Training] [2023-10-15T14:44:46.633138]     mega_batch_factor: 19
[Training] [2023-10-15T14:44:46.636138]     val_freq: 50
[Training] [2023-10-15T14:44:46.639139]     ema_enabled: False
[Training] [2023-10-15T14:44:46.643140]     default_lr_scheme: MultiStepLR
[Training] [2023-10-15T14:44:46.645140]     gen_lr_steps: [2, 4, 9, 18, 25, 33, 50]
[Training] [2023-10-15T14:44:46.649141]     lr_gamma: 0.5
[Training] [2023-10-15T14:44:46.652142]   ]
[Training] [2023-10-15T14:44:46.654142]   eval:[
[Training] [2023-10-15T14:44:46.658144]     pure: False
[Training] [2023-10-15T14:44:46.661144]     output_state: gen
[Training] [2023-10-15T14:44:46.665145]   ]
[Training] [2023-10-15T14:44:46.668145]   logger:[
[Training] [2023-10-15T14:44:46.671146]     save_checkpoint_freq: 50
[Training] [2023-10-15T14:44:46.673147]     visuals: ['gen', 'mel']
[Training] [2023-10-15T14:44:46.676148]     visual_debug_rate: 50
[Training] [2023-10-15T14:44:46.679149]     is_mel_spectrogram: True
[Training] [2023-10-15T14:44:46.683149]   ]
[Training] [2023-10-15T14:44:46.686150]   is_train: True
[Training] [2023-10-15T14:44:46.689151]   dist: False
[Training] [2023-10-15T14:44:46.692152]
[Training] [2023-10-15T14:44:46.695152] 23-10-15 14:44:46.213 - INFO: Random seed: 3787
[Training] [2023-10-15T14:44:47.691380] 23-10-15 14:44:47.691 - INFO: Number of training data elements: 78, iters: 1
[Training] [2023-10-15T14:44:47.694381] 23-10-15 14:44:47.691 - INFO: Total epochs needed: 300 for iters 300
[Training] [2023-10-15T14:44:48.895160] C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\transformers\configuration_utils.py:363: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
[Training] [2023-10-15T14:44:48.899161]   warnings.warn(
[Training] [2023-10-15T14:44:56.655180] 23-10-15 14:44:56.655 - INFO: Loading model for [./models/tortoise/autoregressive.pth]
[Training] [2023-10-15T14:44:59.082738] 23-10-15 14:44:59.044 - INFO: Start training from epoch: 0, iter: 0
[Training] [2023-10-15T14:45:01.855374] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-10-15T14:45:04.184911] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-10-15T14:45:05.347884] C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\torch\optim\lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
[Training] [2023-10-15T14:45:05.347884]   warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[Training] [2023-10-15T14:45:21.709777] Disabled distributed training.
[Training] [2023-10-15T14:45:21.709777] Loading from ./models/tortoise/dvae.pth
[Training] [2023-10-15T14:45:21.710778] Traceback (most recent call last):
[Training] [2023-10-15T14:45:21.710778]   File "C:\Users\imint\Desktop\voice\ai-voice-cloning\src\train.py", line 64, in <module>
[Training] [2023-10-15T14:45:21.710778]     train(config_path, args.launcher)
[Training] [2023-10-15T14:45:21.710778]   File "C:\Users\imint\Desktop\voice\ai-voice-cloning\src\train.py", line 31, in train
[Training] [2023-10-15T14:45:21.711778]     trainer.do_training()
[Training] [2023-10-15T14:45:21.711778]   File "c:\users\imint\desktop\voice\ai-voice-cloning\modules\dlas\dlas\train.py", line 408, in do_training
[Training] [2023-10-15T14:45:21.750786]     metric = self.do_step(train_data)
[Training] [2023-10-15T14:45:21.750786]   File "c:\users\imint\desktop\voice\ai-voice-cloning\modules\dlas\dlas\train.py", line 271, in do_step
[Training] [2023-10-15T14:45:21.751787]     gradient_norms_dict = self.model.optimize_parameters(
[Training] [2023-10-15T14:45:21.751787]   File "c:\users\imint\desktop\voice\ai-voice-cloning\modules\dlas\dlas\trainer\ExtensibleTrainer.py", line 321, in optimize_parameters
[Training] [2023-10-15T14:45:21.806800]     ns = step.do_forward_backward(
[Training] [2023-10-15T14:45:21.807801]   File "c:\users\imint\desktop\voice\ai-voice-cloning\modules\dlas\dlas\trainer\steps.py", line 242, in do_forward_backward
[Training] [2023-10-15T14:45:21.813802]     local_state[k] = v[grad_accum_step]
[Training] [2023-10-15T14:45:21.813802] IndexError: list index out of range

After that, nothing happens, although Task Manager shows Python still using RAM and VRAM.


> [Training] [2023-10-15T14:45:21.813802] local_state[k] = v[grad_accum_step]
> [Training] [2023-10-15T14:45:21.813802] IndexError: list index out of range

Your gradient accumulation size is either too large for your batch size, or your batch size does not split evenly enough into that many parts.
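As a rough illustration of why these settings fail, here is a minimal sketch. It assumes the trainer splits each batch into `mega_batch_factor` pieces the way `torch.chunk` does (equal-sized chunks of the rounded-up length, possibly fewer chunks than requested). That is my assumption about the trainer, not something confirmed from its code, and the helper names `chunks_produced` and `is_safe` are made up for this example.

```python
import math


def chunks_produced(batch_size: int, grad_accum: int) -> int:
    """Number of pieces a torch.chunk-style split of `batch_size` items yields
    when asked for `grad_accum` pieces (it can be fewer than requested)."""
    chunk_len = math.ceil(batch_size / grad_accum)  # every chunk but the last has this length
    return math.ceil(batch_size / chunk_len)


def is_safe(batch_size: int, grad_accum: int) -> bool:
    """True when the split yields exactly `grad_accum` chunks, so indexing
    chunk number `grad_accum - 1` cannot run off the end of the list."""
    if grad_accum < 1:
        return False
    return chunks_produced(batch_size, grad_accum) == grad_accum


# Values from the log above: batch_size 78, mega_batch_factor 19.
print(chunks_produced(78, 19), is_safe(78, 19))
```

Under that assumption, a batch of 78 asked for 19 chunks yields only 16 (chunks of 5 items), so the loop over 19 accumulation steps would index past the end of the list, which matches the `IndexError` in the traceback.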


Thanks a lot, I set Batch size to 64 and Gradient Accumulation Size to 32 and everything worked. But I have another question, probably a very stupid one. I liked one of the randomly generated voices; how do I use it? I tried using the same seed, but the second generation gave a different voice: the first was female and the second was male.

> I liked one of the random generated voices, how do I use it?

If you happened to have `Embed Output Metadata` enabled, you can take the output with the random latents you want and drag and drop it into the `Utilities` > `Import/Analyze` tab; there should be a field that extracts the latents used for that generation. You can then take the `cond_latents.pth` file and put it under `./voices/{voice name}/`, and it should use those latents for subsequent generations.

If you didn't, you should be able to just use the output file as a voice input again.

However, neither option is a hard guarantee. I don't recall how consistent reusing randomly generated latents is.
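For the manual route, placing the latents file can be sketched as below. This is only an illustration: the helper name `install_latents` and its parameters are hypothetical, and the only details taken from the advice above are the `cond_latents.pth` file name and the `./voices/{voice name}/` layout.

```python
import shutil
from pathlib import Path


def install_latents(latents_file, voice_name, voices_root="./voices"):
    """Copy an exported latents file to <voices_root>/<voice_name>/cond_latents.pth.

    Hypothetical helper; it only moves the file and does not check its contents.
    """
    src = Path(latents_file)
    if not src.is_file():
        raise FileNotFoundError(f"latents file not found: {src}")

    # Create the voice folder if it does not exist yet.
    voice_dir = Path(voices_root) / voice_name
    voice_dir.mkdir(parents=True, exist_ok=True)

    dest = voice_dir / "cond_latents.pth"
    shutil.copy2(src, dest)
    return dest
```

Calling `install_latents("exported.pth", "testvoice")` would leave the file at `./voices/testvoice/cond_latents.pth`; `exported.pth` is a placeholder name.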

> tried using the same seed, but the voice was different in the second case.

I don't believe the seed gets used when generating random latents, just for the generation step.


Thank you again. I did as you said: I checked that Embed Output Metadata is enabled in the settings, inserted the audio I want in Import/Analyze, then imported that voice and selected it in the Generate tab, but I get an error when generating. Here is the console output after "Loaded TTS, ready for generation.":
Importing latents to b'PK\x03\x04\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x1e\x00\x04\x00cond_latents_d1f79232/data.pklFB\x00\x00\x80\x02N.PK\x07\x08\r\xd2\xb5}\x04\x00\x00\x00\x04\x00\x00\x00PK\x03\x04\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x1d\x001\x00cond_latents_d1f79232/versionFB-\x00ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ3\nPK\x07\x08\xd1\x9egU\x02\x00\x00\x00\x02\x00\x00\x00PK\x01\x02\x00\x00\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\r\xd2\xb5}\x04\x00\x00\x00\x04\x00\x00\x00\x1e\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00cond_latents_d1f79232/data.pklPK\x01\x02\x00\x00\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\xd1\x9egU\x02\x00\x00\x00\x02\x00\x00\x00\x1d\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00T\x00\x00\x00cond_latents_d1f79232/versionPK\x06\x06,\x00\x00\x00\x00\x00\x00\x00\x1e\x03-\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x97\x00\x00\x00\x00\x00\x00\x00\xd2\x00\x00\x00\x00\x00\x00\x00PK\x06\x07\x00\x00\x00\x00i\x01\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00PK\x05\x06\x00\x00\x00\x00\x02\x00\x02\x00\x97\x00\x00\x00\xd2\x00\x00\x00\x00\x00'
Imported latents to ./voices/testvoice//cond_latents.pth
[1/1] Generating line: The Great Wall of China Is Not Visible from Space: Despite the common myth, the Great Wall of China is not visible to the naked eye from space.
Loading voice: testvoice with model d1f79232
Loading voice: testvoice
C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\torchaudio\functional\functional.py:1458: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
warnings.warn(
Traceback (most recent call last):
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\gradio\routes.py", line 394, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 1075, in process_api
result = await self.call_function(
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\gradio\blocks.py", line 884, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
result = context.run(func, *args)
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\gradio\helpers.py", line 587, in tracked_fn
response = fn(*args)
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\src\webui.py", line 94, in generate_proxy
raise e
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\src\webui.py", line 88, in generate_proxy
sample, outputs, stats = generate(**kwargs)
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\src\utils.py", line 351, in generate
return generate_tortoise(**kwargs)
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\src\utils.py", line 1211, in generate_tortoise
gen, additionals = tts.tts(cut_text, **settings )
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\modules\tortoise-tts\tortoise\api.py", line 717, in tts
auto_conditioning, diffusion_conditioning, auto_conds, _ = self.get_conditioning_latents(voice_samples, return_mels=True, verbose=True)
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\imint\Desktop\voice\ai-voice-cloning\modules\tortoise-tts\tortoise\api.py", line 545, in get_conditioning_latents
concat = torch.cat(samples, dim=-1)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
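
A side note on the `Importing latents to b'PK...'` line: in that dump, the `data.pkl` entry of the archive holds only the four bytes `\x80\x02N.`. The snippet below decodes exactly those bytes with the standard `pickle` module. This is a reading of the log, not a confirmed diagnosis, but it suggests the imported file carried no latents, which would fit the loader then looking for audio samples and finding none.

```python
import pickle

# The four payload bytes stored under cond_latents_d1f79232/data.pkl in the log.
payload = b"\x80\x02N."

# \x80\x02 is the pickle protocol 2 header, "N" pushes None, "." ends the stream.
value = pickle.loads(payload)
print(value)
```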


I think I figured it out: it looks like I need to have audio of this voice in the folder. Thanks for your help.
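
For anyone hitting the same `torch.cat(): expected a non-empty list of Tensors` error, a quick check along these lines can confirm whether a voice folder has any audio in it before generating. The helper name `voice_audio_files` and the extension list are assumptions for illustration; the tool may accept other formats.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust to whatever your setup accepts.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def voice_audio_files(voice_name, voices_root="./voices"):
    """Return the audio files found directly inside <voices_root>/<voice_name>."""
    voice_dir = Path(voices_root) / voice_name
    if not voice_dir.is_dir():
        return []
    return sorted(
        p for p in voice_dir.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


if __name__ == "__main__":
    files = voice_audio_files("testvoice")
    if not files:
        print("No audio found in the voice folder; add at least one clip.")
    else:
        print("Found:", ", ".join(p.name for p in files))
```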
