torch.cuda.OutOfMemoryError: CUDA out of memory. OOM #464
Hi,
I am not able to run training on an 8 GB VRAM graphics card with CUDA 11.8. Any ideas? Thank you for your help.
I have tried the following (but still get CUDA out of memory):
Batch Size: 2
Gradient Accumulation Size: 1
In \modules\tortoise-tts\tortoise\api.py, set autoregressive_batch_size=1:
def __init__(self, autoregressive_batch_size=1, models_dir=MODELS_DIR, enable_redaction=True, device=None,
Only 2 wav files in the voices folder, 398 KB in total.
In the Settings tab, Low VRAM is checked.
In the Generate tab, Line Delimiter is set to \n.
Below are the logs:
To create a public link, set `share=True` in `launch()`.
Loading TorToiSe... (AR: ./models/tortoise/autoregressive.pth, diffusion: ./models/tortoise/diffusion_decoder.pth, vocoder: bigvgan_24khz_100band)
Hardware acceleration found: cuda
use_deepspeed api_debug False
Loading tokenizer JSON: ./modules/tortoise-tts/tortoise/data/tokenizer.json
Loaded tokenizer
Loading autoregressive model: ./models/tortoise/autoregressive.pth
Loaded autoregressive model
Loaded diffusion model
Loading vocoder model: bigvgan_24khz_100band
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Loading vocoder model: bigvgan_24khz_100band.pth
Removing weight norm...
Loaded vocoder model
Loaded TTS, ready for generation.
Unloaded TTS
Spawning process: train.bat ./training/jack/train.yaml
[Training] [2023-12-17T22:33:03.842674]
[Training] [2023-12-17T22:33:03.848591] (pytorch_tts) E:\ai-voice-cloning>#call .\venv\Scripts\activate.bat
[Training] [2023-12-17T22:33:03.854659] '#call' is not recognized as an internal or external command,
[Training] [2023-12-17T22:33:03.860181] operable program or batch file.
[Training] [2023-12-17T22:33:03.866641]
[Training] [2023-12-17T22:33:03.873753] (pytorch_tts) E:\ai-voice-cloning>set PYTHONUTF8=1
[Training] [2023-12-17T22:33:03.879305]
[Training] [2023-12-17T22:33:03.884244] (pytorch_tts) E:\ai-voice-cloning>python ./src/train.py --yaml "./training/jack/train.yaml"
[Training] [2023-12-17T22:33:05.457371] [2023-12-17 22:33:05,457] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-12-17T22:33:06.598164] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/Users/Jay/anaconda3/envs/pytorch_tts/lib'), WindowsPath('C')}
[Training] [2023-12-17T22:33:06.602153] warn(
[Training] [2023-12-17T22:33:06.604145] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:93: UserWarning: C:\Users\Jay\anaconda3\envs\pytorch_tts did not contain libcudart.so as expected! Searching further paths...
[Training] [2023-12-17T22:33:06.610273] warn(
[Training] [2023-12-17T22:33:06.616800] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
[Training] [2023-12-17T22:33:06.622087] warn(
[Training] [2023-12-17T22:33:06.959799] 23-12-17 22:33:06.957 - INFO: name: jack
[Training] [2023-12-17T22:33:06.962771] model: extensibletrainer
[Training] [2023-12-17T22:33:06.964764] scale: 1
[Training] [2023-12-17T22:33:06.967749] gpu_ids: [0]
[Training] [2023-12-17T22:33:06.970756] start_step: 0
[Training] [2023-12-17T22:33:06.974180] checkpointing_enabled: True
[Training] [2023-12-17T22:33:06.982167] fp16: True
[Training] [2023-12-17T22:33:06.989423] bitsandbytes: False
[Training] [2023-12-17T22:33:06.996402] gpus: 1
[Training] [2023-12-17T22:33:07.002384] datasets:[
[Training] [2023-12-17T22:33:07.012402] train:[
[Training] [2023-12-17T22:33:07.020621] name: training
[Training] [2023-12-17T22:33:07.025208] n_workers: 2
[Training] [2023-12-17T22:33:07.029199] batch_size: 1
[Training] [2023-12-17T22:33:07.036598] mode: paired_voice_audio
[Training] [2023-12-17T22:33:07.041644] path: ./training/jack/train.txt
[Training] [2023-12-17T22:33:07.047659] fetcher_mode: ['lj']
[Training] [2023-12-17T22:33:07.053604] phase: train
[Training] [2023-12-17T22:33:07.060579] max_wav_length: 255995
[Training] [2023-12-17T22:33:07.068554] max_text_length: 200
[Training] [2023-12-17T22:33:07.073004] sample_rate: 22050
[Training] [2023-12-17T22:33:07.076995] load_conditioning: True
[Training] [2023-12-17T22:33:07.081978] num_conditioning_candidates: 2
[Training] [2023-12-17T22:33:07.086964] conditioning_length: 44000
[Training] [2023-12-17T22:33:07.093234] use_bpe_tokenizer: True
[Training] [2023-12-17T22:33:07.099166] tokenizer_vocab: ./modules/tortoise-tts/tortoise/data/tokenizer.json
[Training] [2023-12-17T22:33:07.104281] load_aligned_codes: False
[Training] [2023-12-17T22:33:07.111298] data_type: img
[Training] [2023-12-17T22:33:07.116302] ]
[Training] [2023-12-17T22:33:07.121629] val:[
[Training] [2023-12-17T22:33:07.127268] name: validation
[Training] [2023-12-17T22:33:07.131611] n_workers: 2
[Training] [2023-12-17T22:33:07.135530] batch_size: 1
[Training] [2023-12-17T22:33:07.141512] mode: paired_voice_audio
[Training] [2023-12-17T22:33:07.145497] path: ./training/jack/validation.txt
[Training] [2023-12-17T22:33:07.149484] fetcher_mode: ['lj']
[Training] [2023-12-17T22:33:07.153471] phase: val
[Training] [2023-12-17T22:33:07.157498] max_wav_length: 255995
[Training] [2023-12-17T22:33:07.162592] max_text_length: 200
[Training] [2023-12-17T22:33:07.166623] sample_rate: 22050
[Training] [2023-12-17T22:33:07.170570] load_conditioning: True
[Training] [2023-12-17T22:33:07.175552] num_conditioning_candidates: 2
[Training] [2023-12-17T22:33:07.178542] conditioning_length: 44000
[Training] [2023-12-17T22:33:07.182529] use_bpe_tokenizer: True
[Training] [2023-12-17T22:33:07.186674] tokenizer_vocab: ./modules/tortoise-tts/tortoise/data/tokenizer.json
[Training] [2023-12-17T22:33:07.190767] load_aligned_codes: False
[Training] [2023-12-17T22:33:07.194856] data_type: img
[Training] [2023-12-17T22:33:07.198741] ]
[Training] [2023-12-17T22:33:07.203166] ]
[Training] [2023-12-17T22:33:07.207154] steps:[
[Training] [2023-12-17T22:33:07.212417] gpt_train:[
[Training] [2023-12-17T22:33:07.215360] training: gpt
[Training] [2023-12-17T22:33:07.219346] loss_log_buffer: 500
[Training] [2023-12-17T22:33:07.225051] optimizer: adamw
[Training] [2023-12-17T22:33:07.230040] optimizer_params:[
[Training] [2023-12-17T22:33:07.236358] lr: 1e-05
[Training] [2023-12-17T22:33:07.242294] weight_decay: 0.01
[Training] [2023-12-17T22:33:07.247213] beta1: 0.9
[Training] [2023-12-17T22:33:07.253192] beta2: 0.96
[Training] [2023-12-17T22:33:07.259590] ]
[Training] [2023-12-17T22:33:07.263668] clip_grad_eps: 4
[Training] [2023-12-17T22:33:07.272722] injectors:[
[Training] [2023-12-17T22:33:07.280747] paired_to_mel:[
[Training] [2023-12-17T22:33:07.287764] type: torch_mel_spectrogram
[Training] [2023-12-17T22:33:07.298639] mel_norm_file: ./modules/tortoise-tts/tortoise/data/mel_norms.pth
[Training] [2023-12-17T22:33:07.307023] in: wav
[Training] [2023-12-17T22:33:07.313970] out: paired_mel
[Training] [2023-12-17T22:33:07.320014] ]
[Training] [2023-12-17T22:33:07.327532] paired_cond_to_mel:[
[Training] [2023-12-17T22:33:07.334438] type: for_each
[Training] [2023-12-17T22:33:07.341640] subtype: torch_mel_spectrogram
[Training] [2023-12-17T22:33:07.346632] mel_norm_file: ./modules/tortoise-tts/tortoise/data/mel_norms.pth
[Training] [2023-12-17T22:33:07.353607] in: conditioning
[Training] [2023-12-17T22:33:07.364579] out: paired_conditioning_mel
[Training] [2023-12-17T22:33:07.373781] ]
[Training] [2023-12-17T22:33:07.380763] to_codes:[
[Training] [2023-12-17T22:33:07.394721] type: discrete_token
[Training] [2023-12-17T22:33:07.408670] in: paired_mel
[Training] [2023-12-17T22:33:07.416648] out: paired_mel_codes
[Training] [2023-12-17T22:33:07.427133] dvae_config: ./models/tortoise/train_diffusion_vocoder_22k_level.yml
[Training] [2023-12-17T22:33:07.435448] ]
[Training] [2023-12-17T22:33:07.441470] paired_fwd_text:[
[Training] [2023-12-17T22:33:07.445414] type: generator
[Training] [2023-12-17T22:33:07.449692] generator: gpt
[Training] [2023-12-17T22:33:07.452682] in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
[Training] [2023-12-17T22:33:07.455865] out: ['loss_text_ce', 'loss_mel_ce', 'logits']
[Training] [2023-12-17T22:33:07.460182] ]
[Training] [2023-12-17T22:33:07.464168] ]
[Training] [2023-12-17T22:33:07.467158] losses:[
[Training] [2023-12-17T22:33:07.470244] text_ce:[
[Training] [2023-12-17T22:33:07.473607] type: direct
[Training] [2023-12-17T22:33:07.477697] weight: 0.01
[Training] [2023-12-17T22:33:07.481077] key: loss_text_ce
[Training] [2023-12-17T22:33:07.484312] ]
[Training] [2023-12-17T22:33:07.488394] mel_ce:[
[Training] [2023-12-17T22:33:07.491792] type: direct
[Training] [2023-12-17T22:33:07.495130] weight: 1
[Training] [2023-12-17T22:33:07.498444] key: loss_mel_ce
[Training] [2023-12-17T22:33:07.501721] ]
[Training] [2023-12-17T22:33:07.505988] ]
[Training] [2023-12-17T22:33:07.510105] ]
[Training] [2023-12-17T22:33:07.514182] ]
[Training] [2023-12-17T22:33:07.518176] networks:[
[Training] [2023-12-17T22:33:07.521157] gpt:[
[Training] [2023-12-17T22:33:07.525324] type: generator
[Training] [2023-12-17T22:33:07.530398] which_model_G: unified_voice2
[Training] [2023-12-17T22:33:07.534381] kwargs:[
[Training] [2023-12-17T22:33:07.537364] layers: 30
[Training] [2023-12-17T22:33:07.540479] model_dim: 1024
[Training] [2023-12-17T22:33:07.544549] heads: 16
[Training] [2023-12-17T22:33:07.547542] max_text_tokens: 402
[Training] [2023-12-17T22:33:07.550529] max_mel_tokens: 604
[Training] [2023-12-17T22:33:07.553517] max_conditioning_inputs: 2
[Training] [2023-12-17T22:33:07.557698] mel_length_compression: 1024
[Training] [2023-12-17T22:33:07.561724] number_text_tokens: 256
[Training] [2023-12-17T22:33:07.564717] number_mel_codes: 8194
[Training] [2023-12-17T22:33:07.567776] start_mel_token: 8192
[Training] [2023-12-17T22:33:07.572037] stop_mel_token: 8193
[Training] [2023-12-17T22:33:07.576133] start_text_token: 255
[Training] [2023-12-17T22:33:07.580201] train_solo_embeddings: False
[Training] [2023-12-17T22:33:07.583193] use_mel_codes_as_input: True
[Training] [2023-12-17T22:33:07.586198] checkpointing: True
[Training] [2023-12-17T22:33:07.591175] tortoise_compat: True
[Training] [2023-12-17T22:33:07.595079] ]
[Training] [2023-12-17T22:33:07.598745] ]
[Training] [2023-12-17T22:33:07.601701] ]
[Training] [2023-12-17T22:33:07.604688] path:[
[Training] [2023-12-17T22:33:07.609674] strict_load: True
[Training] [2023-12-17T22:33:07.612669] pretrain_model_gpt: ./models/tortoise/autoregressive.pth
[Training] [2023-12-17T22:33:07.616746] root: ./
[Training] [2023-12-17T22:33:07.619721] experiments_root: ./training\jack\finetune
[Training] [2023-12-17T22:33:07.622711] models: ./training\jack\finetune\models
[Training] [2023-12-17T22:33:07.626525] training_state: ./training\jack\finetune\training_state
[Training] [2023-12-17T22:33:07.630627] log: ./training\jack\finetune
[Training] [2023-12-17T22:33:07.634124] val_images: ./training\jack\finetune\val_images
[Training] [2023-12-17T22:33:07.637574] ]
[Training] [2023-12-17T22:33:07.642488] train:[
[Training] [2023-12-17T22:33:07.646480] niter: 5
[Training] [2023-12-17T22:33:07.649478] warmup_iter: -1
[Training] [2023-12-17T22:33:07.652459] mega_batch_factor: 1
[Training] [2023-12-17T22:33:07.656794] val_freq: 5
[Training] [2023-12-17T22:33:07.659825] ema_enabled: False
[Training] [2023-12-17T22:33:07.662814] default_lr_scheme: MultiStepLR
[Training] [2023-12-17T22:33:07.666801] gen_lr_steps: [2, 4, 9, 18, 25, 33, 50]
[Training] [2023-12-17T22:33:07.669756] lr_gamma: 0.5
[Training] [2023-12-17T22:33:07.673950] ]
[Training] [2023-12-17T22:33:07.682004] eval:[
[Training] [2023-12-17T22:33:07.690955] pure: False
[Training] [2023-12-17T22:33:07.695968] output_state: gen
[Training] [2023-12-17T22:33:07.701935] ]
[Training] [2023-12-17T22:33:07.705927] logger:[
[Training] [2023-12-17T22:33:07.710176] save_checkpoint_freq: 5
[Training] [2023-12-17T22:33:07.716168] visuals: ['gen', 'mel']
[Training] [2023-12-17T22:33:07.720425] visual_debug_rate: 5
[Training] [2023-12-17T22:33:07.727411] is_mel_spectrogram: True
[Training] [2023-12-17T22:33:07.738449] ]
[Training] [2023-12-17T22:33:07.745512] is_train: True
[Training] [2023-12-17T22:33:07.754564] dist: False
[Training] [2023-12-17T22:33:07.759781]
[Training] [2023-12-17T22:33:07.763825] 23-12-17 22:33:06.957 - INFO: Random seed: 5430
[Training] [2023-12-17T22:33:08.089639] 23-12-17 22:33:08.088 - INFO: Number of training data elements: 1, iters: 1
[Training] [2023-12-17T22:33:08.098615] 23-12-17 22:33:08.088 - INFO: Total epochs needed: 5 for iters 5
[Training] [2023-12-17T22:33:08.716558] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\transformers\configuration_utils.py:380: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
[Training] [2023-12-17T22:33:08.722919] warnings.warn(
[Training] [2023-12-17T22:33:13.663338] 23-12-17 22:33:13.663 - INFO: Loading model for [./models/tortoise/autoregressive.pth]
[Training] [2023-12-17T22:33:15.058962] 23-12-17 22:33:15.058 - INFO: Start training from epoch: 0, iter: 0
[Training] [2023-12-17T22:33:17.263563] [2023-12-17 22:33:17,263] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-12-17T22:33:17.267551] [2023-12-17 22:33:17,267] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-12-17T22:33:20.253042] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('C'), WindowsPath('/Users/Jay/anaconda3/envs/pytorch_tts/lib')}
[Training] [2023-12-17T22:33:20.254041] warn(
[Training] [2023-12-17T22:33:20.254041] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:93: UserWarning: C:\Users\Jay\anaconda3\envs\pytorch_tts did not contain libcudart.so as expected! Searching further paths...
[Training] [2023-12-17T22:33:20.254041] warn(
[Training] [2023-12-17T22:33:20.254041] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
[Training] [2023-12-17T22:33:20.254041] warn(
[Training] [2023-12-17T22:33:20.254041] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('C'), WindowsPath('/Users/Jay/anaconda3/envs/pytorch_tts/lib')}
[Training] [2023-12-17T22:33:20.255035] warn(
[Training] [2023-12-17T22:33:20.255035] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:93: UserWarning: C:\Users\Jay\anaconda3\envs\pytorch_tts did not contain libcudart.so as expected! Searching further paths...
[Training] [2023-12-17T22:33:20.255035] warn(
[Training] [2023-12-17T22:33:20.255035] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
[Training] [2023-12-17T22:33:20.255035] warn(
[Training] [2023-12-17T22:33:20.980822] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
[Training] [2023-12-17T22:33:20.981819] warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[Training] [2023-12-17T22:33:22.708261] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
[Training] [2023-12-17T22:33:22.708261] warnings.warn(
[Training] [2023-12-17T22:33:23.276018] 23-12-17 22:33:23.275 - INFO: Training Metrics: {"loss_text_ce": 5.98748254776001, "loss_mel_ce": 3.129274368286133, "loss_gpt_total": 3.1891491413116455, "lr": 5e-06, "it": 1, "step": 1, "steps": 1, "epoch": 0, "iteration_rate": 2.2932069301605225}
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\utils\deprecation.py:65: AltairDeprecationWarning: 'selection' is deprecated.
Use 'selection_point()' or 'selection_interval()' instead; these functions also include more helpful docstrings.
warnings.warn(message, AltairDeprecationWarning, stacklevel=1)
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\vegalite\v5\api.py:469: AltairDeprecationWarning: The types 'single' and 'multi' are now combined and should be specified using "selection_point()".
warnings.warn(
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\utils\deprecation.py:65: AltairDeprecationWarning: 'add_selection' is deprecated. Use 'add_params' instead.
warnings.warn(message, AltairDeprecationWarning, stacklevel=1)
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\utils\deprecation.py:65: AltairDeprecationWarning: 'selection' is deprecated.
Use 'selection_point()' or 'selection_interval()' instead; these functions also include more helpful docstrings.
warnings.warn(message, AltairDeprecationWarning, stacklevel=1)
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\vegalite\v5\api.py:469: AltairDeprecationWarning: The types 'single' and 'multi' are now combined and should be specified using "selection_point()".
warnings.warn(
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\utils\deprecation.py:65: AltairDeprecationWarning: 'add_selection' is deprecated. Use 'add_params' instead.
warnings.warn(message, AltairDeprecationWarning, stacklevel=1)
[Training] [2023-12-17T22:33:24.075711]
[Training] [2023-12-17T22:33:24.078703] ===================================BUG REPORT===================================
[Training] [2023-12-17T22:33:24.078703] Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
[Training] [2023-12-17T22:33:24.078703] For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
[Training] [2023-12-17T22:33:24.078703] ================================================================================
[Training] [2023-12-17T22:33:24.078703] CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
[Training] [2023-12-17T22:33:24.078703] WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
[Training] [2023-12-17T22:33:24.078703] CUDA SETUP: Loading binary C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
[Training] [2023-12-17T22:33:24.078703] Disabled distributed training.
[Training] [2023-12-17T22:33:24.078703] Path already exists. Rename it to [./training\jack\finetune_archived_231217-223306]
[Training] [2023-12-17T22:33:24.078703] Loading from ./models/tortoise/dvae.pth
[Training] [2023-12-17T22:33:24.079712] Traceback (most recent call last):
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\src\train.py", line 64, in
[Training] [2023-12-17T22:33:24.079712] train(config_path, args.launcher)
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\src\train.py", line 31, in train
[Training] [2023-12-17T22:33:24.079712] trainer.do_training()
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\train.py", line 408, in do_training
[Training] [2023-12-17T22:33:24.079712] metric = self.do_step(train_data)
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\train.py", line 271, in do_step
[Training] [2023-12-17T22:33:24.079712] gradient_norms_dict = self.model.optimize_parameters(
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\trainer\ExtensibleTrainer.py", line 396, in optimize_parameters
[Training] [2023-12-17T22:33:24.079712] self.consume_gradients(state, step, it)
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\trainer\ExtensibleTrainer.py", line 445, in consume_gradients
[Training] [2023-12-17T22:33:24.079712] step.do_step(it)
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\trainer\steps.py", line 398, in do_step
[Training] [2023-12-17T22:33:24.079712] self.scaler.step(opt)
[Training] [2023-12-17T22:33:24.079712] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 416, in step
[Training] [2023-12-17T22:33:24.079712] retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
[Training] [2023-12-17T22:33:24.079712] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 315, in _maybe_opt_step
[Training] [2023-12-17T22:33:24.080694] retval = optimizer.step(*args, **kwargs)
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\lr_scheduler.py", line 68, in wrapper
[Training] [2023-12-17T22:33:24.080694] return wrapped(*args, **kwargs)
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\optimizer.py", line 373, in wrapper
[Training] [2023-12-17T22:33:24.080694] out = func(*args, **kwargs)
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
[Training] [2023-12-17T22:33:24.080694] ret = func(self, *args, **kwargs)
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\adamw.py", line 184, in step
[Training] [2023-12-17T22:33:24.080694] adamw(
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\adamw.py", line 335, in adamw
[Training] [2023-12-17T22:33:24.080694] func(
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\adamw.py", line 599, in _multi_tensor_adamw
[Training] [2023-12-17T22:33:24.080694] exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[Training] [2023-12-17T22:33:24.080694] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Of the allocated memory 7.17 GiB is allocated by PyTorch, and 54.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[Training] [2023-12-17T22:33:36.188970]
Hey! From what I see in your logs, there might be a problem with an environment variable.
[Training] [2023-12-17T22:33:24.078703] CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
[Training] [2023-12-17T22:33:24.078703] WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
These messages appear repeatedly in your log.
Are you sure your LD_LIBRARY_PATH is correctly set to point at the CUDA installation of your conda environment?
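One quick way to verify what torch actually sees is a short sanity check run from inside the same conda environment; a minimal sketch, not specific to this repo:

```python
# Run inside the pytorch_tts conda env to confirm the CUDA runtime torch was built with.
import torch

print(torch.__version__)             # torch build
print(torch.version.cuda)            # should report 11.8 for this setup
print(torch.cuda.is_available())     # False would point to a broken CUDA install
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # the 8 GB card should show up here
```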
If I were you, I would just reinstall everything and run setup.bat / setup.sh again, just to be sure!
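If the environment checks out and training still OOMs, the allocator hint at the bottom of your traceback (max_split_size_mb) is also worth a try. A minimal sketch, assuming it runs before torch initializes CUDA (e.g. at the very top of src/train.py); the 128 MiB value is just a guess to start from:

```python
# Hypothetical tweak: must be set before the first CUDA allocation,
# per the PYTORCH_CUDA_ALLOC_CONF note in the OOM message above.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```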