torch.cuda.OutOfMemoryError: CUDA out of memory. OOM #464
Hi,
I am not able to run training on an 8 GB VRAM graphics card with CUDA 11.8. Any ideas? Thank you for your help.
I have tried the following (but still get CUDA out of memory):
Batch Size: 2
Gradient Accumulation Size: 1
In \modules\tortoise-tts\tortoise\api.py, set autoregressive_batch_size=1:
def __init__(self, autoregressive_batch_size=1, models_dir=MODELS_DIR, enable_redaction=True, device=None,
Only 2 wav files in the voices folder, 398 KB in total.
In the Settings tab, Low VRAM is checked.
In the Generate tab, Line Delimiter is set to \n.
Below are the logs:
To create a public link, set `share=True` in `launch()`.
Loading TorToiSe... (AR: ./models/tortoise/autoregressive.pth, diffusion: ./models/tortoise/diffusion_decoder.pth, vocoder: bigvgan_24khz_100band)
Hardware acceleration found: cuda
use_deepspeed api_debug False
Loading tokenizer JSON: ./modules/tortoise-tts/tortoise/data/tokenizer.json
Loaded tokenizer
Loading autoregressive model: ./models/tortoise/autoregressive.pth
Loaded autoregressive model
Loaded diffusion model
Loading vocoder model: bigvgan_24khz_100band
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Loading vocoder model: bigvgan_24khz_100band.pth
Removing weight norm...
Loaded vocoder model
Loaded TTS, ready for generation.
Unloaded TTS
Spawning process: train.bat ./training/jack/train.yaml
[Training] [2023-12-17T22:33:03.842674]
[Training] [2023-12-17T22:33:03.848591] (pytorch_tts) E:\ai-voice-cloning>#call .\venv\Scripts\activate.bat
[Training] [2023-12-17T22:33:03.854659] '#call' is not recognized as an internal or external command,
[Training] [2023-12-17T22:33:03.860181] operable program or batch file.
[Training] [2023-12-17T22:33:03.866641]
[Training] [2023-12-17T22:33:03.873753] (pytorch_tts) E:\ai-voice-cloning>set PYTHONUTF8=1
[Training] [2023-12-17T22:33:03.879305]
[Training] [2023-12-17T22:33:03.884244] (pytorch_tts) E:\ai-voice-cloning>python ./src/train.py --yaml "./training/jack/train.yaml"
[Training] [2023-12-17T22:33:05.457371] [2023-12-17 22:33:05,457] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-12-17T22:33:06.598164] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/Users/Jay/anaconda3/envs/pytorch_tts/lib'), WindowsPath('C')}
[Training] [2023-12-17T22:33:06.602153] warn(
[Training] [2023-12-17T22:33:06.604145] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:93: UserWarning: C:\Users\Jay\anaconda3\envs\pytorch_tts did not contain libcudart.so as expected! Searching further paths...
[Training] [2023-12-17T22:33:06.610273] warn(
[Training] [2023-12-17T22:33:06.616800] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
[Training] [2023-12-17T22:33:06.622087] warn(
[Training] [2023-12-17T22:33:06.959799] 23-12-17 22:33:06.957 - INFO: name: jack
[Training] [2023-12-17T22:33:06.962771] model: extensibletrainer
[Training] [2023-12-17T22:33:06.964764] scale: 1
[Training] [2023-12-17T22:33:06.967749] gpu_ids: [0]
[Training] [2023-12-17T22:33:06.970756] start_step: 0
[Training] [2023-12-17T22:33:06.974180] checkpointing_enabled: True
[Training] [2023-12-17T22:33:06.982167] fp16: True
[Training] [2023-12-17T22:33:06.989423] bitsandbytes: False
[Training] [2023-12-17T22:33:06.996402] gpus: 1
[Training] [2023-12-17T22:33:07.002384] datasets:[
[Training] [2023-12-17T22:33:07.012402] train:[
[Training] [2023-12-17T22:33:07.020621] name: training
[Training] [2023-12-17T22:33:07.025208] n_workers: 2
[Training] [2023-12-17T22:33:07.029199] batch_size: 1
[Training] [2023-12-17T22:33:07.036598] mode: paired_voice_audio
[Training] [2023-12-17T22:33:07.041644] path: ./training/jack/train.txt
[Training] [2023-12-17T22:33:07.047659] fetcher_mode: ['lj']
[Training] [2023-12-17T22:33:07.053604] phase: train
[Training] [2023-12-17T22:33:07.060579] max_wav_length: 255995
[Training] [2023-12-17T22:33:07.068554] max_text_length: 200
[Training] [2023-12-17T22:33:07.073004] sample_rate: 22050
[Training] [2023-12-17T22:33:07.076995] load_conditioning: True
[Training] [2023-12-17T22:33:07.081978] num_conditioning_candidates: 2
[Training] [2023-12-17T22:33:07.086964] conditioning_length: 44000
[Training] [2023-12-17T22:33:07.093234] use_bpe_tokenizer: True
[Training] [2023-12-17T22:33:07.099166] tokenizer_vocab: ./modules/tortoise-tts/tortoise/data/tokenizer.json
[Training] [2023-12-17T22:33:07.104281] load_aligned_codes: False
[Training] [2023-12-17T22:33:07.111298] data_type: img
[Training] [2023-12-17T22:33:07.116302] ]
[Training] [2023-12-17T22:33:07.121629] val:[
[Training] [2023-12-17T22:33:07.127268] name: validation
[Training] [2023-12-17T22:33:07.131611] n_workers: 2
[Training] [2023-12-17T22:33:07.135530] batch_size: 1
[Training] [2023-12-17T22:33:07.141512] mode: paired_voice_audio
[Training] [2023-12-17T22:33:07.145497] path: ./training/jack/validation.txt
[Training] [2023-12-17T22:33:07.149484] fetcher_mode: ['lj']
[Training] [2023-12-17T22:33:07.153471] phase: val
[Training] [2023-12-17T22:33:07.157498] max_wav_length: 255995
[Training] [2023-12-17T22:33:07.162592] max_text_length: 200
[Training] [2023-12-17T22:33:07.166623] sample_rate: 22050
[Training] [2023-12-17T22:33:07.170570] load_conditioning: True
[Training] [2023-12-17T22:33:07.175552] num_conditioning_candidates: 2
[Training] [2023-12-17T22:33:07.178542] conditioning_length: 44000
[Training] [2023-12-17T22:33:07.182529] use_bpe_tokenizer: True
[Training] [2023-12-17T22:33:07.186674] tokenizer_vocab: ./modules/tortoise-tts/tortoise/data/tokenizer.json
[Training] [2023-12-17T22:33:07.190767] load_aligned_codes: False
[Training] [2023-12-17T22:33:07.194856] data_type: img
[Training] [2023-12-17T22:33:07.198741] ]
[Training] [2023-12-17T22:33:07.203166] ]
[Training] [2023-12-17T22:33:07.207154] steps:[
[Training] [2023-12-17T22:33:07.212417] gpt_train:[
[Training] [2023-12-17T22:33:07.215360] training: gpt
[Training] [2023-12-17T22:33:07.219346] loss_log_buffer: 500
[Training] [2023-12-17T22:33:07.225051] optimizer: adamw
[Training] [2023-12-17T22:33:07.230040] optimizer_params:[
[Training] [2023-12-17T22:33:07.236358] lr: 1e-05
[Training] [2023-12-17T22:33:07.242294] weight_decay: 0.01
[Training] [2023-12-17T22:33:07.247213] beta1: 0.9
[Training] [2023-12-17T22:33:07.253192] beta2: 0.96
[Training] [2023-12-17T22:33:07.259590] ]
[Training] [2023-12-17T22:33:07.263668] clip_grad_eps: 4
[Training] [2023-12-17T22:33:07.272722] injectors:[
[Training] [2023-12-17T22:33:07.280747] paired_to_mel:[
[Training] [2023-12-17T22:33:07.287764] type: torch_mel_spectrogram
[Training] [2023-12-17T22:33:07.298639] mel_norm_file: ./modules/tortoise-tts/tortoise/data/mel_norms.pth
[Training] [2023-12-17T22:33:07.307023] in: wav
[Training] [2023-12-17T22:33:07.313970] out: paired_mel
[Training] [2023-12-17T22:33:07.320014] ]
[Training] [2023-12-17T22:33:07.327532] paired_cond_to_mel:[
[Training] [2023-12-17T22:33:07.334438] type: for_each
[Training] [2023-12-17T22:33:07.341640] subtype: torch_mel_spectrogram
[Training] [2023-12-17T22:33:07.346632] mel_norm_file: ./modules/tortoise-tts/tortoise/data/mel_norms.pth
[Training] [2023-12-17T22:33:07.353607] in: conditioning
[Training] [2023-12-17T22:33:07.364579] out: paired_conditioning_mel
[Training] [2023-12-17T22:33:07.373781] ]
[Training] [2023-12-17T22:33:07.380763] to_codes:[
[Training] [2023-12-17T22:33:07.394721] type: discrete_token
[Training] [2023-12-17T22:33:07.408670] in: paired_mel
[Training] [2023-12-17T22:33:07.416648] out: paired_mel_codes
[Training] [2023-12-17T22:33:07.427133] dvae_config: ./models/tortoise/train_diffusion_vocoder_22k_level.yml
[Training] [2023-12-17T22:33:07.435448] ]
[Training] [2023-12-17T22:33:07.441470] paired_fwd_text:[
[Training] [2023-12-17T22:33:07.445414] type: generator
[Training] [2023-12-17T22:33:07.449692] generator: gpt
[Training] [2023-12-17T22:33:07.452682] in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
[Training] [2023-12-17T22:33:07.455865] out: ['loss_text_ce', 'loss_mel_ce', 'logits']
[Training] [2023-12-17T22:33:07.460182] ]
[Training] [2023-12-17T22:33:07.464168] ]
[Training] [2023-12-17T22:33:07.467158] losses:[
[Training] [2023-12-17T22:33:07.470244] text_ce:[
[Training] [2023-12-17T22:33:07.473607] type: direct
[Training] [2023-12-17T22:33:07.477697] weight: 0.01
[Training] [2023-12-17T22:33:07.481077] key: loss_text_ce
[Training] [2023-12-17T22:33:07.484312] ]
[Training] [2023-12-17T22:33:07.488394] mel_ce:[
[Training] [2023-12-17T22:33:07.491792] type: direct
[Training] [2023-12-17T22:33:07.495130] weight: 1
[Training] [2023-12-17T22:33:07.498444] key: loss_mel_ce
[Training] [2023-12-17T22:33:07.501721] ]
[Training] [2023-12-17T22:33:07.505988] ]
[Training] [2023-12-17T22:33:07.510105] ]
[Training] [2023-12-17T22:33:07.514182] ]
[Training] [2023-12-17T22:33:07.518176] networks:[
[Training] [2023-12-17T22:33:07.521157] gpt:[
[Training] [2023-12-17T22:33:07.525324] type: generator
[Training] [2023-12-17T22:33:07.530398] which_model_G: unified_voice2
[Training] [2023-12-17T22:33:07.534381] kwargs:[
[Training] [2023-12-17T22:33:07.537364] layers: 30
[Training] [2023-12-17T22:33:07.540479] model_dim: 1024
[Training] [2023-12-17T22:33:07.544549] heads: 16
[Training] [2023-12-17T22:33:07.547542] max_text_tokens: 402
[Training] [2023-12-17T22:33:07.550529] max_mel_tokens: 604
[Training] [2023-12-17T22:33:07.553517] max_conditioning_inputs: 2
[Training] [2023-12-17T22:33:07.557698] mel_length_compression: 1024
[Training] [2023-12-17T22:33:07.561724] number_text_tokens: 256
[Training] [2023-12-17T22:33:07.564717] number_mel_codes: 8194
[Training] [2023-12-17T22:33:07.567776] start_mel_token: 8192
[Training] [2023-12-17T22:33:07.572037] stop_mel_token: 8193
[Training] [2023-12-17T22:33:07.576133] start_text_token: 255
[Training] [2023-12-17T22:33:07.580201] train_solo_embeddings: False
[Training] [2023-12-17T22:33:07.583193] use_mel_codes_as_input: True
[Training] [2023-12-17T22:33:07.586198] checkpointing: True
[Training] [2023-12-17T22:33:07.591175] tortoise_compat: True
[Training] [2023-12-17T22:33:07.595079] ]
[Training] [2023-12-17T22:33:07.598745] ]
[Training] [2023-12-17T22:33:07.601701] ]
[Training] [2023-12-17T22:33:07.604688] path:[
[Training] [2023-12-17T22:33:07.609674] strict_load: True
[Training] [2023-12-17T22:33:07.612669] pretrain_model_gpt: ./models/tortoise/autoregressive.pth
[Training] [2023-12-17T22:33:07.616746] root: ./
[Training] [2023-12-17T22:33:07.619721] experiments_root: ./training\jack\finetune
[Training] [2023-12-17T22:33:07.622711] models: ./training\jack\finetune\models
[Training] [2023-12-17T22:33:07.626525] training_state: ./training\jack\finetune\training_state
[Training] [2023-12-17T22:33:07.630627] log: ./training\jack\finetune
[Training] [2023-12-17T22:33:07.634124] val_images: ./training\jack\finetune\val_images
[Training] [2023-12-17T22:33:07.637574] ]
[Training] [2023-12-17T22:33:07.642488] train:[
[Training] [2023-12-17T22:33:07.646480] niter: 5
[Training] [2023-12-17T22:33:07.649478] warmup_iter: -1
[Training] [2023-12-17T22:33:07.652459] mega_batch_factor: 1
[Training] [2023-12-17T22:33:07.656794] val_freq: 5
[Training] [2023-12-17T22:33:07.659825] ema_enabled: False
[Training] [2023-12-17T22:33:07.662814] default_lr_scheme: MultiStepLR
[Training] [2023-12-17T22:33:07.666801] gen_lr_steps: [2, 4, 9, 18, 25, 33, 50]
[Training] [2023-12-17T22:33:07.669756] lr_gamma: 0.5
[Training] [2023-12-17T22:33:07.673950] ]
[Training] [2023-12-17T22:33:07.682004] eval:[
[Training] [2023-12-17T22:33:07.690955] pure: False
[Training] [2023-12-17T22:33:07.695968] output_state: gen
[Training] [2023-12-17T22:33:07.701935] ]
[Training] [2023-12-17T22:33:07.705927] logger:[
[Training] [2023-12-17T22:33:07.710176] save_checkpoint_freq: 5
[Training] [2023-12-17T22:33:07.716168] visuals: ['gen', 'mel']
[Training] [2023-12-17T22:33:07.720425] visual_debug_rate: 5
[Training] [2023-12-17T22:33:07.727411] is_mel_spectrogram: True
[Training] [2023-12-17T22:33:07.738449] ]
[Training] [2023-12-17T22:33:07.745512] is_train: True
[Training] [2023-12-17T22:33:07.754564] dist: False
[Training] [2023-12-17T22:33:07.759781]
[Training] [2023-12-17T22:33:07.763825] 23-12-17 22:33:06.957 - INFO: Random seed: 5430
[Training] [2023-12-17T22:33:08.089639] 23-12-17 22:33:08.088 - INFO: Number of training data elements: 1, iters: 1
[Training] [2023-12-17T22:33:08.098615] 23-12-17 22:33:08.088 - INFO: Total epochs needed: 5 for iters 5
[Training] [2023-12-17T22:33:08.716558] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\transformers\configuration_utils.py:380: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
[Training] [2023-12-17T22:33:08.722919] warnings.warn(
[Training] [2023-12-17T22:33:13.663338] 23-12-17 22:33:13.663 - INFO: Loading model for [./models/tortoise/autoregressive.pth]
[Training] [2023-12-17T22:33:15.058962] 23-12-17 22:33:15.058 - INFO: Start training from epoch: 0, iter: 0
[Training] [2023-12-17T22:33:17.263563] [2023-12-17 22:33:17,263] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-12-17T22:33:17.267551] [2023-12-17 22:33:17,267] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[Training] [2023-12-17T22:33:20.253042] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('C'), WindowsPath('/Users/Jay/anaconda3/envs/pytorch_tts/lib')}
[Training] [2023-12-17T22:33:20.254041] warn(
[Training] [2023-12-17T22:33:20.254041] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:93: UserWarning: C:\Users\Jay\anaconda3\envs\pytorch_tts did not contain libcudart.so as expected! Searching further paths...
[Training] [2023-12-17T22:33:20.254041] warn(
[Training] [2023-12-17T22:33:20.254041] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
[Training] [2023-12-17T22:33:20.254041] warn(
[Training] [2023-12-17T22:33:20.254041] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('C'), WindowsPath('/Users/Jay/anaconda3/envs/pytorch_tts/lib')}
[Training] [2023-12-17T22:33:20.255035] warn(
[Training] [2023-12-17T22:33:20.255035] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:93: UserWarning: C:\Users\Jay\anaconda3\envs\pytorch_tts did not contain libcudart.so as expected! Searching further paths...
[Training] [2023-12-17T22:33:20.255035] warn(
[Training] [2023-12-17T22:33:20.255035] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
[Training] [2023-12-17T22:33:20.255035] warn(
[Training] [2023-12-17T22:33:20.980822] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
[Training] [2023-12-17T22:33:20.981819] warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[Training] [2023-12-17T22:33:22.708261] C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
[Training] [2023-12-17T22:33:22.708261] warnings.warn(
[Training] [2023-12-17T22:33:23.276018] 23-12-17 22:33:23.275 - INFO: Training Metrics: {"loss_text_ce": 5.98748254776001, "loss_mel_ce": 3.129274368286133, "loss_gpt_total": 3.1891491413116455, "lr": 5e-06, "it": 1, "step": 1, "steps": 1, "epoch": 0, "iteration_rate": 2.2932069301605225}
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\utils\deprecation.py:65: AltairDeprecationWarning: 'selection' is deprecated.
Use 'selection_point()' or 'selection_interval()' instead; these functions also include more helpful docstrings.
warnings.warn(message, AltairDeprecationWarning, stacklevel=1)
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\vegalite\v5\api.py:469: AltairDeprecationWarning: The types 'single' and 'multi' are now combined and should be specified using "selection_point()".
warnings.warn(
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\utils\deprecation.py:65: AltairDeprecationWarning: 'add_selection' is deprecated. Use 'add_params' instead.
warnings.warn(message, AltairDeprecationWarning, stacklevel=1)
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\utils\deprecation.py:65: AltairDeprecationWarning: 'selection' is deprecated.
Use 'selection_point()' or 'selection_interval()' instead; these functions also include more helpful docstrings.
warnings.warn(message, AltairDeprecationWarning, stacklevel=1)
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\vegalite\v5\api.py:469: AltairDeprecationWarning: The types 'single' and 'multi' are now combined and should be specified using "selection_point()".
warnings.warn(
C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\altair\utils\deprecation.py:65: AltairDeprecationWarning: 'add_selection' is deprecated. Use 'add_params' instead.
warnings.warn(message, AltairDeprecationWarning, stacklevel=1)
[Training] [2023-12-17T22:33:24.075711]
[Training] [2023-12-17T22:33:24.078703] ===================================BUG REPORT===================================
[Training] [2023-12-17T22:33:24.078703] Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
[Training] [2023-12-17T22:33:24.078703] For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
[Training] [2023-12-17T22:33:24.078703] ================================================================================
[Training] [2023-12-17T22:33:24.078703] CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
[Training] [2023-12-17T22:33:24.078703] WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
[Training] [2023-12-17T22:33:24.078703] CUDA SETUP: Loading binary C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
[Training] [2023-12-17T22:33:24.078703] Disabled distributed training.
[Training] [2023-12-17T22:33:24.078703] Path already exists. Rename it to [./training\jack\finetune_archived_231217-223306]
[Training] [2023-12-17T22:33:24.078703] Loading from ./models/tortoise/dvae.pth
[Training] [2023-12-17T22:33:24.079712] Traceback (most recent call last):
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\src\train.py", line 64, in
[Training] [2023-12-17T22:33:24.079712] train(config_path, args.launcher)
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\src\train.py", line 31, in train
[Training] [2023-12-17T22:33:24.079712] trainer.do_training()
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\train.py", line 408, in do_training
[Training] [2023-12-17T22:33:24.079712] metric = self.do_step(train_data)
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\train.py", line 271, in do_step
[Training] [2023-12-17T22:33:24.079712] gradient_norms_dict = self.model.optimize_parameters(
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\trainer\ExtensibleTrainer.py", line 396, in optimize_parameters
[Training] [2023-12-17T22:33:24.079712] self.consume_gradients(state, step, it)
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\trainer\ExtensibleTrainer.py", line 445, in consume_gradients
[Training] [2023-12-17T22:33:24.079712] step.do_step(it)
[Training] [2023-12-17T22:33:24.079712] File "E:\ai-voice-cloning\modules\dlas\dlas\trainer\steps.py", line 398, in do_step
[Training] [2023-12-17T22:33:24.079712] self.scaler.step(opt)
[Training] [2023-12-17T22:33:24.079712] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 416, in step
[Training] [2023-12-17T22:33:24.079712] retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
[Training] [2023-12-17T22:33:24.079712] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 315, in _maybe_opt_step
[Training] [2023-12-17T22:33:24.080694] retval = optimizer.step(*args, **kwargs)
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\lr_scheduler.py", line 68, in wrapper
[Training] [2023-12-17T22:33:24.080694] return wrapped(*args, **kwargs)
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\optimizer.py", line 373, in wrapper
[Training] [2023-12-17T22:33:24.080694] out = func(*args, **kwargs)
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
[Training] [2023-12-17T22:33:24.080694] ret = func(self, *args, **kwargs)
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\adamw.py", line 184, in step
[Training] [2023-12-17T22:33:24.080694] adamw(
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\adamw.py", line 335, in adamw
[Training] [2023-12-17T22:33:24.080694] func(
[Training] [2023-12-17T22:33:24.080694] File "C:\Users\Jay\anaconda3\envs\pytorch_tts\lib\site-packages\torch\optim\adamw.py", line 599, in _multi_tensor_adamw
[Training] [2023-12-17T22:33:24.080694] exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[Training] [2023-12-17T22:33:24.080694] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Of the allocated memory 7.17 GiB is allocated by PyTorch, and 54.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[Training] [2023-12-17T22:33:36.188970]
Hey! From what I see in your logs, there might be a problem with an environment variable.
[Training] [2023-12-17T22:33:24.078703] CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
[Training] [2023-12-17T22:33:24.078703] WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
These messages appear repeatedly in your log.
Are you sure your LD_LIBRARY_PATH is correctly set to point at the CUDA installation of your conda environment?
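One quick way to verify what torch actually sees is a short sanity check run from inside the same conda environment; a minimal sketch, not specific to this repo:

```python
# Run inside the pytorch_tts conda env to confirm the CUDA runtime torch was built with.
import torch

print(torch.__version__)             # torch build
print(torch.version.cuda)            # should report 11.8 for this setup
print(torch.cuda.is_available())     # False would point to a broken CUDA install
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # the 8 GB card should show up here
```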
If I were you, I would just reinstall everything and run setup.bat / setup.sh again, just to be sure!
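If the environment checks out and training still OOMs, the allocator hint at the bottom of your traceback (max_split_size_mb) is also worth a try. A minimal sketch, assuming it runs before torch initializes CUDA (e.g. at the very top of src/train.py); the 128 MiB value is just a guess to start from:

```python
# Hypothetical tweak: must be set before the first CUDA allocation,
# per the PYTORCH_CUDA_ALLOC_CONF note in the OOM message above.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```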