Question about saving and quitting vall-e #5

Closed
opened 2023-08-24 18:07:08 +07:00 by Bluebomber182 · 8 comments

The quitting instructions say you can save and quit by typing `quit`, but that didn't work for me. Will pressing Ctrl+C let me save and quit?



Type `quit` and hit Enter.

I typed `quit` and then pressed Enter in this terminal window, and nothing happened.


Nothing's being trained at all anyway. Either the dataloaders didn't load anything (you can verify this when it prints out the symmap / speakers / sample counts / duration), or your batch size is larger than your dataset size; TorToiSe/DLAS does the exact same thing.


It can't be the batch size, because I used the "validate training configuration" button in the GUI and set it to 8. Does it have anything to do with the location of the `train.txt` file and the audio files folder? These are the current locations:
`/ai-voice-cloning/training/Voice/train.txt`
`/ai-voice-cloning/training/Voice/audio/`



In your training YAML, change `dataset.sample_type` to `path` instead of `speaker`. Additionally, ensure that your `./training/{voice}/` folder contains `ckpt/{,n}ar-retnet-4` to finetune, or it will train from scratch.
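As a sketch, the relevant part of the training YAML would look something like the fragment below (the key name comes from the advice above; the exact surrounding structure may differ between versions). Note that `ckpt/{,n}ar-retnet-4` is brace-expansion shorthand for both `ckpt/ar-retnet-4` and `ckpt/nar-retnet-4`.

```yaml
dataset:
  # "speaker" batches samples per speaker; "path" samples individual
  # utterance paths, which is what a single-voice finetune wants
  sample_type: path
```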

Thank you. I get this error now.

2023-08-24 15:23:05 - vall_e.utils.trainer - INFO - GR=0;LR=0 -
New epoch starts.
Epoch progress: 0%| | 0/13 [00:00<?, ?it/s][2023-08-24 15:23:05,228] [INFO] [scheduler.py:76:check_weight_quantization] Weight quantization is enabled at step 0
Forward "triu_tril_cuda_template" not implemented for 'BFloat16'
[2023-08-24 15:23:05,465] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 0 is about to be saved!
/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-08-24 15:23:05,476] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: training/voice/ckpt/ar-retnet-4/0/mp_rank_00_model_states.pt
[2023-08-24 15:23:05,476] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving training/voice/ckpt/ar-retnet-4/0/mp_rank_00_model_states.pt...
[2023-08-24 15:23:05,745] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved training/voice/ckpt/ar-retnet-4/0/mp_rank_00_model_states.pt.
[2023-08-24 15:23:05,745] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving training/voice/ckpt/ar-retnet-4/0/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-08-24 15:23:06,813] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved training/voice/ckpt/ar-retnet-4/0/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-08-24 15:23:06,814] [INFO] [engine.py:3322:_save_zero_checkpoint] zero checkpoint saved training/voice/ckpt/ar-retnet-4/0/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-08-24 15:23:06,814] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 0 is ready now!
[2023-08-24 15:23:06,815] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 0 is about to be saved!
[2023-08-24 15:23:06,821] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: training/voice/ckpt/nar-retnet-4/0/mp_rank_00_model_states.pt
[2023-08-24 15:23:06,821] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving training/voice/ckpt/nar-retnet-4/0/mp_rank_00_model_states.pt...
[2023-08-24 15:23:07,100] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved training/voice/ckpt/nar-retnet-4/0/mp_rank_00_model_states.pt.
[2023-08-24 15:23:07,100] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving training/voice/ckpt/nar-retnet-4/0/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-08-24 15:23:11,576] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved training/voice/ckpt/nar-retnet-4/0/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-08-24 15:23:11,576] [INFO] [engine.py:3322:_save_zero_checkpoint] zero checkpoint saved training/voice/ckpt/nar-retnet-4/0/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-08-24 15:23:11,576] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 0 is ready now!
Epoch progress: 0%| | 0/13 [00:06<?, ?it/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/train.py", line 170, in <module>
main()
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/train.py", line 163, in main
trainer.train(
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/utils/trainer.py", line 253, in train
stats = engines.step(batch=batch, feeder=train_feeder)
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/engines/base.py", line 354, in step
raise e
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/engines/base.py", line 347, in step
res = feeder( engine=engine, batch=batch )
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/train.py", line 26, in train_feeder
engine(
File "/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1768, in forward
loss = self.module(*inputs, **kwargs)
File "/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/models/ar.py", line 70, in forward
return super().forward(
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/models/base.py", line 292, in forward
x, _ = self.retnet(x, incremental_state=state, token_embeddings=x, features_only=True)
File "/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/ai-voice-cloning/modules/vall-e/vall_e/models/retnet.py", line 59, in forward
return super().forward(src_tokens, **kwargs)
File "/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/torchscale/architecture/retnet.py", line 362, in forward
retention_rel_pos = self.retnet_rel_pos(slen, incremental_state is not None and not is_first_step, chunkwise_recurrent=self.chunkwise_recurrent)
File "/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/ai-voice-cloning/venv/lib/python3.10/site-packages/torchscale/architecture/retnet.py", line 45, in forward
mask = torch.tril(torch.ones(self.recurrent_chunk_size, self.recurrent_chunk_size).to(self.decay))
RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'
[2023-08-24 15:23:12,896] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 581
[2023-08-24 15:23:12,896] [ERROR] [launch.py:321:sigkill_handler] ['/home/user/ai-voice-cloning/venv/bin/python3.10', '-u', '-m', 'vall_e.train', '--local_rank=0', 'yaml=/home/user/ai-voice-cloning/training/voice/config.yaml'] exits with return code = 1



In your training YAML, under `trainer.weight_dtype`, set it to `float32` or `float16`.
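For illustration, the change might look like this (a sketch; the exact nesting of the `trainer` section may vary between versions):

```yaml
trainer:
  # bfloat16 triggers the "triu_tril_cuda_template" error on this
  # torch build; float32 (or float16) avoids it
  weight_dtype: float32
```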


Thank you! That did it! Typing `quit` and pressing Enter now works too.
Reference: mrq/vall-e#5