./train.sh: line 12: 36088 Segmentation fault #107
I'm having this problem; I reinstalled everything several times and it keeps happening. When I go to train, I immediately get this:
To create a public link, set `share=True` in `launch()`.
Loading specialized model for language: en
Loading Whisper model: base.en
Loaded Whisper model
Transcribed file: ./voices/dennis/audio1.wav, 43 found.
Unloaded Whisper
Culled 15 lines
Spawning process: ./train.sh 1 ./training/dennis/train.yaml
[Training] [2023-03-10T06:21:16.798310] 23-03-10 06:21:16.798 - INFO: name: dennis
[Training] [2023-03-10T06:21:16.801605] model: extensibletrainer
[Training] [2023-03-10T06:21:16.804485] scale: 1
[Training] [2023-03-10T06:21:16.807994] gpu_ids: [0]
[Training] [2023-03-10T06:21:16.810449] start_step: 0
[Training] [2023-03-10T06:21:16.812977] checkpointing_enabled: True
[Training] [2023-03-10T06:21:16.815432] fp16: False
[Training] [2023-03-10T06:21:16.817761] bitsandbytes: True
[Training] [2023-03-10T06:21:16.820087] gpus: 1
[Training] [2023-03-10T06:21:16.822301] wandb: False
[Training] [2023-03-10T06:21:16.824671] use_tb_logger: True
[Training] [2023-03-10T06:21:16.827204] datasets:[
[Training] [2023-03-10T06:21:16.829419] train:[
[Training] [2023-03-10T06:21:16.831741] name: training
[Training] [2023-03-10T06:21:16.833977] n_workers: 1
[Training] [2023-03-10T06:21:16.836294] batch_size: 29
[Training] [2023-03-10T06:21:16.838696] mode: paired_voice_audio
[Training] [2023-03-10T06:21:16.841097] path: ./training/dennis/train.txt
[Training] [2023-03-10T06:21:16.843316] fetcher_mode: ['lj']
[Training] [2023-03-10T06:21:16.845753] phase: train
[Training] [2023-03-10T06:21:16.847942] max_wav_length: 255995
[Training] [2023-03-10T06:21:16.850195] max_text_length: 200
[Training] [2023-03-10T06:21:16.852767] sample_rate: 22050
[Training] [2023-03-10T06:21:16.855026] load_conditioning: True
[Training] [2023-03-10T06:21:16.857346] num_conditioning_candidates: 2
[Training] [2023-03-10T06:21:16.859549] conditioning_length: 44000
[Training] [2023-03-10T06:21:16.861870] use_bpe_tokenizer: True
[Training] [2023-03-10T06:21:16.864039] tokenizer_vocab: ./models/tortoise/bpe_lowercase_asr_256.json
[Training] [2023-03-10T06:21:16.866465] load_aligned_codes: False
[Training] [2023-03-10T06:21:16.868817] data_type: img
[Training] [2023-03-10T06:21:16.871223] ]
[Training] [2023-03-10T06:21:16.873617] val:[
[Training] [2023-03-10T06:21:16.875959] name: validation
[Training] [2023-03-10T06:21:16.878214] n_workers: 1
[Training] [2023-03-10T06:21:16.880626] batch_size: 15
[Training] [2023-03-10T06:21:16.883031] mode: paired_voice_audio
[Training] [2023-03-10T06:21:16.885198] path: ./training/dennis/validation.txt
[Training] [2023-03-10T06:21:16.887481] fetcher_mode: ['lj']
[Training] [2023-03-10T06:21:16.889639] phase: val
[Training] [2023-03-10T06:21:16.891812] max_wav_length: 255995
[Training] [2023-03-10T06:21:16.893914] max_text_length: 200
[Training] [2023-03-10T06:21:16.896043] sample_rate: 22050
[Training] [2023-03-10T06:21:16.898170] load_conditioning: True
[Training] [2023-03-10T06:21:16.900341] num_conditioning_candidates: 2
[Training] [2023-03-10T06:21:16.902578] conditioning_length: 44000
[Training] [2023-03-10T06:21:16.904891] use_bpe_tokenizer: True
[Training] [2023-03-10T06:21:16.907171] tokenizer_vocab: ./models/tortoise/bpe_lowercase_asr_256.json
[Training] [2023-03-10T06:21:16.909332] load_aligned_codes: False
[Training] [2023-03-10T06:21:16.911447] data_type: img
[Training] [2023-03-10T06:21:16.913525] ]
[Training] [2023-03-10T06:21:16.915688] ]
[Training] [2023-03-10T06:21:16.917800] steps:[
[Training] [2023-03-10T06:21:16.919915] gpt_train:[
[Training] [2023-03-10T06:21:16.922000] training: gpt
[Training] [2023-03-10T06:21:16.924152] loss_log_buffer: 500
[Training] [2023-03-10T06:21:16.926196] optimizer: adamw
[Training] [2023-03-10T06:21:16.928294] optimizer_params:[
[Training] [2023-03-10T06:21:16.930324] lr: 1e-05
[Training] [2023-03-10T06:21:16.932394] weight_decay: 0.01
[Training] [2023-03-10T06:21:16.934428] beta1: 0.9
[Training] [2023-03-10T06:21:16.936702] beta2: 0.96
[Training] [2023-03-10T06:21:16.938800] ]
[Training] [2023-03-10T06:21:16.940949] clip_grad_eps: 4
[Training] [2023-03-10T06:21:16.943008] injectors:[
[Training] [2023-03-10T06:21:16.945151] paired_to_mel:[
[Training] [2023-03-10T06:21:16.947228] type: torch_mel_spectrogram
[Training] [2023-03-10T06:21:16.949388] mel_norm_file: ./models/tortoise/clips_mel_norms.pth
[Training] [2023-03-10T06:21:16.951473] in: wav
[Training] [2023-03-10T06:21:16.953635] out: paired_mel
[Training] [2023-03-10T06:21:16.955667] ]
[Training] [2023-03-10T06:21:16.957802] paired_cond_to_mel:[
[Training] [2023-03-10T06:21:16.959840] type: for_each
[Training] [2023-03-10T06:21:16.961977] subtype: torch_mel_spectrogram
[Training] [2023-03-10T06:21:16.964008] mel_norm_file: ./models/tortoise/clips_mel_norms.pth
[Training] [2023-03-10T06:21:16.966262] in: conditioning
[Training] [2023-03-10T06:21:16.968320] out: paired_conditioning_mel
[Training] [2023-03-10T06:21:16.970511] ]
[Training] [2023-03-10T06:21:16.972545] to_codes:[
[Training] [2023-03-10T06:21:16.974707] type: discrete_token
[Training] [2023-03-10T06:21:16.976734] in: paired_mel
[Training] [2023-03-10T06:21:16.978893] out: paired_mel_codes
[Training] [2023-03-10T06:21:16.981017] dvae_config: ./models/tortoise/train_diffusion_vocoder_22k_level.yml
[Training] [2023-03-10T06:21:16.983260] ]
[Training] [2023-03-10T06:21:16.985487] paired_fwd_text:[
[Training] [2023-03-10T06:21:16.990201] type: generator
[Training] [2023-03-10T06:21:16.994850] generator: gpt
[Training] [2023-03-10T06:21:16.997032] in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
[Training] [2023-03-10T06:21:16.999271] out: ['loss_text_ce', 'loss_mel_ce', 'logits']
[Training] [2023-03-10T06:21:17.001422] ]
[Training] [2023-03-10T06:21:17.003635] ]
[Training] [2023-03-10T06:21:17.005755] losses:[
[Training] [2023-03-10T06:21:17.007987] text_ce:[
[Training] [2023-03-10T06:21:17.010159] type: direct
[Training] [2023-03-10T06:21:17.012379] weight: 0.01
[Training] [2023-03-10T06:21:17.014535] key: loss_text_ce
[Training] [2023-03-10T06:21:17.016732] ]
[Training] [2023-03-10T06:21:17.018870] mel_ce:[
[Training] [2023-03-10T06:21:17.021230] type: direct
[Training] [2023-03-10T06:21:17.023407] weight: 1
[Training] [2023-03-10T06:21:17.025621] key: loss_mel_ce
[Training] [2023-03-10T06:21:17.027851] ]
[Training] [2023-03-10T06:21:17.030045] ]
[Training] [2023-03-10T06:21:17.032294] ]
[Training] [2023-03-10T06:21:17.034498] ]
[Training] [2023-03-10T06:21:17.036720] networks:[
[Training] [2023-03-10T06:21:17.038879] gpt:[
[Training] [2023-03-10T06:21:17.041189] type: generator
[Training] [2023-03-10T06:21:17.043291] which_model_G: unified_voice2
[Training] [2023-03-10T06:21:17.045538] kwargs:[
[Training] [2023-03-10T06:21:17.047740] layers: 30
[Training] [2023-03-10T06:21:17.049951] model_dim: 1024
[Training] [2023-03-10T06:21:17.052386] heads: 16
[Training] [2023-03-10T06:21:17.054575] max_text_tokens: 402
[Training] [2023-03-10T06:21:17.056737] max_mel_tokens: 604
[Training] [2023-03-10T06:21:17.058974] max_conditioning_inputs: 2
[Training] [2023-03-10T06:21:17.061189] mel_length_compression: 1024
[Training] [2023-03-10T06:21:17.063381] number_text_tokens: 256
[Training] [2023-03-10T06:21:17.065580] number_mel_codes: 8194
[Training] [2023-03-10T06:21:17.067778] start_mel_token: 8192
[Training] [2023-03-10T06:21:17.069990] stop_mel_token: 8193
[Training] [2023-03-10T06:21:17.072242] start_text_token: 255
[Training] [2023-03-10T06:21:17.074523] train_solo_embeddings: False
[Training] [2023-03-10T06:21:17.076700] use_mel_codes_as_input: True
[Training] [2023-03-10T06:21:17.078932] checkpointing: True
[Training] [2023-03-10T06:21:17.081189] tortoise_compat: True
[Training] [2023-03-10T06:21:17.083450] ]
[Training] [2023-03-10T06:21:17.085585] ]
[Training] [2023-03-10T06:21:17.087856] ]
[Training] [2023-03-10T06:21:17.089995] path:[
[Training] [2023-03-10T06:21:17.092198] strict_load: True
[Training] [2023-03-10T06:21:17.094287] pretrain_model_gpt: /home/rodrigez/ai-voice-cloning/models/tortoise/autoregressive.pth
[Training] [2023-03-10T06:21:17.096397] root: ./
[Training] [2023-03-10T06:21:17.098435] experiments_root: ./training/dennis/finetune
[Training] [2023-03-10T06:21:17.100560] models: ./training/dennis/finetune/models
[Training] [2023-03-10T06:21:17.102660] training_state: ./training/dennis/finetune/training_state
[Training] [2023-03-10T06:21:17.104838] log: ./training/dennis/finetune
[Training] [2023-03-10T06:21:17.106915] val_images: ./training/dennis/finetune/val_images
[Training] [2023-03-10T06:21:17.109050] ]
[Training] [2023-03-10T06:21:17.111123] train:[
[Training] [2023-03-10T06:21:17.113243] niter: 4824
[Training] [2023-03-10T06:21:17.115348] warmup_iter: -1
[Training] [2023-03-10T06:21:17.117592] mega_batch_factor: 1
[Training] [2023-03-10T06:21:17.119701] val_freq: 4
[Training] [2023-03-10T06:21:17.121806] ema_enabled: False
[Training] [2023-03-10T06:21:17.123971] default_lr_scheme: MultiStepLR
[Training] [2023-03-10T06:21:17.126093] gen_lr_steps: [0, 0, 0, 0]
[Training] [2023-03-10T06:21:17.128227] lr_gamma: 0.5
[Training] [2023-03-10T06:21:17.130303] ]
[Training] [2023-03-10T06:21:17.132460] eval:[
[Training] [2023-03-10T06:21:17.134660] pure: True
[Training] [2023-03-10T06:21:17.136853] output_state: gen
[Training] [2023-03-10T06:21:17.138946] ]
[Training] [2023-03-10T06:21:17.141106] logger:[
[Training] [2023-03-10T06:21:17.143189] print_freq: 4
[Training] [2023-03-10T06:21:17.145360] save_checkpoint_freq: 4
[Training] [2023-03-10T06:21:17.147518] visuals: ['gen', 'mel']
[Training] [2023-03-10T06:21:17.149699] visual_debug_rate: 4
[Training] [2023-03-10T06:21:17.151744] is_mel_spectrogram: True
[Training] [2023-03-10T06:21:17.153917] ]
[Training] [2023-03-10T06:21:17.155993] is_train: True
[Training] [2023-03-10T06:21:17.158158] dist: False
[Training] [2023-03-10T06:21:17.160271]
[Training] [2023-03-10T06:21:17.162477] ./train.sh: line 12: 36088 Segmentation fault (core dumped) python3 ./src/train.py -opt "$CONFIG"
I installed using setup-cuda, then ran update and update-force. What am I doing wrong?
I wouldn't have a clue. I haven't run into any segfaults from Python, so I can't debug it.
However, I see gen_lr_steps: [0, 0, 0, 0], where it shouldn't be that. I doubt it's the culprit, but you should change your learning rate schedule to something sensible.
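For example, with MultiStepLR the gen_lr_steps list is just increasing iteration milestones at which the learning rate gets multiplied by lr_gamma, so a sane schedule block would look more like this (the milestone values here are purely illustrative, not the project's defaults):

train:
  niter: 4824
  default_lr_scheme: MultiStepLR
  # decay the LR by lr_gamma (0.5 here) at each of these iteration counts
  gen_lr_steps: [500, 1000, 2000, 4000]
  lr_gamma: 0.5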
Here's what I tried so far:
Do not sudo apt remove python3; this will brick your OS.
Set gen_lr_steps to default values.
I used gdb to debug python, and got this:
Which led me to suspect bitsandbytes. I installed several different versions of it; no difference.
I tried installing CUDA 12.0 and 11.7.1; no difference.
I checked out a commit from 3 days ago when it was working for me, recreated the training dataset, same deal.
New audio file, new dataset.
bitsandbytes off. fp16. whisper/whisperx. multistep / cos annealing.
Reinstalled again, and noticed I'm getting
Manually installed pyannote:
Manually installed vector-quantize-pytorch:
This is probably unrelated.
python 3.10.7 -> 3.10.9.
Installed pyannote-audio-1.1.2, which seems to resolve the conflicts.
Nvidia 530
Maybe one of the pip packages is to blame?
Anyone else experiencing this?
How do I call train.sh manually?
I used ./train.sh ./training/1/train.yaml
This is the output:
Commented out
shutil.copy(opt_path, os.path.join(opt['path']['experiments_root'], f'{datetime.now().strftime("%d%m%Y_%H%M%S")}_{os.path.basename(opt_path)}'))
in dlas/codes/train.py, and then I still get the segfault in the terminal.
Is my Python install borked somehow? It feels like I've tried enough things that more people would be complaining if there were a common cause. Perhaps I'll reinstall the OS again.
Oh nice, you were able to get me a gdb trace. I'll thumb through it when I get a chance, but for now I'll give some brief help until I can:
Yeah, that red block is just from whisperx having shit dependencies that mess with everything else. einops getting set to the right version makes PIP scream despite it working fine for whisperx.
Have you tried running it with it uninstalled? desu I need to revalidate that disabling bitsandbytes in the settings actually makes it train without bitsandbytes. You could also set the relevant environment variables, as referenced in ./src/train.py, which should work, but I don't trust that it's still able to be overridden like that.
The way to invoke the train script is ./train.sh 1 ./training/1/train.yaml. desu the extra GPU-count argument is an artifact from when Run Training handles passing the GPU count, as it will appropriately spawn the necessary torchrun processes per GPU, but I can do better and clean it up.
I'd try running without BitsAndBytes, although I'm not sure how that would suddenly be the problem. As I mentioned, I'll thumb through the stack trace when I get a moment.
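For reference, the flag in question is already in your config dump; a hand-edit of the generated YAML would look like this (assuming the trainer still honors it):

# flags already present in the generated ./training/dennis/train.yaml
fp16: false          # unchanged from the dump above
bitsandbytes: false  # flip this from true to attempt a run without bitsandbytes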
Can confirm same error without bitsandbytes installed.
I've reinstalled the OS on a formatted drive, so it's not anything screwy with my installation. I could try older graphics drivers and CUDA versions, but with no leads and me apparently being the only one with this problem, it's hard to say.
Tried Firefox and Chromium, and now also running it through the terminal.
Actually got quite far by uninstalling all pip packages and installing them one by one each time it complained about a missing package, and I did get a few extra lines in before I called it quits. Usually I get to
And installing packages one by one, I saw the line about the seed number and 2-3 lines after it before it complained about missing pip packages (I got to the point where it asked for transformer_x, which I installed, but it still wasn't happy with it). Not sure if that will lead to something or is merely a phantom of me finagling with the code to force it through.
Went ahead and commented out lines in train.yaml until it no longer segfaulted. Here is the minimal yaml that leads to the segfault:
All the referenced files/dirs turn out to be unnecessary too. Turns out you only really need 10 lines of text to train a model, or at least to segfault on one.
In train.yaml, changing use_tb_logger: true -> false fixes the segfault and allows training (as long as gradio doesn't set use_tb_logger back to true).
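That is, the single top-level flag from the config dump above, edited by hand after the YAML is generated:

# ./training/dennis/train.yaml
use_tb_logger: false  # true triggers the tensorboard-related segfault on this setup; false lets training run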
Busy all morning, forgot to follow up. The stack trace didn't prove very useful, as I forgot that it won't really expose anything from Python itself, oops.
How strange that it's related to tensorboard. For now I'll probably just add a bandaid option to disable logging and fall back to parsing the text output again, then later figure out if there's anything specific that causes the logging to fail (or at least modify the logging routine to just spit out JSON instead of using a seemingly proprietary binary format).
I'll leave this open until I do the above.
Alright, I overhauled how metrics are relayed and stored. Tensorboard logging is no longer used, so no more segfaults.
I still need to do more validation to make sure nothing breaks, but it should be fine from my cursory tests.
Incidentally, I've found that TensorBoard has caused a similar issue that makes this VALL-E implementation segfault as well, although the proposed fix didn't resolve it there. I just found it a little coincidental.
Anyways, DLAS and the web UI no longer use TensorBoard for metrics, so it shouldn't segfault from it.