./train.sh: line 12: 36088 Segmentation fault #107
I'm having this problem; I reinstalled everything several times and it keeps happening. When I go to train, I immediately get this:
To create a public link, set `share=True` in `launch()`.
Loading specialized model for language: en
Loading Whisper model: base.en
Loaded Whisper model
Transcribed file: ./voices/dennis/audio1.wav, 43 found.
Unloaded Whisper
Culled 15 lines
Spawning process: ./train.sh 1 ./training/dennis/train.yaml
[Training] [2023-03-10T06:21:16.798310] 23-03-10 06:21:16.798 - INFO: name: dennis
[Training] [2023-03-10T06:21:16.801605] model: extensibletrainer
[Training] [2023-03-10T06:21:16.804485] scale: 1
[Training] [2023-03-10T06:21:16.807994] gpu_ids: [0]
[Training] [2023-03-10T06:21:16.810449] start_step: 0
[Training] [2023-03-10T06:21:16.812977] checkpointing_enabled: True
[Training] [2023-03-10T06:21:16.815432] fp16: False
[Training] [2023-03-10T06:21:16.817761] bitsandbytes: True
[Training] [2023-03-10T06:21:16.820087] gpus: 1
[Training] [2023-03-10T06:21:16.822301] wandb: False
[Training] [2023-03-10T06:21:16.824671] use_tb_logger: True
[Training] [2023-03-10T06:21:16.827204] datasets:[
[Training] [2023-03-10T06:21:16.829419] train:[
[Training] [2023-03-10T06:21:16.831741] name: training
[Training] [2023-03-10T06:21:16.833977] n_workers: 1
[Training] [2023-03-10T06:21:16.836294] batch_size: 29
[Training] [2023-03-10T06:21:16.838696] mode: paired_voice_audio
[Training] [2023-03-10T06:21:16.841097] path: ./training/dennis/train.txt
[Training] [2023-03-10T06:21:16.843316] fetcher_mode: ['lj']
[Training] [2023-03-10T06:21:16.845753] phase: train
[Training] [2023-03-10T06:21:16.847942] max_wav_length: 255995
[Training] [2023-03-10T06:21:16.850195] max_text_length: 200
[Training] [2023-03-10T06:21:16.852767] sample_rate: 22050
[Training] [2023-03-10T06:21:16.855026] load_conditioning: True
[Training] [2023-03-10T06:21:16.857346] num_conditioning_candidates: 2
[Training] [2023-03-10T06:21:16.859549] conditioning_length: 44000
[Training] [2023-03-10T06:21:16.861870] use_bpe_tokenizer: True
[Training] [2023-03-10T06:21:16.864039] tokenizer_vocab: ./models/tortoise/bpe_lowercase_asr_256.json
[Training] [2023-03-10T06:21:16.866465] load_aligned_codes: False
[Training] [2023-03-10T06:21:16.868817] data_type: img
[Training] [2023-03-10T06:21:16.871223] ]
[Training] [2023-03-10T06:21:16.873617] val:[
[Training] [2023-03-10T06:21:16.875959] name: validation
[Training] [2023-03-10T06:21:16.878214] n_workers: 1
[Training] [2023-03-10T06:21:16.880626] batch_size: 15
[Training] [2023-03-10T06:21:16.883031] mode: paired_voice_audio
[Training] [2023-03-10T06:21:16.885198] path: ./training/dennis/validation.txt
[Training] [2023-03-10T06:21:16.887481] fetcher_mode: ['lj']
[Training] [2023-03-10T06:21:16.889639] phase: val
[Training] [2023-03-10T06:21:16.891812] max_wav_length: 255995
[Training] [2023-03-10T06:21:16.893914] max_text_length: 200
[Training] [2023-03-10T06:21:16.896043] sample_rate: 22050
[Training] [2023-03-10T06:21:16.898170] load_conditioning: True
[Training] [2023-03-10T06:21:16.900341] num_conditioning_candidates: 2
[Training] [2023-03-10T06:21:16.902578] conditioning_length: 44000
[Training] [2023-03-10T06:21:16.904891] use_bpe_tokenizer: True
[Training] [2023-03-10T06:21:16.907171] tokenizer_vocab: ./models/tortoise/bpe_lowercase_asr_256.json
[Training] [2023-03-10T06:21:16.909332] load_aligned_codes: False
[Training] [2023-03-10T06:21:16.911447] data_type: img
[Training] [2023-03-10T06:21:16.913525] ]
[Training] [2023-03-10T06:21:16.915688] ]
[Training] [2023-03-10T06:21:16.917800] steps:[
[Training] [2023-03-10T06:21:16.919915] gpt_train:[
[Training] [2023-03-10T06:21:16.922000] training: gpt
[Training] [2023-03-10T06:21:16.924152] loss_log_buffer: 500
[Training] [2023-03-10T06:21:16.926196] optimizer: adamw
[Training] [2023-03-10T06:21:16.928294] optimizer_params:[
[Training] [2023-03-10T06:21:16.930324] lr: 1e-05
[Training] [2023-03-10T06:21:16.932394] weight_decay: 0.01
[Training] [2023-03-10T06:21:16.934428] beta1: 0.9
[Training] [2023-03-10T06:21:16.936702] beta2: 0.96
[Training] [2023-03-10T06:21:16.938800] ]
[Training] [2023-03-10T06:21:16.940949] clip_grad_eps: 4
[Training] [2023-03-10T06:21:16.943008] injectors:[
[Training] [2023-03-10T06:21:16.945151] paired_to_mel:[
[Training] [2023-03-10T06:21:16.947228] type: torch_mel_spectrogram
[Training] [2023-03-10T06:21:16.949388] mel_norm_file: ./models/tortoise/clips_mel_norms.pth
[Training] [2023-03-10T06:21:16.951473] in: wav
[Training] [2023-03-10T06:21:16.953635] out: paired_mel
[Training] [2023-03-10T06:21:16.955667] ]
[Training] [2023-03-10T06:21:16.957802] paired_cond_to_mel:[
[Training] [2023-03-10T06:21:16.959840] type: for_each
[Training] [2023-03-10T06:21:16.961977] subtype: torch_mel_spectrogram
[Training] [2023-03-10T06:21:16.964008] mel_norm_file: ./models/tortoise/clips_mel_norms.pth
[Training] [2023-03-10T06:21:16.966262] in: conditioning
[Training] [2023-03-10T06:21:16.968320] out: paired_conditioning_mel
[Training] [2023-03-10T06:21:16.970511] ]
[Training] [2023-03-10T06:21:16.972545] to_codes:[
[Training] [2023-03-10T06:21:16.974707] type: discrete_token
[Training] [2023-03-10T06:21:16.976734] in: paired_mel
[Training] [2023-03-10T06:21:16.978893] out: paired_mel_codes
[Training] [2023-03-10T06:21:16.981017] dvae_config: ./models/tortoise/train_diffusion_vocoder_22k_level.yml
[Training] [2023-03-10T06:21:16.983260] ]
[Training] [2023-03-10T06:21:16.985487] paired_fwd_text:[
[Training] [2023-03-10T06:21:16.990201] type: generator
[Training] [2023-03-10T06:21:16.994850] generator: gpt
[Training] [2023-03-10T06:21:16.997032] in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
[Training] [2023-03-10T06:21:16.999271] out: ['loss_text_ce', 'loss_mel_ce', 'logits']
[Training] [2023-03-10T06:21:17.001422] ]
[Training] [2023-03-10T06:21:17.003635] ]
[Training] [2023-03-10T06:21:17.005755] losses:[
[Training] [2023-03-10T06:21:17.007987] text_ce:[
[Training] [2023-03-10T06:21:17.010159] type: direct
[Training] [2023-03-10T06:21:17.012379] weight: 0.01
[Training] [2023-03-10T06:21:17.014535] key: loss_text_ce
[Training] [2023-03-10T06:21:17.016732] ]
[Training] [2023-03-10T06:21:17.018870] mel_ce:[
[Training] [2023-03-10T06:21:17.021230] type: direct
[Training] [2023-03-10T06:21:17.023407] weight: 1
[Training] [2023-03-10T06:21:17.025621] key: loss_mel_ce
[Training] [2023-03-10T06:21:17.027851] ]
[Training] [2023-03-10T06:21:17.030045] ]
[Training] [2023-03-10T06:21:17.032294] ]
[Training] [2023-03-10T06:21:17.034498] ]
[Training] [2023-03-10T06:21:17.036720] networks:[
[Training] [2023-03-10T06:21:17.038879] gpt:[
[Training] [2023-03-10T06:21:17.041189] type: generator
[Training] [2023-03-10T06:21:17.043291] which_model_G: unified_voice2
[Training] [2023-03-10T06:21:17.045538] kwargs:[
[Training] [2023-03-10T06:21:17.047740] layers: 30
[Training] [2023-03-10T06:21:17.049951] model_dim: 1024
[Training] [2023-03-10T06:21:17.052386] heads: 16
[Training] [2023-03-10T06:21:17.054575] max_text_tokens: 402
[Training] [2023-03-10T06:21:17.056737] max_mel_tokens: 604
[Training] [2023-03-10T06:21:17.058974] max_conditioning_inputs: 2
[Training] [2023-03-10T06:21:17.061189] mel_length_compression: 1024
[Training] [2023-03-10T06:21:17.063381] number_text_tokens: 256
[Training] [2023-03-10T06:21:17.065580] number_mel_codes: 8194
[Training] [2023-03-10T06:21:17.067778] start_mel_token: 8192
[Training] [2023-03-10T06:21:17.069990] stop_mel_token: 8193
[Training] [2023-03-10T06:21:17.072242] start_text_token: 255
[Training] [2023-03-10T06:21:17.074523] train_solo_embeddings: False
[Training] [2023-03-10T06:21:17.076700] use_mel_codes_as_input: True
[Training] [2023-03-10T06:21:17.078932] checkpointing: True
[Training] [2023-03-10T06:21:17.081189] tortoise_compat: True
[Training] [2023-03-10T06:21:17.083450] ]
[Training] [2023-03-10T06:21:17.085585] ]
[Training] [2023-03-10T06:21:17.087856] ]
[Training] [2023-03-10T06:21:17.089995] path:[
[Training] [2023-03-10T06:21:17.092198] strict_load: True
[Training] [2023-03-10T06:21:17.094287] pretrain_model_gpt: /home/rodrigez/ai-voice-cloning/models/tortoise/autoregressive.pth
[Training] [2023-03-10T06:21:17.096397] root: ./
[Training] [2023-03-10T06:21:17.098435] experiments_root: ./training/dennis/finetune
[Training] [2023-03-10T06:21:17.100560] models: ./training/dennis/finetune/models
[Training] [2023-03-10T06:21:17.102660] training_state: ./training/dennis/finetune/training_state
[Training] [2023-03-10T06:21:17.104838] log: ./training/dennis/finetune
[Training] [2023-03-10T06:21:17.106915] val_images: ./training/dennis/finetune/val_images
[Training] [2023-03-10T06:21:17.109050] ]
[Training] [2023-03-10T06:21:17.111123] train:[
[Training] [2023-03-10T06:21:17.113243] niter: 4824
[Training] [2023-03-10T06:21:17.115348] warmup_iter: -1
[Training] [2023-03-10T06:21:17.117592] mega_batch_factor: 1
[Training] [2023-03-10T06:21:17.119701] val_freq: 4
[Training] [2023-03-10T06:21:17.121806] ema_enabled: False
[Training] [2023-03-10T06:21:17.123971] default_lr_scheme: MultiStepLR
[Training] [2023-03-10T06:21:17.126093] gen_lr_steps: [0, 0, 0, 0]
[Training] [2023-03-10T06:21:17.128227] lr_gamma: 0.5
[Training] [2023-03-10T06:21:17.130303] ]
[Training] [2023-03-10T06:21:17.132460] eval:[
[Training] [2023-03-10T06:21:17.134660] pure: True
[Training] [2023-03-10T06:21:17.136853] output_state: gen
[Training] [2023-03-10T06:21:17.138946] ]
[Training] [2023-03-10T06:21:17.141106] logger:[
[Training] [2023-03-10T06:21:17.143189] print_freq: 4
[Training] [2023-03-10T06:21:17.145360] save_checkpoint_freq: 4
[Training] [2023-03-10T06:21:17.147518] visuals: ['gen', 'mel']
[Training] [2023-03-10T06:21:17.149699] visual_debug_rate: 4
[Training] [2023-03-10T06:21:17.151744] is_mel_spectrogram: True
[Training] [2023-03-10T06:21:17.153917] ]
[Training] [2023-03-10T06:21:17.155993] is_train: True
[Training] [2023-03-10T06:21:17.158158] dist: False
[Training] [2023-03-10T06:21:17.160271]
[Training] [2023-03-10T06:21:17.162477] ./train.sh: line 12: 36088 Segmentation fault (core dumped) python3 ./src/train.py -opt "$CONFIG"
I installed using setup-cuda, then ran update and update-force. What am I doing wrong?
I wouldn't have a clue. I haven't run into any segfaults from Python, so I can't debug it.
However, I see gen_lr_steps: [0, 0, 0, 0], where it shouldn't be that. I doubt it's the culprit, but you should change your learning rate schedule to something sensible.
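For example, with MultiStepLR the gen_lr_steps list is just increasing iteration milestones at which the learning rate gets multiplied by lr_gamma, so a sane schedule block would look more like this (the milestone values here are purely illustrative, not the project's defaults):

train:
  niter: 4824
  default_lr_scheme: MultiStepLR
  # decay the LR by lr_gamma (0.5 here) at each of these iteration counts
  gen_lr_steps: [500, 1000, 2000, 4000]
  lr_gamma: 0.5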
Here's what I tried so far:
Do not sudo apt remove python3; this will brick your OS.
Set gen_lr_steps to default values.
I used gdb to debug python, and got this:
Which led me to suspect bitsandbytes. I installed several different versions of it; no difference.
I tried installing CUDA 12.0 and 11.7.1; no difference.
I checked out a commit from 3 days ago when it was working for me, recreated the training dataset, same deal.
New audio file, new dataset.
bitsandbytes off. fp16. whisper/whisperx. multistep / cos annealing.
Reinstalled again, and noticed I'm getting
Manually installed pyannote:
Manually installed vector-quantize-pytorch:
This is probably unrelated.
python 3.10.7 -> 3.10.9.
Installed pyannote-audio-1.1.2, which seems to resolve the conflicts.
Nvidia 530
Maybe one of the pip packages is to blame?
Anyone else experiencing this?
How do I call train.sh manually?
I used ./train.sh ./training/1/train.yaml
This is the output:
Commented out
shutil.copy(opt_path, os.path.join(opt['path']['experiments_root'], f'{datetime.now().strftime("%d%m%Y_%H%M%S")}_{os.path.basename(opt_path)}'))
in dlas/codes/train.py, and then I still get the segfault in the terminal.
Is my Python install borked somehow? It feels like I've tried enough things that more people would be complaining if there were a common cause. Perhaps I'll reinstall the OS again.
Oh nice, you were able to get me a gdb trace. I'll thumb through it when I get a chance, but for now I'll give some brief help until I can:
Yeah, that red block is just from whisperx having shit dependencies that mess with everything else. einops getting set to the right version makes PIP scream despite it working fine for whisperx.
Have you tried running it with it uninstalled? desu I need to revalidate that disabling bitsandbytes in the settings actually makes it train without bitsandbytes. You could also set the relevant environment variables, as referenced in ./src/train.py, which should work, but I don't trust that it's still able to be overridden like that.
The way to invoke the train script is ./train.sh 1 ./training/1/train.yaml. desu the extra GPU-count argument is an artifact from when Run Training handles passing the GPU count, as it will appropriately spawn the necessary torchrun processes per GPU, but I can do better and clean it up.
I'd try running without BitsAndBytes, although I'm not sure how that would suddenly be the problem. As I mentioned, I'll thumb through the stack trace when I get a moment.
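For reference, the flag in question is already in your config dump; a hand-edit of the generated YAML would look like this (assuming the trainer still honors it):

# flags already present in the generated ./training/dennis/train.yaml
fp16: false          # unchanged from the dump above
bitsandbytes: false  # flip this from true to attempt a run without bitsandbytes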
Can confirm same error without bitsandbytes installed.
I've reinstalled the OS on a formatted drive, so it's not anything screwy with my installation. I could try older graphics drivers and CUDA versions, but with no leads and me apparently being the only one with this problem, it's hard to say.
Tried Firefox and Chromium, and now also running it through the terminal.
Actually got quite far by uninstalling all pip packages and installing them one by one each time it complained about a missing package, and I did get a few extra lines in before I called it quits. Usually I get to
And installing packages one by one, I saw the line about the seed number and 2-3 lines after it before it complained about missing pip packages (I got to the point where it asked for transformer_x, which I installed, but it still wasn't happy with it). Not sure if that will lead to something or is merely a phantom of me finagling with the code to force it through.
Went ahead and commented out lines in train.yaml until it no longer segfaulted. Here is the minimal yaml that leads to the segfault:
All the referenced files/dirs turn out to be unnecessary too. Turns out you only really need 10 lines of text to train a model, or at least to segfault on one.
In train.yaml, changing use_tb_logger: true -> false fixes the segfault and allows training (as long as gradio doesn't set use_tb_logger back to true).
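That is, the single top-level flag from the config dump above, edited by hand after the YAML is generated:

# ./training/dennis/train.yaml
use_tb_logger: false  # true triggers the tensorboard-related segfault on this setup; false lets training run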
Busy all morning, forgot to follow up. The stack trace didn't prove very useful, as I forgot that it won't really expose anything from Python itself, oops.
How strange that it's related to tensorboard. For now I'll probably just add a bandaid option to disable logging and fall back to parsing the text output again, then later figure out if there's anything specific that causes the logging to fail (or at least modify the logging routine to just spit out JSON instead of using a seemingly proprietary binary format).
I'll leave this open until I do the above.
Alright, I overhauled how metrics are relayed and stored. Tensorboard logging is no longer used, so no more segfaults.
I still need to do more validation to make sure nothing breaks, but it should be fine from my cursory tests.
Incidentally, I've found that TensorBoard has caused a similar issue that makes this VALL-E implementation segfault as well, although the proposed fix didn't resolve it there. I just found it a little coincidental.
Anyways, DLAS and the web UI no longer use TensorBoard for metrics, so it shouldn't segfault from it.