apparently I got an error for trying to serialize an errant tensor that made its way into the JSON; this could be remedied easily by recursively traversing the dict and coercing any objects to primitives, but I'm tired and I just want to start training and nap

This commit is contained in:
mrq 2024-05-04 12:33:43 -05:00
parent ffa200eec7
commit 277dcec484
2 changed files with 5 additions and 1 deletion


@@ -64,6 +64,7 @@ If you're interested in creating an HDF5 copy of your dataset, simply invoke: `p
5. Train the model using the following scripts: `python -m vall_e.train yaml=./data/config.yaml`
* If distributing your training (for example, multi-GPU), use `deepspeed --module vall_e.train yaml="./data/config.yaml"`
+ if you're not using the `deepspeed` backend, set `trainer.ddp = True` in the config YAML, then launch with `torchrun --nnodes=1 --nproc-per-node=4 -m vall_e.train yaml="./data/config.yaml"`
You may quit your training any time by just entering `quit` in your CLI. The latest checkpoint will be automatically saved.
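For reference, the `trainer.ddp = True` setting mentioned above would look something like the snippet below in the config YAML; the exact section layout is an assumption, only the `trainer.ddp` key comes from the instructions themselves:

```yaml
# assumed layout; only the trainer.ddp key is taken from the step above
trainer:
  ddp: True
```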


@@ -173,7 +173,10 @@ def train(
     elapsed_time = stats.get("elapsed_time", 0)
-    metrics = json.dumps(stats)
+    try:
+        metrics = json.dumps(stats)
+    except Exception as e:
+        metrics = str(stats)
     if cfg.trainer.no_logger:
         tqdm.write(f"Training Metrics: {truncate_json(metrics)}.")
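For context, the fuller fix the commit message defers (recursively walking the stats dict and coercing anything non-serializable, such as tensors, to primitives) might look like the sketch below; `to_primitives` is a hypothetical helper, not code from this commit:

```python
import json

# Hypothetical helper (not part of this commit): recursively coerce a stats
# dict into JSON-serializable primitives before dumping it.
def to_primitives(obj):
    if isinstance(obj, dict):
        return {k: to_primitives(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_primitives(v) for v in obj]
    # JSON-native values pass through untouched
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    # torch tensors and numpy arrays expose .tolist(), yielding plain numbers/lists
    if hasattr(obj, "tolist"):
        return obj.tolist()
    # last resort: stringify anything else rather than raise
    return str(obj)

# usage inside the training loop would then be:
# metrics = json.dumps(to_primitives(stats))
```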