apparently I got an error for trying to serialize an errant tensor that made its way into the JSON; this could be remedied easily by recursively traversing the dict and coercing any objects to primitives, but I'm tired and I just want to start training and nap

This commit is contained in:
mrq 2024-05-04 12:33:43 -05:00
parent ffa200eec7
commit 277dcec484
2 changed files with 5 additions and 1 deletion


@@ -64,6 +64,7 @@ If you're interested in creating an HDF5 copy of your dataset, simply invoke: `p
5. Train the model using the following scripts: `python -m vall_e.train yaml=./data/config.yaml`
* If distributing your training (for example, multi-GPU), use `deepspeed --module vall_e.train yaml="./data/config.yaml"`
+ if you're not using the `deepspeed` backend, set `trainer.ddp = True` in the config YAML, then launch with `torchrun --nnodes=1 --nproc-per-node=4 -m vall_e.train yaml="./data/config.yaml"`
You may quit your training any time by just entering `quit` in your CLI. The latest checkpoint will be automatically saved.
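For reference, the `trainer.ddp = True` setting mentioned above would look something like the snippet below in the config YAML; the exact section layout is an assumption, only the `trainer.ddp` key comes from the instructions themselves:

```yaml
# assumed layout; only the trainer.ddp key is taken from the step above
trainer:
  ddp: True
```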


@@ -173,7 +173,10 @@ def train(
     elapsed_time = stats.get("elapsed_time", 0)
-    metrics = json.dumps(stats)
+    try:
+        metrics = json.dumps(stats)
+    except Exception as e:
+        metrics = str(stats)
     if cfg.trainer.no_logger:
         tqdm.write(f"Training Metrics: {truncate_json(metrics)}.")
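For context, the fuller fix the commit message defers (recursively walking the stats dict and coercing anything non-serializable, such as tensors, to primitives) might look like the sketch below; `to_primitives` is a hypothetical helper, not code from this commit:

```python
import json

# Hypothetical helper (not part of this commit): recursively coerce a stats
# dict into JSON-serializable primitives before dumping it.
def to_primitives(obj):
    if isinstance(obj, dict):
        return {k: to_primitives(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_primitives(v) for v in obj]
    # JSON-native values pass through untouched
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    # torch tensors and numpy arrays expose .tolist(), yielding plain numbers/lists
    if hasattr(obj, "tolist"):
        return obj.tolist()
    # last resort: stringify anything else rather than raise
    return str(obj)

# usage inside the training loop would then be:
# metrics = json.dumps(to_primitives(stats))
```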