Apparently I got an error from trying to serialize an errant tensor that made its way into the JSON. This could be remedied easily by recursively traversing the dict and coercing any objects to primitives, but I'm tired and I just want to start training and nap.
parent ffa200eec7
commit 277dcec484
@@ -64,6 +64,7 @@ If you're interested in creating an HDF5 copy of your dataset, simply invoke: `p
 5. Train the model using the following scripts: `python -m vall_e.train yaml=./data/config.yaml`
 * If distributing your training (for example, multi-GPU), use `deepspeed --module vall_e.train yaml="./data/config.yaml"`
+  + if you're not using the `deepspeed` backend, set `trainer.ddp = True` in the config YAML, then launch with `torchrun --nnodes=1 --nproc-per-node=4 -m vall_e.train yaml="./data/config.yaml"`

 You may quit your training any time by just entering `quit` in your CLI. The latest checkpoint will be automatically saved.
@@ -173,7 +173,10 @@ def train(
 elapsed_time = stats.get("elapsed_time", 0)
-metrics = json.dumps(stats)
+try:
+    metrics = json.dumps(stats)
+except Exception as e:
+    metrics = str(stats)

 if cfg.trainer.no_logger:
     tqdm.write(f"Training Metrics: {truncate_json(metrics)}.")
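
The recursive fix mentioned in the commit message (traverse the dict and coerce anything non-serializable to a primitive) could look roughly like the sketch below. It is not what this commit does, which only falls back to `str(stats)`. The helper name `to_primitive` is hypothetical, and so is `FakeTensor`, a stand-in that lets the example run without `torch` installed. The sketch assumes tensors can be detected by their `.tolist()` method, which `torch.Tensor` and NumPy arrays both provide.

```python
import json

def to_primitive(obj):
    """Recursively coerce obj into something json.dumps accepts (hypothetical helper)."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if isinstance(obj, dict):
        # Keep primitive keys as they are, stringify anything else.
        return {
            (k if k is None or isinstance(k, (bool, int, float, str)) else str(k)): to_primitive(v)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple, set)):
        return [to_primitive(v) for v in obj]
    if hasattr(obj, "tolist"):
        # Assumption: tensor-like objects expose .tolist(), as torch.Tensor
        # and numpy.ndarray do. The result may be a scalar or a nested list.
        return to_primitive(obj.tolist())
    # Last resort: fall back to the string representation.
    return str(obj)

class FakeTensor:
    """Stand-in for a tensor so this example has no torch dependency."""
    def __init__(self, data):
        self.data = data
    def tolist(self):
        return self.data

stats = {
    "loss": FakeTensor(0.25),
    "lr": 1e-4,
    "nested": {"grad_norm": FakeTensor([1.0, 2.0])},
}
metrics = json.dumps(to_primitive(stats))
```

With something like this in place, `metrics` stays valid JSON even when a tensor sneaks into `stats`, whereas the `str(stats)` fallback produces a Python repr that downstream JSON consumers cannot parse.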