James Betker
da53090ce6
More adjustments to support distributed training with teco & on multi_modal_train
2020-10-27 20:58:03 -06:00
James Betker
2a3eec8fd7
Fix some distributed training snafus
2020-10-27 15:24:05 -06:00
James Betker
15e00e9014
Finish integration with autocast
Note: autocast is broken when also using checkpoint(). Overcome this by modifying
torch's checkpoint() function in place to also use autocast.
2020-10-22 14:39:19 -06:00
James Betker
d7ee14f721
Move to torch.cuda.amp (not working)
Running into OOM errors, needs diagnosing. Checkpointing here.
2020-10-22 13:58:05 -06:00
James Betker
24792bdb4f
Codebase cleanup
Removed a lot of legacy stuff I have no intention of using again.
Plan is to shape this repo into something more extensible (get it? hah!)
2020-10-13 20:56:39 -06:00
James Betker
8014f050ac
Clear metrics properly
Holy cow, what a PITA bug.
2020-10-13 10:07:49 -06:00
James Betker
8197fd646f
Don't accumulate losses for metrics when the loss isn't a tensor
2020-10-03 11:03:55 -06:00
James Betker
39865ca3df
TOTAL_loss, dumbo
2020-10-02 21:06:10 -06:00
James Betker
4e44fcd655
Loss accumulator fix
2020-10-02 20:55:33 -06:00
James Betker
567b4d50a4
ExtensibleTrainer - don't compute backward when there is no loss
2020-10-02 20:54:06 -06:00
James Betker
dc8f3b24de
Don't let duplicate keys be used for injectors and losses
2020-09-29 16:59:44 -06:00
James Betker
f9b83176f1
Fix bugs in ExtensibleTrainer
2020-09-28 22:09:42 -06:00
James Betker
31641d7f63
Add ImagePatchInjector and TranslationalLoss
2020-09-26 21:25:32 -06:00
James Betker
6d0490a0e6
Tecogan implementation work
2020-09-25 16:38:23 -06:00
James Betker
f40beb5460
Add 'before' and 'after' defs to injections, steps and optimizers
2020-09-22 17:03:22 -06:00
James Betker
e9a39bfa14
Recursively detach all outputs, even if they are nested in data structures
2020-09-19 21:47:34 -06:00
James Betker
9a17ade550
Some convenience adjustments to ExtensibleTrainer
2020-09-17 21:05:32 -06:00
James Betker
5b85f891af
Only log the name of the first network in the total_loss training set
2020-09-12 16:07:09 -06:00
James Betker
fb595e72a4
Supporting infrastructure in ExtensibleTrainer to train spsr4
Need to be able to train 2 nets in one step: the backbone will be entirely separate
with its own optimizer (for an extremely low LR).
This functionality was already present, just not implemented correctly.
2020-09-11 22:57:06 -06:00
James Betker
5189b11dac
Add combined dataset for training across multiple datasets
2020-09-11 08:44:06 -06:00
James Betker
3027e6e27d
Enable amp to be disabled
2020-09-09 10:45:59 -06:00
James Betker
c04f244802
More mods
2020-09-08 20:36:27 -06:00
James Betker
e8613041c0
Add novograd optimizer
2020-09-06 17:27:08 -06:00
James Betker
21ae135f23
Allow Novograd to be used as an optimizer
2020-09-05 16:50:13 -06:00
James Betker
0dfd8eaf3b
Support injectors that run in eval only
2020-09-05 07:59:45 -06:00
James Betker
4b4d08bdec
Enable testing in ExtensibleTrainer, fix it in SRGAN_model
Also compute fea loss for this.
2020-08-31 09:41:48 -06:00
James Betker
dffc15184d
More ExtensibleTrainer work
It runs now, just need to debug it to reach performance parity with SRGAN. Sweet.
2020-08-23 17:22:45 -06:00
James Betker
e59e712e39
More ExtensibleTrainer work
2020-08-22 13:08:33 -06:00
James Betker
f40545f235
ExtensibleTrainer work
2020-08-22 08:24:34 -06:00
James Betker
74cdaa2226
Some work on extensible trainer
2020-08-18 08:49:32 -06:00
James Betker
ab04ca1778
Extensible trainer (in progress)
2020-08-12 08:45:23 -06:00