James Betker
44a19cd37c
ExtensibleTrainer mods to support advanced checkpointing for stylegan2
...
Basically: stylegan2 makes use of gradient-based normalizers, which
prevent me from using gradient checkpointing. But I love gradient
checkpointing. It makes things really, really fast and memory conscious.
So: skip checkpointing only when we run the regularizer loss. This is a
bit messy, but speeds up training by at least 20%.
Also, PyTorch: please make checkpointing a first-class citizen.
2020-11-12 15:45:07 -07:00
James Betker
db9e9e28a0
Fix an issue where GPU0 was always being used in non-DDP mode
...
Frankly, I don't understand how this has ever worked. WTF.
2020-11-12 15:43:01 -07:00
James Betker
88f349bdf1
Enable usage of wandb
2020-11-11 21:48:56 -07:00
James Betker
b742d1e5a5
When skipping steps via "every", still run nontrainable injection points
2020-11-10 16:09:17 -07:00
James Betker
df47d6cbbb
More work in support of training flow networks in tandem with generators
2020-11-04 18:07:48 -07:00
James Betker
74738489b9
Fixes and additional support for progressive zoom
2020-10-30 09:59:54 -06:00
James Betker
a3918fa808
Tecogan & other fixes
2020-10-30 00:19:58 -06:00
James Betker
da53090ce6
More adjustments to support distributed training with teco & on multi_modal_train
2020-10-27 20:58:03 -06:00
James Betker
d7ee14f721
Move to torch.cuda.amp (not working)
...
Running into OOM errors, needs diagnosing. Checkpointing here.
2020-10-22 13:58:05 -06:00
James Betker
3e3d2af1f3
Add multi-modal trainer
2020-10-22 13:27:32 -06:00
James Betker
680d635420
Enable ExtensibleTrainer to skip steps when state keys are missing
2020-10-21 22:22:28 -06:00
James Betker
3c6e600e48
Add capacity for models to self-report visuals
2020-10-21 11:08:03 -06:00
James Betker
981d64413b
Support validation over a custom injector
...
Also re-enable PSNR
2020-10-19 11:01:56 -06:00
James Betker
d1c63ae339
Go back to torch's DDP
...
Apex was having some weird crashing issues.
2020-10-16 20:47:35 -06:00
James Betker
e785029936
Mods needed to support SPSR archs with teco gan
2020-10-10 22:39:55 -06:00
James Betker
7e777ea34c
Allow tecogan to be used in process_video
2020-10-09 19:21:43 -06:00
James Betker
1eb516d686
Fix more distributed bugs
2020-10-08 14:32:45 -06:00
James Betker
fba29d7dcc
Move to apex distributeddataparallel and add switch all_reduce
...
Torch's DistributedDataParallel is missing apex's "delay_allreduce" option,
which is necessary to get gradient checkpointing to work with recurrent models.
2020-10-08 11:20:05 -06:00
James Betker
c96f5b2686
Import switched_conv as a submodule
2020-10-07 23:10:54 -06:00
James Betker
8a7e993aea
Merge remote-tracking branch 'origin/gan_lab' into gan_lab
2020-10-06 20:41:58 -06:00
James Betker
1e415b249b
Add tag that can be applied to prevent parameter training
2020-10-06 20:39:49 -06:00
James Betker
cffc596141
Integrate flownet2 into codebase, add teco visual debugs
2020-10-06 20:35:39 -06:00
James Betker
3561cc164d
Fix up fea_loss calculator (for validation)
...
Not sure how this was working in regular training mode, but it
was failing in DDP.
2020-10-03 11:19:20 -06:00
James Betker
6c9718ad64
Don't log when not on rank 0
2020-10-03 11:14:13 -06:00
James Betker
922b1d76df
Don't record visuals when not on rank 0
2020-10-03 11:10:03 -06:00
James Betker
7986185fcb
Change 'mod_step' to 'every'
2020-10-01 11:28:06 -06:00
James Betker
05963157c1
Several things
...
- Fixes to 'after' and 'before' defs for steps (turns out they weren't working)
- Feature nets take in a list of layers to extract. Not fully implemented yet.
- Fixes bugs with RAGAN
- Adds a param that allows the real input into the generator GAN to not be detached
2020-09-23 11:56:36 -06:00
James Betker
f40beb5460
Add 'before' and 'after' defs to injections, steps and optimizers
2020-09-22 17:03:22 -06:00
James Betker
e2a146abc7
Add in experiments hook
2020-09-19 10:05:25 -06:00
James Betker
9a17ade550
Some convenience adjustments to ExtensibleTrainer
2020-09-17 21:05:32 -06:00
James Betker
df59d6c99d
More spsr3 mods
...
- Most branches get their own noise vector now.
- First attention branch has the intended sole purpose of raw image processing
- Remove norms from joiner block
2020-09-09 16:46:38 -06:00
James Betker
3027e6e27d
Enable amp to be disabled
2020-09-09 10:45:59 -06:00
James Betker
e6207d4c50
SPSR3 work
...
SPSR3 is meant to fix whatever is causing the switching units
inside of the newer SPSR architectures to fail and basically
not use the multiplexers.
2020-09-08 15:14:23 -06:00
James Betker
f43df7f5f7
Make ExtensibleTrainer compatible with process_video
2020-09-08 08:03:41 -06:00
James Betker
b1238d29cb
Fix trainable not applying to discriminators
2020-09-05 20:31:26 -06:00
James Betker
0dfd8eaf3b
Support injectors that run in eval only
2020-09-05 07:59:45 -06:00
James Betker
6657a406ac
Mods needed to support training a corruptor again:
...
- Allow original SPSRNet to have a specifiable block increment
- Cleanup
- Bug fixes in code that hasn't been touched in a while.
2020-09-04 15:33:39 -06:00
James Betker
886d59d5df
Misc fixes & adjustments
2020-09-01 07:58:11 -06:00
James Betker
4b4d08bdec
Enable testing in ExtensibleTrainer, fix it in SRGAN_model
...
Also compute fea loss for this.
2020-08-31 09:41:48 -06:00
James Betker
f35b3ad28f
Fix val behavior for ExtensibleTrainer
2020-08-26 08:44:22 -06:00
James Betker
a65b07607c
Reference network
2020-08-25 11:56:59 -06:00
James Betker
dffc15184d
More ExtensibleTrainer work
...
It runs now, just need to debug it to reach performance parity with SRGAN. Sweet.
2020-08-23 17:22:45 -06:00
James Betker
e59e712e39
More ExtensibleTrainer work
2020-08-22 13:08:33 -06:00
James Betker
f40545f235
ExtensibleTrainer work
2020-08-22 08:24:34 -06:00
James Betker
74cdaa2226
Some work on extensible trainer
2020-08-18 08:49:32 -06:00
James Betker
ab04ca1778
Extensible trainer (in progress)
2020-08-12 08:45:23 -06:00