Add persistent_workers options in DataLoader to make training faster by removing pauses between epochs. #42

Closed
opened 2023-02-26 14:06:14 +00:00 by gannybal · 1 comment

A slight but substantial issue I've had in training is the delay between epochs, with the GPU basically doing nothing in between.
I had the same issue with LoRA training in Stable Diffusion using kohya sd-scripts: https://github.com/kohya-ss/sd-scripts

That problem was solved by adding one parameter to a single line:
https://github.com/kohya-ss/sd-scripts/pull/140

So, by adding

```
persistent_workers=True
```

to line 21 in `dlas\codes\data\__init__.py`

```
return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers, sampler=sampler, drop_last=True, pin_memory=pin_memory, collate_fn=collate_fn)
```

resulting in

```
return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers, sampler=sampler, drop_last=True, pin_memory=pin_memory, collate_fn=collate_fn, persistent_workers=True)
```

I tried applying the same change here, and I think it solves the problem and speeds up training. With large batch sizes this could accelerate training several times over (about 4x in my test).
It might need someone to test it out on their machine.
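For anyone wanting to verify the behavior locally, here is a minimal self-contained sketch (not the repo's actual code; the dataset and `build_loader` helper are hypothetical) showing the `persistent_workers` option. One caveat worth noting: PyTorch requires `num_workers > 0` when `persistent_workers=True`, otherwise `DataLoader` raises a `ValueError`, so guarding on the worker count keeps the helper safe for CPU-only or single-process configs.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RangeDataset(Dataset):
    """Trivial dataset that just returns its index."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        return i

def build_loader(dataset, batch_size, num_workers):
    # persistent_workers keeps the worker processes alive across epochs,
    # avoiding the fork/spawn cost at the start of every epoch.
    # It is only legal when num_workers > 0, hence the guard below.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        drop_last=True,
        persistent_workers=num_workers > 0,
    )

loader = build_loader(RangeDataset(8), batch_size=2, num_workers=2)
for epoch in range(2):  # the same workers serve both epochs
    seen = sorted(x for batch in loader for x in batch.tolist())
    assert seen == list(range(8))
```

With `persistent_workers=True`, the second epoch starts without re-spawning the two worker processes, which is where the between-epoch pause goes away.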

Owner

Testing it now on a Paperspace instance. It hasn't outright died yet, so I suppose it's safe to push out.


Added in DLAS commit https://git.ecker.tech/mrq/DL-Art-School/commit/71cc43e65cd47c6704d20c99006a3e78feb2400d.

mrq closed this issue 2023-02-26 15:01:40 +00:00
Reference: mrq/ai-voice-cloning#42