Add persistent_workers options in DataLoader to make training faster by removing pauses between epochs. #42

Closed
opened 2023-02-26 14:06:14 +00:00 by gannybal · 1 comment

A slight but substantial issue I've had in training is the delay between epochs, with the GPU basically doing nothing in between.
I had the same issue with LoRA training in Stable Diffusion using kohya sd-scripts: https://github.com/kohya-ss/sd-scripts

That problem was solved by adding one parameter to a single line:
https://github.com/kohya-ss/sd-scripts/pull/140

So, by adding

```
persistent_workers=True
```

to line 21 in `dlas\codes\data\__init__.py`

```
return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers, sampler=sampler, drop_last=True, pin_memory=pin_memory, collate_fn=collate_fn)
```

resulting in

```
return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers, sampler=sampler, drop_last=True, pin_memory=pin_memory, collate_fn=collate_fn, persistent_workers=True)
```

I tried applying the same change here, and I think it solves the problem and speeds up training. With large batch sizes this could accelerate training several times over (about 4x in my test).
It might need someone to test it out on their machine.
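For anyone wanting to verify the behavior locally, here is a minimal self-contained sketch (not the repo's actual code; the dataset and `build_loader` helper are hypothetical) showing the `persistent_workers` option. One caveat worth noting: PyTorch requires `num_workers > 0` when `persistent_workers=True`, otherwise `DataLoader` raises a `ValueError`, so guarding on the worker count keeps the helper safe for CPU-only or single-process configs.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RangeDataset(Dataset):
    """Trivial dataset that just returns its index."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        return i

def build_loader(dataset, batch_size, num_workers):
    # persistent_workers keeps the worker processes alive across epochs,
    # avoiding the fork/spawn cost at the start of every epoch.
    # It is only legal when num_workers > 0, hence the guard below.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        drop_last=True,
        persistent_workers=num_workers > 0,
    )

loader = build_loader(RangeDataset(8), batch_size=2, num_workers=2)
for epoch in range(2):  # the same workers serve both epochs
    seen = sorted(x for batch in loader for x in batch.tolist())
    assert seen == list(range(8))
```

With `persistent_workers=True`, the second epoch starts without re-spawning the two worker processes, which is where the between-epoch pause goes away.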

Owner

Testing it now on a Paperspace instance. It hasn't outright died yet, so I suppose it's safe to push out.


Added in DLAS commit https://git.ecker.tech/mrq/DL-Art-School/commit/71cc43e65cd47c6704d20c99006a3e78feb2400d.

mrq closed this issue 2023-02-26 15:01:40 +00:00
Reference: mrq/ai-voice-cloning#42