Memory Leak #61
There is a typo in do_gc that causes the exception block to always run, but the error is not printed, so it was never discovered. The exception says that the symbol trytorch could not be found. https://git.ecker.tech/mrq/ai-voice-cloning/src/branch/master/src/utils.py#L1292
As a result, garbage is never collected, and memory only ever increases, until you eventually OOM.
For example: My 6GB GPU goes from 0.3GB to about 3.5GB after a quick short generation, and if I try another one, I OOM at about 5.8GB, then any subsequent generations instantly OOM. Watching my memory graph on Task Manager reveals that it stays flat at 5.8GB indefinitely, and "Reload TTS" and all other options don't fix it. Only killing the process fixes it.
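For illustration, a minimal sketch of the pattern being described (structure and surrounding code are approximate, not the repo's exact function): the typo'd name raises a NameError on every call, the bare except swallows it silently, and collection never happens.

```python
import gc
import torch

def do_gc_buggy():
    # "trytorch" is not defined, so a NameError is raised on every call;
    # the except hides it, and gc.collect() is never reached.
    try:
        trytorch.cuda.empty_cache()
        gc.collect()
    except Exception:
        pass

def do_gc_fixed():
    # Correct module name, and at least surface any error instead of hiding it.
    gc.collect()
    try:
        torch.cuda.empty_cache()
    except Exception as e:
        print(e)
```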
Haha, I wonder if that was the whole reason it never seemed to do anything. Fixed in cd8702ab0d.

desu, I don't expect it to actually make a difference, since I added it specifically to make the TTS model leave VRAM when I unload it, without any invasive measures for training. I do remember there being a problem on my 2060 before forcing GC to run, but I don't know if it went away or not.
I'll leave this open in the event it does persist.
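As a rough sketch of the intent described above (hypothetical names, not the actual unload code): drop the reference to the model, then force GC and a CUDA cache flush so the weights actually leave VRAM.

```python
import gc
import torch

# Assumes `tts` is the only live reference to the loaded TTS model.
tts = None                # drop the reference so the model becomes collectable
gc.collect()              # reclaim the Python-side objects
torch.cuda.empty_cache()  # hand cached CUDA blocks back to the driver
```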
Well, I suppose it's better than nothing. At least, in my original scope of "I want TTS to unload when I train", I guess it does, but there's this lingering 500MiB I'm not sure what it's for. It jumped up 200MiB whenever I loaded the TTS model again before unloading it for training.
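To pin down what that lingering memory is, a diagnostic sketch like this distinguishes live tensor allocations from blocks the allocator is merely caching; note the CUDA context itself also holds a few hundred MiB that neither counter reports and that only shows up in nvidia-smi.

```python
import torch

def report_vram(tag: str = ""):
    # Live tensors vs. allocator cache; anything beyond these two numbers
    # in nvidia-smi is driver/context overhead.
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated: {alloc:.0f} MiB, reserved: {reserved:.0f} MiB")

report_vram("after unload")
# torch.cuda.memory_summary() gives a full per-pool breakdown if needed.
```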
Might be VoiceFixer not actually clearing (which makes sense, as I have it unload then load for TTS to not OOM during generation).

It is not VoiceFixer; I disabled it and I'm still getting a leak. Wonder what it could be if GC isn't catching it.

I get the feeling different builds of torch that people are using might GC by themselves more or less eagerly than others. Maybe the installation of dependencies gives us different versions based on our specific hardware or Python versions that pip detects. I am on Python 3.8.6.
If it wasn't environmental, I feel like lots more people would have complained/not been able to use the app.
Anyway, in my environment, changing trytorch to torch allowed me to run things pretty liberally.
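For comparing environments, a quick version dump like this (nothing project-specific) makes it easier to see whether people are on different torch/CUDA/Python builds:

```python
import sys
import torch

print("python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
```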
Spent some time trying to track the leak; I believe there's more than one.

Under ai-voice-cloning/tortoise-tts/tortoise/api.py, in load_autoregressive_model: loading a different autoregressive model than the one the program was started with adds an additional ~1.7 GB to RAM, the size of the autoregressive model file. I tried calling garbage collection after the del statement, but it only drops 200 MB, which gets filled again immediately after. Perhaps there's another reference to the originally loaded object somewhere preventing GC from doing its thing, but I've been unable to find it so far.
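One way to hunt for that extra reference (a generic debugging sketch, not code from the repo) is to ask the garbage collector who still points at the old model before deleting it:

```python
import gc
import sys

def show_referrers(obj, limit: int = 10):
    # sys.getrefcount includes the temporary reference created by this call,
    # so anything above 2 means something else is still holding the object.
    print("refcount:", sys.getrefcount(obj))
    for ref in gc.get_referrers(obj)[:limit]:
        print(type(ref), repr(ref)[:100])

# e.g. show_referrers(old_autoregressive_model) right before the del statement
```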
CUDA can keep things cached. I have torch.cuda.empty_cache() added to get_device so it triggers every time the TTS system is reloaded. Supposedly stuff can remain cached, which can cause weird OOMs; it's made my system a lot more responsive and I've been getting fewer OOMs and crashes with this edit. Not sure if it's the issue, but it does help somewhat.
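Roughly, the edit being described looks like this (illustrative body only; the real get_device does more than this):

```python
import torch

def get_device():
    # Flush the CUDA caching allocator whenever the device is (re)resolved,
    # i.e. every time the TTS system is reloaded.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return torch.device("cuda")
    return torch.device("cpu")
```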
Just chiming in to say that appending this sped up my setup considerably, thanks a bunch!
Forgot to mention the cache clearing has been implemented in mrq/tortoise-tts in commit cc36c0997c, on grabbing the name, just to be safe. I'll need to validate it myself to see if it makes a difference.
Figured I'd mention it here rather than make a new issue, as I feel like it'd be weird for me to open an issue for an actual issue for once.
My autism with constantly checking my GPU metrics when inferencing noticed these VRAM spikes every time it does the "compare against the CLVP" pass. I'm not too sure what it could be outside of it duplicating the AR sample tensors.
I'll have to look for it in the morning, as it's nearing 1AM and I should start to wind down and throw something to train against.
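For anyone who wants to measure that CLVP spike rather than eyeball it, a peak-memory probe along these lines works (hypothetical placement; wrap it around whatever the CLVP scoring call is named in the code):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the CLVP comparison pass here ...
peak = torch.cuda.max_memory_allocated() / 2**20
print(f"peak VRAM during CLVP pass: {peak:.0f} MiB")
```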