Memory Leak #61

Closed
opened 2023-03-05 11:04:40 +00:00 by Xiao · 8 comments

There is a typo in `do_gc` that causes the exception block to always run, but the error is not printed, so it was never discovered. The exception says that the symbol `trytorch` could not be found.

https://git.ecker.tech/mrq/ai-voice-cloning/src/branch/master/src/utils.py#L1292

As a result, garbage is never collected, and memory only ever increases, until you eventually OOM.

For example: My 6GB GPU goes from 0.3GB to about 3.5GB after a quick short generation, and if I try another one, I OOM at about 5.8GB, then any subsequent generations instantly OOM. Watching my memory graph on Task Manager reveals that it stays flat at 5.8GB indefinitely, and "Reload TTS" and all other options don't fix it. Only killing the process fixes it.
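For illustration, the failure mode probably looks something like this (a rough sketch only, not the actual contents of `utils.py`; the real function may be structured differently):

```
import gc
import torch

def do_gc():
    """Force a garbage-collection pass and release cached CUDA memory."""
    try:
        # Typo: 'trytorch' is not defined, so a NameError is raised here
        # and nothing below it ever runs.
        trytorch.cuda.empty_cache()
        gc.collect()
    except Exception:
        # The error is swallowed without being printed, which is why the
        # bug went unnoticed and garbage was never collected.
        pass
```

With the typo corrected to `torch`, the `NameError` disappears and the collection actually runs.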

Owner

Haha, I wonder if that was the whole reason it never seemed to do anything. Fixed in cd8702ab0d.

desu, I don't expect it to actually make a difference, since I added it specifically to make the TTS model leave VRAM when I unload it, without any invasive measures for training. I do remember there being a problem on my 2060 before forcing GC to run, but I don't know if it went away or not.

I'll leave this open in the event it does persist.
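For reference, the non-invasive unload sequence that `do_gc` is meant to support usually boils down to something like this (a sketch; `tts_model` and `unload_tts_model` are placeholder names, not the project's actual globals or API):

```
import gc
import torch

# Placeholder names -- the project's actual globals/attributes may differ.
tts_model = None  # whatever object currently holds the TTS weights

def unload_tts_model():
    """Drop the reference to the loaded model and hand its VRAM back."""
    global tts_model
    tts_model = None          # drop the (hopefully only) reference to the model
    gc.collect()              # collect it, plus any reference cycles, right now
    torch.cuda.empty_cache()  # return the freed cached blocks to the driver
```

If anything else still holds a reference to the model (a cache, a closure, a callback), the collect does nothing for it, which would fit the lingering memory described below.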

Owner

Well, I suppose it's better than nothing. At least, in my original scope of "I want TTS to unload when I train", I guess it does, but there's this lingering 500MiB I'm not sure what is for. It jumped up 200MiB whenever I loaded the TTS model again before unloading it for training. ~~Might be VoiceFixer not actually clearing (which makes sense, as I have it unload then load for TTS to not OOM during generation).~~ It is not VoiceFixer; I disabled it and I'm still getting a leak. Wonder what it could be if GC isn't catching it.

![image](/attachments/c6746f97-d71d-40a2-975d-2b9c6b8691b9)

Author

I get the feeling different builds of torch that people are using might GC by themselves more or less eagerly than others. Maybe the installation of dependencies gives us different versions based on the specific hardware or Python version that pip detects. I am on Python 3.8.6.

If it wasn't environmental, I feel like lots more people would have complained/not been able to use the app.

Anyway, in my environment, changing `trytorch` to `torch` allowed me to run things pretty liberally.


Spent some time trying to track the leak; I believe there's more than one.
Under `ai-voice-cloning/tortoise-tts/tortoise/api.py`, in `load_autoregressive_model`:

```
if hasattr(self, 'autoregressive'):
    del self.autoregressive

self.autoregressive = UnifiedVoice(max_mel_tokens=604, max_text_tokens=402, max_conditioning_inputs=2, layers=30,
                                   model_dim=1024,
                                   heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False,
                                   train_solo_embeddings=False).cpu().eval()
```

Loading an autoregressive model other than the one the program was started with adds an additional ~1.7 GB to RAM, the size of the autoregressive model file. I tried calling garbage collection after the `del` statement, but it only drops 200 MB, which gets filled again immediately after. Perhaps there's another reference to the originally loaded object somewhere, preventing GC from doing its thing, but I've been unable to find it so far.
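One way to hunt for such a reference (a hypothetical debugging helper, not part of the codebase) is to ask the garbage collector directly, just before the `del`:

```
import gc
import sys

def dump_referrers(obj, label="autoregressive"):
    """Print everything that still holds a reference to `obj`."""
    print(f"{label}: refcount={sys.getrefcount(obj)}")
    for ref in gc.get_referrers(obj):
        print(f"  referenced by {type(ref).__name__}")
```

Calling `dump_referrers(self.autoregressive)` right before the `del`: if anything besides `self`'s `__dict__` (and the call frame itself) shows up, that referrer is what keeps the old ~1.7 GB alive.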


CUDA can keep things cached, so I have `torch.cuda.empty_cache()` added to `get_device`, so it triggers every time the TTS system is reloaded. Supposedly stuff can remain cached, which can cause weird OOMs; this edit has made my system a lot more responsive and I've been getting fewer OOMs and crashes. Not sure if it's the issue, but it does help somewhat.

```
def get_device(verbose=False):
    name = get_device_name()

    if verbose:
        if name == 'cpu':
            print("No hardware acceleration is available, falling back to CPU...")
        else:
            print(f"Hardware acceleration found: {name}")

    if name == "dml":
        import torch_directml
        return torch_directml.device()

    if name == 'cuda':
        torch.cuda.empty_cache()

    return torch.device(name)
```
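A quick way to tell cached-but-free memory apart from memory held by live tensors (a small diagnostic sketch, assuming a CUDA build of torch; `report_vram` is not an existing helper in the repo):

```
import torch

def report_vram(tag=""):
    """Print VRAM held by live tensors vs. reserved by the caching allocator."""
    allocated = torch.cuda.memory_allocated() / 1024**2  # MiB in use by tensors
    reserved = torch.cuda.memory_reserved() / 1024**2    # MiB held by the allocator
    print(f"[{tag}] allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")
```

`empty_cache()` only returns the reserved-but-unallocated portion; if `allocated` itself keeps growing across generations, something is still referencing the tensors and it's a genuine leak rather than caching.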
> ```
> torch.cuda.empty_cache()
> ```

Just chiming in to say that appending this sped up my setup considerably, thanks a bunch!

Owner

Forgot to mention: the cache clearing has been implemented in mrq/tortoise-tts in commit https://git.ecker.tech/mrq/tortoise-tts/commit/cc36c0997c8711889ef8028002fc9e41abd5c5f0, on grabbing the device name, just to be safe.

I'll need to validate it myself if it makes a difference.

Owner

Figured I'd mention it here rather than make a new issue, as I feel like it'd be weird for me to open an issue for an actual issue for once.

My autism of constantly checking my GPU metrics while inferencing means I noticed these VRAM spikes every time it does the "compare against the CLVP" pass. I'm not too sure what it could be, outside of it duplicating the AR sample tensors.

I'll have to look for it in the morning, as it's nearing 1AM and I should start to wind down and throw something to train against.

![image](/attachments/0d360903-b630-44d8-b455-1017fd7a7595)
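One way to pin the spike down would be to measure the peak around the suspect step (a sketch; `clvp_pass(samples)` stands in for whatever call does the comparison and is not a real function in the codebase):

```
import torch

torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()

with torch.no_grad():            # make sure no autograd graph is kept alive
    scores = clvp_pass(samples)  # stand-in for the "compare against the CLVP" pass

peak = torch.cuda.max_memory_allocated()
print(f"CLVP pass transient: {(peak - before) / 1024**2:.0f} MiB above baseline")
```

If the transient roughly matches the size of the AR sample batch, the duplicated-tensor theory would check out.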

mrq closed this issue 2023-03-13 17:45:03 +00:00