Investigate Resident Memory Increase during Inference #18640
Comments
Thanks for investigating this and providing example code, @ZekunZh! I'm not sure we can remove …
I ran a couple of tests with your script, removing Lightning and only running with the raw PyTorch model: ...
```python
for i in range(N_ITERATIONS):
    torch.cuda.empty_cache()
    gc.collect()
    model = model.to("cuda:0")
    with torch.inference_mode():
        for batch in dataloader:
            model(batch.to("cuda:0"))
    # model.cpu()
    current_memory = process.memory_info().rss
    memory_usage = convert_bytes_to_megabytes(current_memory - initial_memory)
    print(f"Iteration {i + 1}: Resident Memory used: {memory_usage:.3f} MB")
    memory_usages.append(memory_usage)
```
... (results produced with torch nightly 2.2.0.dev20230920+cu121)

While the memory increase is definitely smaller, it is still a steady slope. I suppose that on a production system with thousands of requests these few MB could add up. I'm definitely not familiar with memory management in Python and PyTorch, but there seems to be some hidden state somewhere that's not just in Lightning. Perhaps the impact is just amplified by Lightning and the root cause is something else.
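(For readers trying to run the loop above: it relies on setup that the "..." elides. A hypothetical stand-in could look like the sketch below; the model, dataloader, and iteration count are placeholders, not the original script's.)

```python
import gc

import psutil
import torch
from torch.utils.data import DataLoader


def convert_bytes_to_megabytes(num_bytes: int) -> float:
    # psutil reports RSS in bytes
    return num_bytes / 1024 ** 2


N_ITERATIONS = 1000

# Placeholder model and data -- the original script uses its own model/dataloader.
model = torch.nn.Linear(1024, 10)
dataloader = DataLoader(torch.randn(64, 1024), batch_size=8)

process = psutil.Process()                   # the current Python process
initial_memory = process.memory_info().rss   # resident set size before the loop
memory_usages = []
```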
Thanks for collecting more data here. So then, hypothetically, if we were to insert a …
Yes, it seems that the result is the same.
@carmocca What are your thoughts on adding a `gc.collect()` call in teardown?
I'm leaning towards not adding it. Instantiating trainers like this in a loop is very unconventional, and there is a cost to triggering `gc.collect()`. If somebody can explain the cause of this, we would be better informed to create a fix: either by improving the reference counts or by adding this `gc.collect()` call.
@carmocca Just to clarify: above we've determined that the Trainer releases these objects, so their refcount is actually 0. It's just that the GC does not collect them from memory quickly enough. Adding `gc.collect()` would only force that collection to happen sooner.
I agree. In light of this I am also ok with closing this issue. But by the same argument, I am also ok with adding the `gc.collect()` call.
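(As a general-Python illustration of the point about refcounts and delayed collection, independent of Lightning: objects trapped in reference cycles are not freed by refcounting alone and sit in memory until the cyclic garbage collector runs, which is exactly what an explicit `gc.collect()` forces. The snippet below is only a demonstration of that behavior.)

```python
import gc


class Node:
    """Tiny object that forms a reference cycle with itself."""

    def __init__(self):
        self.ref = self  # self-reference: refcounting alone can never free this


def make_garbage():
    # The Node becomes unreachable when this returns, but the cycle keeps it
    # in memory until the cyclic garbage collector runs.
    Node()


gc.disable()  # simulate "the GC has not gotten around to it yet"
baseline = len(gc.get_objects())
for _ in range(10_000):
    make_garbage()
print("tracked objects still alive:", len(gc.get_objects()) - baseline)

freed = gc.collect()  # force a full collection, as the proposed teardown change would
print("objects collected:", freed)
gc.enable()
```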
Bug description
The memory consumption (RSS memory) continues to grow when `Trainer` is instantiated multiple times during inference.

In our production environment we currently need to instantiate a `Trainer` for each request, which contains 1 image. That's why we observed the OOM issue.

We understand that it might not be the best practice to use Lightning in production; any suggestions / comments are welcome! 😃
The following curve can be reproduced with the provided Python script, running 1000 iterations.
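(The provided script is not inlined here; the pattern it exercises, a fresh `Trainer` per request predicting on a small batch, looks roughly like the sketch below. The module, data, and Trainer flags are placeholders, not the original code.)

```python
import psutil
import torch
from torch.utils.data import DataLoader
import lightning.pytorch as pl


class TinyModel(pl.LightningModule):
    """Placeholder module standing in for the production model."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def predict_step(self, batch, batch_idx):
        return self.layer(batch)


model = TinyModel()
data = torch.randn(64, 32)
process = psutil.Process()
initial_rss = process.memory_info().rss

for request in range(1000):
    # One new Trainer per "request", as described in the bug report.
    trainer = pl.Trainer(
        accelerator="auto",
        devices=1,
        logger=False,
        enable_progress_bar=False,
        enable_checkpointing=False,
    )
    trainer.predict(model, DataLoader(data, batch_size=64))
    rss_mb = (process.memory_info().rss - initial_rss) / 1024 ** 2
    print(f"request {request + 1}: RSS grew by {rss_mb:.1f} MB")
```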
What version are you seeing the problem on?
v2.0
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
The temporary solution to fix this issue is to add `gc.collect()` at the end of the teardown method, while commenting out `self.lightning_module.cpu()` (an application-level sketch of this workaround follows after the list below).

Things that I've tried:

- Only comment out `self.lightning_module.cpu()` -> does not work 🛑
- Only comment out `_optimizers_to_device(self.optimizers, torch.device("cpu"))` -> does not work 🛑
- Comment out both the module-to-CPU and optimizer-to-CPU moves -> does not work 🛑
- Only add `gc.collect()` -> partially works 🟡
- Comment out `_optimizers_to_device(self.optimizers, torch.device("cpu"))` + add `gc.collect()` -> partially works 🟡
- Comment out `self.lightning_module.cpu()` + add `gc.collect()` -> works better 🟢
- Comment out `self.lightning_module.cpu()` and `_optimizers_to_device(self.optimizers, torch.device("cpu"))` + add `gc.collect()` -> similar to the previous one 🟢

cc @Borda
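(For completeness, a hedged sketch of the application-level variant referenced above: rather than patching Lightning's teardown, the calling code drops the Trainer and forces a collection itself after each request. Function and variable names are placeholders.)

```python
import gc

import torch
import lightning.pytorch as pl


def predict_once(model, dataloader):
    """Serve one inference request with a throwaway Trainer, then clean up."""
    trainer = pl.Trainer(logger=False, enable_progress_bar=False,
                         enable_checkpointing=False)
    predictions = trainer.predict(model, dataloader)
    # Mirror the temporary fix discussed above: instead of waiting for the
    # cyclic garbage collector, force a collection once the Trainer is gone.
    del trainer
    gc.collect()
    torch.cuda.empty_cache()  # optional: also release cached CUDA blocks
    return predictions
```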