Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory Leak and Performance Degradation in GraphMemoryManagement due to Unreleased Nodes #2487

Closed
jnamika opened this issue Nov 14, 2024 · 4 comments · Fixed by #2488
Closed

Comments

@jnamika
Copy link
Contributor

jnamika commented Nov 14, 2024

Describe the bug
There appears to be a node leak in the GraphMemoryManagement. By adding a debug line at the end of GraphMemoryManagement::free_unavailable_nodes() to display self.nodes.len(), we can observe that the size of self.nodes increases by one with each call.

To Reproduce
The issue can be reproduced regardless of the backend. For example, running burn/examples/mnist demonstrates the memory leak.

Expected behavior
GraphMemoryManagement should delete unneeded nodes as expected, preventing any memory leaks.

Additional context
The GraphMemoryManagement::clear_unused_roots() method performs a linear search over self.nodes, so as training progresses and leaked nodes accumulate, the overall performance gradually degrades.
grafana

@AsherJingkongChen
Copy link
Contributor

Is it relevant to #2042?

@jnamika
Copy link
Contributor Author

jnamika commented Nov 14, 2024

Thank you for pointing this out! This issue may indeed be related to #2042, but it is definitely connected to #2333. In fact, the clear_unused_roots() function was introduced in pull request #2347 to address the memory leak reported in #2333. This function does introduce a linear scan over self.nodes, leading to a performance regression over time.

However, the root cause here seems to be that self.nodes continues to grow due to nodes not being released, which is an issue that predates the introduction of clear_unused_roots(). So ideally, the regression should not be addressed by reverting the changes from #2347. Instead, we should focus on fixing the underlying node management problem to ensure that self.nodes is properly pruned, preventing both memory leaks and performance degradation.

Let me know if I can provide additional details!

@AsherJingkongChen
Copy link
Contributor

AsherJingkongChen commented Nov 14, 2024

@jnamika Could you provide the minimal example? or is MNIST enough? How long does it take to reproduce? 2 days? I'd like to track the issue, but I can't solve it since I am not familiar with GMM.

@jnamika
Copy link
Contributor Author

jnamika commented Nov 14, 2024

Before I could create a minimal example, I found a solution and have submitted the pull request #2488 with the fix.
It’s a very small change; could you review it when convenient?
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants