[REVIEW] tracking memory adaptor to catch memory leaks #596
… adaptors will track allocations to detect leaks and limit the GPU allocations available to the process, respectively. Fixed a small copy/paste issue inside logging_resource_adaptor as well.
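For readers unfamiliar with the idea, a limiting adaptor can be sketched roughly as below. This is only an illustration under stated assumptions: `limiting_adaptor_sketch` and its members are hypothetical names, host `malloc`/`free` stand in for the upstream GPU resource, and none of this is the interface added in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical sketch, not RMM's limiting adaptor. Host memory stands in
// for the upstream device memory resource.
class limiting_adaptor_sketch {
 public:
  explicit limiting_adaptor_sketch(std::size_t limit_bytes) : limit_{limit_bytes} {}

  // Refuse any request that would push the running total over the limit.
  void* allocate(std::size_t bytes) {
    if (bytes > limit_ - used_) { throw std::bad_alloc{}; }
    void* p = std::malloc(bytes);
    if (p == nullptr) { throw std::bad_alloc{}; }
    used_ += bytes;
    return p;
  }

  // The caller passes the size back, so no per-pointer bookkeeping is needed.
  void deallocate(void* p, std::size_t bytes) {
    std::free(p);
    used_ -= bytes;
  }

  std::size_t used_bytes() const { return used_; }

 private:
  std::size_t limit_;
  std::size_t used_{0};
};
```

Note that this sketch only enforces a byte budget. It keeps no record of which pointers are outstanding, so it cannot report leaks on its own.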
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
So the current functionality doesn't yet do what I described in the issue. The original intention of the |
This is a good point and I will add it. Originally, I expected the limiting allocator to handle this, but it really is just about limits at this point. I'll get this added. Where do we stand with respect to boost? I'm unsure if I should take time to try and find a way to pull in the header only or if that part has been dismissed. |
I'm not sure what the latest status is. If there isn't a way to get it via cmake, then we should just vendor the stackframe headers into RMM. |
It would be nice if CPM could help with this. There is FindBoost, which can be used to find specific modules locally, but I don't think Find* is meant to fetch packages like CPM does. Boost has a tool for extracting the source of specific modules, https://www.boost.org/doc/libs/1_74_0/tools/bcp/doc/html/index.html, which would help get the right set of headers for vendoring. But I really hope we don't have to vendor.
I agree and worry about keeping it up to date if we just copy the headers into a vendor directory. I ran |
So I'd summarize the stacktrace situation like this:
Basically, if boost is not installed, it's a pain. So an alternate suggestion would be to make the stack trace functionality optional. This has its own drawbacks, of course: one may now have to deal with the fact that RMM sometimes supports backtraces and sometimes does not. But it may still be the better option. In the usual case, which seems to be using conda, boost is available, because it's used in cudf, too. In fact, in the usual case, none of the FetchContent should be necessary, because a package manager should provide the dependencies. (I say "should" because that's not actually quite the case yet.)
Another option is to decide (as I usually do) to avoid boost, and go back to the original solution (execinfo.h -- GLIBC only) |
I didn't feel the original method was that much more complex than the Boost method and could be made to behave like the Boost one by being an object that is created as well. Note that it didn't include much debug information about the stacks, so there could be something that Boost is providing for us. |
Looks like xgboost uses execinfo. Here's some info on how they protect the include (e.g. GCC only): dmlc/xgboost#3365 I think for now this could be a linux + GCC only feature... |
Ok, I took a stab at changing this back to stdlib stack tracing. The output isn't amazing as we have no debug symbols, but I think it's a step in the right direction:
I made the stack-trace-grabbing code a class to encapsulate all the trace-specific code. I'm unsure whether I should compile out the stack traces entirely when not building on Linux. I feel like they should be gone, but I am worried that code already using them would then need to know whether they're enabled, so I left the class stubbed out. See what you think of this.
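As a rough illustration of the approach described above, a class that captures a trace with `execinfo.h` and degrades to an empty stub elsewhere might look like the following. This is a minimal sketch and not the code in this PR: the name `stack_trace_sketch`, the macro `SKETCH_ENABLE_STACK_TRACES`, and the 64-frame depth are all assumptions, and the guard assumes that GCC-compatible compilers on Linux ship glibc's `execinfo.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: only GCC-compatible compilers on Linux provide <execinfo.h>.
#if defined(__GNUC__) && defined(__linux__)
#define SKETCH_ENABLE_STACK_TRACES 1
#include <execinfo.h>
#include <cstdlib>
#endif

// Hypothetical sketch, not the class from this PR.
class stack_trace_sketch {
 public:
  // Capture the return addresses of the current call stack, if supported.
  stack_trace_sketch() {
#if defined(SKETCH_ENABLE_STACK_TRACES)
    constexpr int max_depth = 64;  // arbitrary cap chosen for this sketch
    void* frames[max_depth];
    int const depth = ::backtrace(frames, max_depth);
    frames_.assign(frames, frames + depth);
#endif
  }

  // Lets callers ask whether traces are available instead of using #ifdef.
  static constexpr bool enabled() {
#if defined(SKETCH_ENABLE_STACK_TRACES)
    return true;
#else
    return false;
#endif
  }

  std::size_t depth() const { return frames_.size(); }

  // Translate the captured addresses into printable strings.
  // Returns an empty vector when traces are disabled.
  std::vector<std::string> symbols() const {
    std::vector<std::string> out;
#if defined(SKETCH_ENABLE_STACK_TRACES)
    if (frames_.empty()) { return out; }
    char** strings = ::backtrace_symbols(frames_.data(), static_cast<int>(frames_.size()));
    if (strings != nullptr) {
      for (std::size_t i = 0; i < frames_.size(); ++i) { out.emplace_back(strings[i]); }
      std::free(strings);  // backtrace_symbols returns one malloc'd block
    }
#endif
    return out;
  }

 private:
  std::vector<void*> frames_;
};
```

Keeping the class present on every platform, with `enabled()` reporting availability, is one way to address the concern about callers needing to know whether traces exist. With glibc, `backtrace_symbols` generally prints function names for an executable's own frames only when it is linked with `-rdynamic`; otherwise the strings tend to contain mostly raw addresses.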
Hmmm, you can't get symbols in a debug build? |
…rmm into mwilson/limiting_adapter
Co-authored-by: Conor Hoekstra <36027403+codereport@users.noreply.github.com> Co-authored-by: Mark Harris <mharris@nvidia.com>
Just a couple more small changes
Co-authored-by: Conor Hoekstra <36027403+codereport@users.noreply.github.com>
lgtm (pending other reviewers changes) 👍
Not sure if you are looking for input, but we needed something like this to track memory usage per test in https://github.com/rapidsai/cuml so I made something very similar (even named it very closely): branch-0.17...mdemoret-nv:enh-add-callback-memory-manager The biggest changes between your PR and my branch are:
For the cuml team, it would be really helpful if (P.S. There is also another adaptor in there called |
The high water mark and total number of allocations would be useful, but can be a separate PR. The Cython bindings should be added in a later PR (since @hyperbolic2346 works on the Spark side, he may not be interested in the Python requirements). @hyperbolic2346 suggest reverting the #ifdef change to make @jrhemstad happy and then we can get this merged. |
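The high water mark and allocation count mentioned here could be kept with a few counters updated on every allocate and deallocate call. The sketch below is hypothetical and not part of this PR or of the linked branch; `allocation_stats_sketch` and its member names are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: counters an adaptor could update on each call.
struct allocation_stats_sketch {
  std::size_t current_bytes{0};      // bytes outstanding right now
  std::size_t peak_bytes{0};         // high water mark of current_bytes
  std::size_t total_allocations{0};  // number of allocate calls seen

  void record_allocate(std::size_t bytes) {
    current_bytes += bytes;
    peak_bytes = std::max(peak_bytes, current_bytes);
    ++total_allocations;
  }

  // The peak is deliberately left untouched when memory is released.
  void record_deallocate(std::size_t bytes) { current_bytes -= bytes; }
};
```

For per-test tracking, one option would be to reset or snapshot such a struct between tests and compare the values afterwards.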
Opened a new issue to keep track of the requests from @mdemoret-nv |
Excellent, I'll start a PR here shortly if that works for you. |
Tracking adaptor will track memory allocations and can return or log remaining allocations to track leaks. Fixes #467.
Rebased on branch 0.17. Invalidates #575.
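The core idea of tracking allocations to find leaks can be sketched as follows. This is an illustration under assumptions, not the adaptor added by this PR: `tracking_adaptor_sketch` and its methods are hypothetical names, host `malloc`/`free` stand in for the upstream device resource, and the sketch is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

// Hypothetical sketch, not RMM's tracking adaptor. Not thread-safe.
class tracking_adaptor_sketch {
 public:
  // Forward to the upstream allocator and remember the pointer and its size.
  void* allocate(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p != nullptr) { outstanding_.emplace(p, bytes); }
    return p;
  }

  // Forget the pointer, then release it upstream.
  void deallocate(void* p) {
    outstanding_.erase(p);
    std::free(p);
  }

  // Anything still in this map at shutdown was never freed, i.e. a leak.
  std::map<void*, std::size_t> const& outstanding_allocations() const { return outstanding_; }

  std::size_t outstanding_bytes() const {
    std::size_t total = 0;
    for (auto const& entry : outstanding_) { total += entry.second; }
    return total;
  }

 private:
  std::map<void*, std::size_t> outstanding_;
};
```

A real adaptor would also need a lock around the map, and could store a captured stack trace next to each size so that a leak report shows where the allocation was made.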