Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] tracking memory adaptor to catch memory leaks #596

Merged
merged 27 commits into from
Nov 2, 2020

Conversation

hyperbolic2346
Copy link
Contributor

Tracking adaptor will track memory allocations and can return or log remaining allocations to track leaks. Fixes #467.

Rebased on branch 0.17. Invalidates #575.

@GPUtester
Copy link
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@hyperbolic2346 hyperbolic2346 changed the title Mwilson/limiting adapter [REVIEW] tracking memory adaptor to catch memory leaks Oct 6, 2020
@jrhemstad
Copy link
Contributor

jrhemstad commented Oct 7, 2020

So the current functionality doesn't yet do what I described in the issue. The original intention of the tracking_resource_adaptor was to track sum(allocations) - sum(deallocations). It doesn't look like that is being done.

@hyperbolic2346
Copy link
Contributor Author

So the current functionality doesn't yet do what I described in the issue. The original intention of the tracking_resource_adaptor was to track sum(allocations) - sum(deallocations). It doesn't look like that is being done.

This is a good point and I will add it. Originally, I expected the limiting allocator to handle this, but it really is just about limits at this point. I'll get this added.

Where do we stand with respect to boost? I'm unsure if I should take time to try and find a way to pull in the header only or if that part has been dismissed.

@jrhemstad
Copy link
Contributor

Where do we stand with respect to boost?

I'm not sure what the latest status is. If there isn't a way to get it via cmake, then we should just vendor the stackframe headers into RMM.

@harrism
Copy link
Member

harrism commented Oct 8, 2020

It would be nice if CPM could help with this. There is FindBoost which can be used to find specific modules locally, but I don't think Find* is meant to fetch packages like CPM does.

Boost has a tool for extracting the source of specific modules: https://www.boost.org/doc/libs/1_74_0/tools/bcp/doc/html/index.html , which would help get the right set of headers for vendoring. But I really hope we don't have to vendor.

@hyperbolic2346
Copy link
Contributor Author

Boost has a tool for extracting the source of specific modules: https://www.boost.org/doc/libs/1_74_0/tools/bcp/doc/html/index.html , which would help get the right set of headers for vendoring. But I really hope we don't have to vendor.

I agree and worry about keeping it up to date if we just copy the headers into a vendor directory. I ran bcp and imported stacktrace and it pulls in 355 header files. :(

@germasch
Copy link
Contributor

So I'd summarize the stacktrace situation like this:

  • If boost is installed, it's easily pulled in via find_package()
  • If it is not, it could be fetched, but it's huge -- and it can't directly be used via add_subdirectory(), though that's not too much of a hassle to make work.
  • Alternatively, one could make an alternate repo that is fetched that just contains the 355 needed header files, or vendor them directly, which both come with maintenance needs.

Basically, if boost is not installed, it's a pain. So an alternate suggestion would be to make the stack trace functionality optional. This has its own drawbacks, of course, in that one may now have to deal with the fact that RMM sometimes supports back traces and sometimes it does not. But it may still be the better option. In the usual case, which seems to be using conda, boost is available, because it's used in cudf, too. In fact, in the usual case, none of the FetchContent should be necessary, when using a package manager, that should provide dependencies. (I say "should" because that' s not actually quite the case yet.)

@harrism
Copy link
Member

harrism commented Oct 12, 2020

Another option is to decide (as I usually do) to avoid boost, and go back to the original solution (execinfo.h -- GLIBC only)

@hyperbolic2346
Copy link
Contributor Author

I didn't feel the original method was that much more complex than the Boost method and could be made to behave like the Boost one by being an object that is created as well. Note that it didn't include much debug information about the stacks, so there could be something that Boost is providing for us.

@harrism
Copy link
Member

harrism commented Oct 12, 2020

Looks like xgboost uses execinfo. Here's some info on how they protect the include (e.g. GCC only): dmlc/xgboost#3365

I think for now this could be a linux + GCC only feature...

@hyperbolic2346
Copy link
Contributor Author

Ok, I took a stab at changing this back to stdlib stack tracing. The output isn't amazing as we have no debug symbols, but I think it's a step in the right direction:

0x7fe886a00000: 10485760 B : callstack:
#0 in ./build/release/gtests/TRACKING_TEST(+0x3a8a0) [0x56044a4a38a0]
#1 in ./build/release/gtests/TRACKING_TEST(+0x1596d) [0x56044a47e96d]
#2 in ./build/release/gtests/TRACKING_TEST(+0x64c5a) [0x56044a4cdc5a]
#3 in ./build/release/gtests/TRACKING_TEST(+0x59d7f) [0x56044a4c2d7f]
#4 in ./build/release/gtests/TRACKING_TEST(+0x5a21a) [0x56044a4c321a]
#5 in ./build/release/gtests/TRACKING_TEST(+0x5a428) [0x56044a4c3428]
#6 in ./build/release/gtests/TRACKING_TEST(+0x5b625) [0x56044a4c4625]
#7 in ./build/release/gtests/TRACKING_TEST(+0x5b8d8) [0x56044a4c48d8]
#8 in ./build/release/gtests/TRACKING_TEST(+0x14679) [0x56044a47d679]
#9 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fe8b9ce8b97]
#10 in ./build/release/gtests/TRACKING_TEST(+0x148ca) [0x56044a47d8ca]

I made the stack trace grabbing code a class to encapsulate all the trace-specific code. I'm unsure if I should comment out the stack traces when not compiling in linux or not. I feel like they should be gone, but I am worried about code that might be using it now needing to know if they're enabled or not, so I left it stubbed out.

See what you think of this.

@harrism
Copy link
Member

harrism commented Oct 19, 2020

Hmmm, you can't get symbols in a debug build?

hyperbolic2346 and others added 2 commits October 27, 2020 12:20
Co-authored-by: Conor Hoekstra <36027403+codereport@users.noreply.github.com>
Co-authored-by: Mark Harris <mharris@nvidia.com>
Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple more small changes

Co-authored-by: Conor Hoekstra <36027403+codereport@users.noreply.github.com>
Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm (pending other reviewers changes) 👍

@harrism harrism added the 3 - Ready for review Ready for review by team label Oct 28, 2020
@mdemoret-nv
Copy link
Contributor

Not sure if you are looking for input, but we needed something like this to track memory usage per test in https://github.com/rapidsai/cuml so I made something very similar (even named it very closely): branch-0.17...mdemoret-nv:enh-add-callback-memory-manager

The biggest changes between your PR and my branch are:

  1. You collect stack traces (would be very helpful but wasn't necessary for what I was doing)
  2. My tracked memory adaptor also keeps track of total bytes allocated, current outstanding bytes, peak bytes allocated, and total number of allocations
  3. I added cython wrappers around the new adaptor so the info could be queried from pytest and reset before each test

For the cuml team, it would be really helpful if #2 and #3 above could be added to this PR (If not, I can make the changes in a later PR). Having the ability to track peak allocations is very useful for us since this ultimately decides the max size on datasets that can be loaded. And being able to query/reset this information from python would be allow tracking this information by test.

(P.S. There is also another adaptor in there called callback_memory_adaptor which allows setting a callback from both python and C++ to be executed on every allocation/deallocation. It worked great but was ultimately too slow so I switched to tracking it on the C++ side. Not sure if this would be a useful addition anywhere in rmm)

@harrism
Copy link
Member

harrism commented Nov 1, 2020

The high water mark and total number of allocations would be useful, but can be a separate PR. The Cython bindings should be added in a later PR (since @hyperbolic2346 works on the Spark side, he may not be interested in the Python requirements).

@hyperbolic2346 suggest reverting the #ifdef change to make @jrhemstad happy and then we can get this merged.

@hyperbolic2346
Copy link
Contributor Author

Opened a new issue to keep track of the requests from @mdemoret-nv

@harrism harrism merged commit b1ac445 into rapidsai:branch-0.17 Nov 2, 2020
@mdemoret-nv
Copy link
Contributor

Opened a new issue to keep track of the requests from @mdemoret-nv

Excellent, I'll start a PR here shortly if that works for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for review Ready for review by team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants