[REVIEW] tracking memory adaptor to catch memory leaks #596

hyperbolic2346 · 2020-10-06T01:52:49Z

Tracking adaptor will track memory allocations and can return or log remaining allocations to track leaks. Fixes #467.

Rebased on branch 0.17. Invalidates #575.

… adaptors will track allocations to detect leaks and limit the GPU allocations available to the process, respectively. Fixed a small copy/paste issue inside logging_resource_adaptor as well.

…p here.

GPUtester · 2020-10-06T01:53:13Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

jrhemstad · 2020-10-07T19:29:45Z

So the current functionality doesn't yet do what I described in the issue. The original intention of the tracking_resource_adaptor was to track sum(allocations) - sum(deallocations). It doesn't look like that is being done.

hyperbolic2346 · 2020-10-08T15:44:24Z

So the current functionality doesn't yet do what I described in the issue. The original intention of the tracking_resource_adaptor was to track sum(allocations) - sum(deallocations). It doesn't look like that is being done.

This is a good point and I will add it. Originally, I expected the limiting allocator to handle this, but it really is just about limits at this point. I'll get this added.

Where do we stand with respect to boost? I'm unsure if I should take time to try and find a way to pull in the header only or if that part has been dismissed.

jrhemstad · 2020-10-08T15:51:32Z

Where do we stand with respect to boost?

I'm not sure what the latest status is. If there isn't a way to get it via cmake, then we should just vendor the stackframe headers into RMM.

harrism · 2020-10-08T22:59:50Z

It would be nice if CPM could help with this. There is FindBoost which can be used to find specific modules locally, but I don't think Find* is meant to fetch packages like CPM does.

Boost has a tool for extracting the source of specific modules: https://www.boost.org/doc/libs/1_74_0/tools/bcp/doc/html/index.html , which would help get the right set of headers for vendoring. But I really hope we don't have to vendor.

hyperbolic2346 · 2020-10-12T14:59:07Z

Boost has a tool for extracting the source of specific modules: https://www.boost.org/doc/libs/1_74_0/tools/bcp/doc/html/index.html , which would help get the right set of headers for vendoring. But I really hope we don't have to vendor.

I agree and worry about keeping it up to date if we just copy the headers into a vendor directory. I ran bcp and imported stacktrace and it pulls in 355 header files. :(

germasch · 2020-10-12T15:40:37Z

So I'd summarize the stacktrace situation like this:

If boost is installed, it's easily pulled in via find_package()
If it is not, it could be fetched, but it's huge -- and it can't directly be used via add_subdirectory(), though that's not too much of a hassle to make work.
Alternatively, one could make an alternate repo that is fetched that just contains the 355 needed header files, or vendor them directly, which both come with maintenance needs.

Basically, if boost is not installed, it's a pain. So an alternate suggestion would be to make the stack trace functionality optional. This has its own drawbacks, of course, in that one may now have to deal with the fact that RMM sometimes supports back traces and sometimes it does not. But it may still be the better option. In the usual case, which seems to be using conda, boost is available, because it's used in cudf, too. In fact, in the usual case, none of the FetchContent should be necessary, when using a package manager, that should provide dependencies. (I say "should" because that' s not actually quite the case yet.)

harrism · 2020-10-12T22:19:31Z

Another option is to decide (as I usually do) to avoid boost, and go back to the original solution (execinfo.h -- GLIBC only)

hyperbolic2346 · 2020-10-12T22:28:29Z

I didn't feel the original method was that much more complex than the Boost method and could be made to behave like the Boost one by being an object that is created as well. Note that it didn't include much debug information about the stacks, so there could be something that Boost is providing for us.

harrism · 2020-10-12T22:43:45Z

Looks like xgboost uses execinfo. Here's some info on how they protect the include (e.g. GCC only): dmlc/xgboost#3365

I think for now this could be a linux + GCC only feature...

hyperbolic2346 · 2020-10-19T01:29:07Z

Ok, I took a stab at changing this back to stdlib stack tracing. The output isn't amazing as we have no debug symbols, but I think it's a step in the right direction:

0x7fe886a00000: 10485760 B : callstack:
#0 in ./build/release/gtests/TRACKING_TEST(+0x3a8a0) [0x56044a4a38a0]
#1 in ./build/release/gtests/TRACKING_TEST(+0x1596d) [0x56044a47e96d]
#2 in ./build/release/gtests/TRACKING_TEST(+0x64c5a) [0x56044a4cdc5a]
#3 in ./build/release/gtests/TRACKING_TEST(+0x59d7f) [0x56044a4c2d7f]
#4 in ./build/release/gtests/TRACKING_TEST(+0x5a21a) [0x56044a4c321a]
#5 in ./build/release/gtests/TRACKING_TEST(+0x5a428) [0x56044a4c3428]
#6 in ./build/release/gtests/TRACKING_TEST(+0x5b625) [0x56044a4c4625]
#7 in ./build/release/gtests/TRACKING_TEST(+0x5b8d8) [0x56044a4c48d8]
#8 in ./build/release/gtests/TRACKING_TEST(+0x14679) [0x56044a47d679]
#9 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fe8b9ce8b97]
#10 in ./build/release/gtests/TRACKING_TEST(+0x148ca) [0x56044a47d8ca]

I made the stack trace grabbing code a class to encapsulate all the trace-specific code. I'm unsure if I should comment out the stack traces when not compiling in linux or not. I feel like they should be gone, but I am worried about code that might be using it now needing to know if they're enabled or not, so I left it stubbed out.

See what you think of this.

harrism · 2020-10-19T01:39:01Z

Hmmm, you can't get symbols in a debug build?

…rmm into mwilson/limiting_adapter

include/rmm/detail/stack_trace.hpp

include/rmm/mr/device/tracking_resource_adaptor.hpp

tests/CMakeLists.txt

tests/mr/device/tracking_mr_tests.cpp

include/rmm/detail/stack_trace.hpp

Co-authored-by: Conor Hoekstra <36027403+codereport@users.noreply.github.com> Co-authored-by: Mark Harris <mharris@nvidia.com>

codereport

Just a couple more small changes

include/rmm/mr/device/tracking_resource_adaptor.hpp

Co-authored-by: Conor Hoekstra <36027403+codereport@users.noreply.github.com>

codereport

lgtm (pending other reviewers changes) 👍

…ting_adapter

mdemoret-nv · 2020-10-30T17:02:31Z

Not sure if you are looking for input, but we needed something like this to track memory usage per test in https://github.com/rapidsai/cuml so I made something very similar (even named it very closely): branch-0.17...mdemoret-nv:enh-add-callback-memory-manager

The biggest changes between your PR and my branch are:

You collect stack traces (would be very helpful but wasn't necessary for what I was doing)
My tracked memory adaptor also keeps track of total bytes allocated, current outstanding bytes, peak bytes allocated, and total number of allocations
I added cython wrappers around the new adaptor so the info could be queried from pytest and reset before each test

For the cuml team, it would be really helpful if #2 and #3 above could be added to this PR (If not, I can make the changes in a later PR). Having the ability to track peak allocations is very useful for us since this ultimately decides the max size on datasets that can be loaded. And being able to query/reset this information from python would be allow tracking this information by test.

(P.S. There is also another adaptor in there called callback_memory_adaptor which allows setting a callback from both python and C++ to be executed on every allocation/deallocation. It worked great but was ultimately too slow so I switched to tracking it on the C++ side. Not sure if this would be a useful addition anywhere in rmm)

harrism · 2020-11-01T22:55:19Z

The high water mark and total number of allocations would be useful, but can be a separate PR. The Cython bindings should be added in a later PR (since @hyperbolic2346 works on the Spark side, he may not be interested in the Python requirements).

@hyperbolic2346 suggest reverting the #ifdef change to make @jrhemstad happy and then we can get this merged.

include/rmm/detail/stack_trace.hpp

hyperbolic2346 · 2020-11-02T21:30:18Z

Opened a new issue to keep track of the requests from @mdemoret-nv

mdemoret-nv · 2020-11-02T23:42:07Z

Opened a new issue to keep track of the requests from @mdemoret-nv

Excellent, I'll start a PR here shortly if that works for you.

hyperbolic2346 and others added 9 commits October 6, 2020 00:48

Adding tracking resource adaptor and limiting resource adaptor. These…

c0ed60f

… adaptors will track allocations to detect leaks and limit the GPU allocations available to the process, respectively. Fixed a small copy/paste issue inside logging_resource_adaptor as well.

Adding changelog

3256cb8

Changes from review comments.

a78b7fe

Updating based on review comments.

54dd003

Updating from review comments to use boost for stack traces

ecfff46

Updating from some great review comments. I really appreciate the hel…

75045c0

…p here.

Create CODE_OF_CONDUCT.md (rapidsai#590)

7601430

Removing limiting adaptor from this branch

1f103b3

Removing header includes as requested in review

0f7a037

hyperbolic2346 requested review from a team as code owners October 6, 2020 01:52

hyperbolic2346 requested review from jrhemstad and codereport October 6, 2020 01:52

Updating changelog with new PR #

64baa22

hyperbolic2346 changed the title ~~Mwilson/limiting adapter~~ [REVIEW] tracking memory adaptor to catch memory leaks Oct 6, 2020

hyperbolic2346 mentioned this pull request Oct 6, 2020

[REVIEW] Adding tracking and limiting adaptors #575

Closed

Adding total allocation amount to tracking adaptor.

a999506

Adding stdlib stack traces back in

64b13a2

hyperbolic2346 added 3 commits October 26, 2020 15:52

Merge branch 'mwilson/limiting_adapter' of github.com:hyperbolic2346/…

51d0f78

…rmm into mwilson/limiting_adapter

fixing change merged in github

4c988cf

linting

1900f89

codereport suggested changes Oct 27, 2020

View reviewed changes

jrhemstad reviewed Oct 27, 2020

View reviewed changes

include/rmm/detail/stack_trace.hpp Outdated Show resolved Hide resolved

hyperbolic2346 and others added 2 commits October 27, 2020 12:20

Apply suggestions from code review

2b48710

Co-authored-by: Conor Hoekstra <36027403+codereport@users.noreply.github.com> Co-authored-by: Mark Harris <mharris@nvidia.com>

linting

8cfa74a

codereport suggested changes Oct 27, 2020

View reviewed changes

include/rmm/mr/device/tracking_resource_adaptor.hpp Outdated Show resolved Hide resolved

include/rmm/mr/device/tracking_resource_adaptor.hpp Outdated Show resolved Hide resolved

Apply suggestions from code review

fca904b

Co-authored-by: Conor Hoekstra <36027403+codereport@users.noreply.github.com>

codereport approved these changes Oct 27, 2020

View reviewed changes

Merge remote-tracking branch 'upstream/branch-0.17' into mwilson/limi…

e1724f6

…ting_adapter

harrism added the 3 - Ready for review Ready for review by team label Oct 28, 2020

Updating to support and fix compilation errors

1794924

jrhemstad reviewed Nov 2, 2020

View reviewed changes

include/rmm/detail/stack_trace.hpp Outdated Show resolved Hide resolved

Changes for review comments: ifdefs inside functions and define rename

c40c99d

jrhemstad approved these changes Nov 2, 2020

View reviewed changes

harrism approved these changes Nov 2, 2020

View reviewed changes

hyperbolic2346 mentioned this pull request Nov 2, 2020

[FEA] Extend tracking_resource_adaptor to track current bytes allocated, peak bytes allocated, and total # allocations #622

Closed

harrism merged commit b1ac445 into rapidsai:branch-0.17 Nov 2, 2020

harrism mentioned this pull request Nov 2, 2020

[FEA] Add Cython bindings for tracking_resource_adaptor #623

Closed

hyperbolic2346 deleted the mwilson/limiting_adapter branch November 3, 2020 18:28

This was referenced Feb 16, 2021

[FEA] Extend logging_resource_adaptor to optionally include backtrace #452

Closed

[FEA] Add a "tracking" resource adaptor #415

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[REVIEW] tracking memory adaptor to catch memory leaks #596

[REVIEW] tracking memory adaptor to catch memory leaks #596

hyperbolic2346 commented Oct 6, 2020

GPUtester commented Oct 6, 2020

jrhemstad commented Oct 7, 2020 •

edited

Loading

hyperbolic2346 commented Oct 8, 2020

jrhemstad commented Oct 8, 2020

harrism commented Oct 8, 2020

hyperbolic2346 commented Oct 12, 2020

germasch commented Oct 12, 2020

harrism commented Oct 12, 2020

hyperbolic2346 commented Oct 12, 2020

harrism commented Oct 12, 2020

hyperbolic2346 commented Oct 19, 2020

harrism commented Oct 19, 2020

codereport left a comment

codereport left a comment

mdemoret-nv commented Oct 30, 2020

harrism commented Nov 1, 2020

hyperbolic2346 commented Nov 2, 2020

mdemoret-nv commented Nov 2, 2020

[REVIEW] tracking memory adaptor to catch memory leaks #596

[REVIEW] tracking memory adaptor to catch memory leaks #596

Conversation

hyperbolic2346 commented Oct 6, 2020

GPUtester commented Oct 6, 2020

jrhemstad commented Oct 7, 2020 • edited Loading

hyperbolic2346 commented Oct 8, 2020

jrhemstad commented Oct 8, 2020

harrism commented Oct 8, 2020

hyperbolic2346 commented Oct 12, 2020

germasch commented Oct 12, 2020

harrism commented Oct 12, 2020

hyperbolic2346 commented Oct 12, 2020

harrism commented Oct 12, 2020

hyperbolic2346 commented Oct 19, 2020

harrism commented Oct 19, 2020

codereport left a comment

Choose a reason for hiding this comment

codereport left a comment

Choose a reason for hiding this comment

mdemoret-nv commented Oct 30, 2020

harrism commented Nov 1, 2020

hyperbolic2346 commented Nov 2, 2020

mdemoret-nv commented Nov 2, 2020

jrhemstad commented Oct 7, 2020 •

edited

Loading