Test that RAPIDS_NO_INITIALIZE means no cuInit #12361
Conversation
When RAPIDS_NO_INITIALIZE is set, importing cudf is not allowed to create a CUDA context. This is quite delicate since calls arbitrarily far down the import stack _might_ create one. To spot such problems, build a small shared library that interposes our own version of cuInit, and run a test importing cudf in a subprocess with that library LD_PRELOADed. If everything is kosher, we should not observe any calls to cuInit. If one observes bad behaviour, the culprit can then be manually tracked down in a debugger by breaking on our cuInit implementation.
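For reference, a minimal sketch of the interposition idea (illustrative only, not this PR's actual implementation; the file name and library name below are made up):

```cpp
// intercept.cpp -- hypothetical minimal LD_PRELOAD interposer.
// Build (sketch):  g++ -shared -fPIC -o libcuinit_intercept.so intercept.cpp -ldl
// Run:             LD_PRELOAD=$PWD/libcuinit_intercept.so python -c "import cudf"
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <cstdio>

// CUresult is an int-sized enum in cuda.h; plain int stands in for it here so
// the sketch does not need the CUDA headers.
extern "C" int cuInit(unsigned int flags) {
  // Any call that resolves cuInit by symbol name lands here first.
  std::fprintf(stderr, "cuInit(%u) intercepted\n", flags);
  // Forward to the real driver implementation so the program still works.
  auto real = reinterpret_cast<int (*)(unsigned int)>(dlsym(RTLD_NEXT, "cuInit"));
  return real != nullptr ? real(flags) : 0;  // 0 == CUDA_SUCCESS
}
```

Note that, as discussed further down the thread, this only catches callers that resolve cuInit by name at link/load time; code that goes through dlopen/dlsym or cuGetProcAddress needs the extra hooks described below.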
We could also consider using
Some signposts
  target_link_libraries(cudfcuinit_intercept PRIVATE conda_env)
endif()
target_link_libraries(cudfcuinit_intercept PUBLIC CUDA::cudart cuda dl)
Need some help here, I'm completely flying blind, and this is wrong AFAICT.
Basically, I have a single file that I want to compile into a shared library and link against libdl and libcuda.
Why do you need to link to libdl? I think linking to libc is sufficient for your purposes (dlfcn). The linking to CUDA seems reasonable here (although if you care specifically about whether it's dynamically or statically linked you will want to set the CUDA_RUNTIME_LIBRARY property).
I think it is only glibc 2.34 and later where you don't need to link libdl to get access to dlsym and friends (see https://sourceware.org/pipermail/libc-alpha/2021-August/129718.html and bminor/glibc@77f876c) unless I am misunderstanding something.
In any case, would be very happy for someone who knows what they are doing to help rewrite this part of the patch completely.
I actually don't need cudart at all, only -lcuda, which I think I should get with CUDA::cuda_driver?
}
}
}  // namespace
#endif
This same stuff could easily be extended to address @jrhemstad's request in #11546 that one test that RMM is the only allocator of memory.
original_dlsym = (dlsym_t)dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
if (original_dlsym) {
  original_cuGetProcAddress = (proc_t)original_dlsym(RTLD_NEXT, "cuGetProcAddress");
}
For driver calls there are two ways python libraries resolve them:
- [numba does this] dlopen libcuda.so and then dlsym on the handle
- [cuda-python does this] dlopen libcuda, dlsym cuGetProcAddress, and then call cuGetProcAddress to get the driver symbol
So unfortunately, it's not sufficient to just define cuInit in this shared library and override the symbol resolution via LD_PRELOAD. We have to instead patch into dlsym and cuGetProcAddress. The latter is easy, the former is hard (we can't just dlsym(RTLD_NEXT, ...) here because that would call the local function). Instead, we use GLIBC's versioned lookup dlvsym, but now we need to match the glibc version exactly in the running environment (this is the one my conda environment has).
I guess I could spin over a bunch of versions until I find the right one.
Any other suggestions gratefully received.
but now we need to match the glibc version exactly in the running environment (this is the one my conda environment has).
Versioning is not quite as bad as this, 2.2.5 is a magic number but will be stable forever (due to glibc's forward-compat guarantee).
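To make the dlvsym trick concrete, here is an illustrative sketch (not the PR's exact code; the helper names are made up). As noted above, "GLIBC_2.2.5" is the x86-64 base symbol version and keeps resolving on newer glibcs; other architectures use a different base version.

```cpp
// Hypothetical sketch: recover the real dlsym without calling our own
// interposed dlsym (a plain dlsym(RTLD_NEXT, "dlsym") would hit the local
// function instead of glibc's).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <cstdio>
#include <cstring>

using dlsym_t = void* (*)(void*, const char*);
static dlsym_t real_dlsym = nullptr;

// Resolve the real dlsym once, at library load time, via glibc's versioned
// lookup. dlvsym itself is resolved normally, so there is no recursion.
__attribute__((constructor)) static void resolve_real_dlsym() {
  real_dlsym = reinterpret_cast<dlsym_t>(dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5"));
}

// Our dlsym shadows glibc's for the whole process (via LD_PRELOAD): report
// lookups of cuInit, then delegate to the real implementation.
extern "C" void* dlsym(void* handle, const char* name) {
  if (name != nullptr && std::strcmp(name, "cuInit") == 0) {
    std::fprintf(stderr, "dlsym(\"cuInit\") intercepted\n");
  }
  return real_dlsym != nullptr ? real_dlsym(handle, name) : nullptr;
}
```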
Could you patch into numba and cuda-python instead?
Could you patch into numba and cuda-python instead?
I can patch into numba, because the cuda interface is implemented in python, but can't do that for either cuda-python or cupy because their cuda interface is implemented in cython (so compiled) and hence monkey-patching won't work.
I also want to avoid a situation where some further third-party dependency is pulled in that also brings up a cuda context (perhaps directly via the C API). Since eventually everyone actually calls into the driver API, this seems like the best place to hook in.
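To illustrate the driver-level hook being argued for here, a sketch of the cuGetProcAddress side (illustrative only; CUresult_int is a stand-in type, the original_cuGetProcAddress pointer is assumed to have been filled in by the dlsym machinery sketched above, and this uses the pre-CUDA-12 four-argument signature discussed further down):

```cpp
// Hypothetical sketch: when a caller asks the driver's lookup function for
// "cuInit", hand back our interposed cuInit instead of the real entry point.
#include <cstring>

using CUresult_int = int;  // stands in for CUresult so cuda.h is not needed
using cuGetProcAddress_t =
    CUresult_int (*)(const char*, void**, int, unsigned long long);

// Assumed to be filled in at load time by the dlsym/dlvsym machinery above.
static cuGetProcAddress_t original_cuGetProcAddress = nullptr;

// Our interposed cuInit, defined elsewhere in this library (assumption).
extern "C" CUresult_int cuInit(unsigned int flags);

extern "C" CUresult_int cuGetProcAddress(const char* symbol,
                                         void** pfn,
                                         int cudaVersion,
                                         unsigned long long flags) {
  if (symbol != nullptr && std::strcmp(symbol, "cuInit") == 0) {
    *pfn = reinterpret_cast<void*>(&cuInit);  // give the caller our cuInit
    return 0;                                 // CUDA_SUCCESS
  }
  // Everything else is forwarded to the real driver entry point.
  return original_cuGetProcAddress != nullptr
             ? original_cuGetProcAddress(symbol, pfn, cudaVersion, flags)
             : 1;  // CUDA_ERROR_INVALID_VALUE as a crude fallback
}
```

In practice the dlsym wrapper also has to redirect lookups of "cuGetProcAddress" itself to this wrapper; otherwise cuda-python would fetch the real one and bypass the hook entirely.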
There are some Linux options that I think are more reliable and do not require patching.
How about something like this: https://stackoverflow.com/questions/5103443/how-to-check-what-shared-libraries-are-loaded-at-run-time-for-a-given-process ?
I could try to work up a script based on this if you'd like.
will that tell me if cuInit is called? I think no
I guess we can run the process and inspect with nvml and try and match that way
This SO post seems to settle on basically the same thing that you do (funnily enough, there's another post about how Citrix copy-pasted this solution disregarding the issues and broke some users).
Due to the extensive dlopening/dlsyming happening, I am not sure that either strace or ltrace or anything like them will be sufficient to detect the calls, which would have been the easier route here as David suggests. If all functions were called by name then I think ltrace would have been sufficient, but as it is you'll only see the dlopen of libcuda.so and then the dlsym of some arbitrary memory address. You could hope that the dlsym calls always use a name for the handle that includes cuInit; I think that would show up? It would probably only catch a subset of cases though.
  &ptr,
  CUDA_VERSION,
  CU_GET_PROC_ADDRESS_DEFAULT
#if CUDA_VERSION >= 12000
ABI change.
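For context on the ABI change: CUDA 12.0 added a fifth out-parameter (a CUdriverProcAddressQueryResult*) to cuGetProcAddress, so code that forwards to the real function has to compile both call shapes. A hedged sketch of the idea (forward_lookup is a made-up name, not this PR's code; link against the driver, e.g. CUDA::cuda_driver):

```cpp
// cuGetProcAddress grew a symbolStatus out-parameter in CUDA 12.0, hence the
// #if on CUDA_VERSION when forwarding a lookup to the real driver.
#include <cuda.h>

CUresult forward_lookup(const char* symbol, void** pfn) {
#if CUDA_VERSION >= 12000
  CUdriverProcAddressQueryResult status;
  return cuGetProcAddress(symbol, pfn, CUDA_VERSION, CU_GET_PROC_ADDRESS_DEFAULT, &status);
#else
  return cuGetProcAddress(symbol, pfn, CUDA_VERSION, CU_GET_PROC_ADDRESS_DEFAULT);
#endif
}
```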
location = Path(__file__)
cpp_build_dir = location / ".." / ".." / ".." / ".." / ".." / "cpp" / "build"
libintercept = (cpp_build_dir / "libcudfcuinit_intercept.so").resolve()
What's the right way to reference this? Right now I'm assuming the build directory exists (because I didn't manage to wrangle cmake to install the library). Equally, however, I'm not sure we really want to install this library?
Oh boy, this is fun. I don't think there is a perfect solution here. FWIW my approach to this in #11875 was to move building the preload lib out of the main libcudf build, build it separately as part of CI, and then just launch tests with the preload library directly from the CLI in CI. That functionality was disabled as part of the Jenkins->GHA migration. Given that you're working on this, it may be time to investigate how to reenable that functionality within GHA.
@robertmaynard do you think that preload libraries like this or the stream verification lib should be built within the main CMakeLists.txt for the library, or shipped along with the conda packages? I had avoided that mostly because in the end we need the paths to the library anyway in order to preload, so it's not a great fit, but I know others had expressed different opinions. Depending on what direction we take with that we will need to adapt the solution in this pytest for how the library is discovered I think.
Presume that the stream verification lib is also a single library. My first thought had been to just compile to the .so as part of the test, referencing the source directory. But then I realised that I need someone to provide information about the compiler configuration and so forth.
Codecov Report
Base: 86.58% // Head: 86.58% // Increases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## branch-23.02 #12361 +/- ##
==============================================
Coverage 86.58% 86.58%
==============================================
Files 155 155
Lines 24368 24507 +139
==============================================
+ Hits 21098 21219 +121
- Misses 3270 3288 +18
void* dlsym(void* handle, const char* name_)
{
  std::string name{name_};
TODO: error handling in all these wrappers in case the resolution of the original functions failed (at which point we can only abort)
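A sketch of what that error handling could look like (illustrative only; require_resolved is a made-up helper, not part of this PR):

```cpp
// Hypothetical helper for the TODO above: if resolving one of the original
// functions failed, the wrappers cannot meaningfully proceed, so fail loudly
// instead of silently returning garbage.
#include <cstdio>
#include <cstdlib>

template <typename Fn>
Fn require_resolved(Fn fn, const char* what) {
  if (fn == nullptr) {
    std::fprintf(stderr, "cuInit interposer: could not resolve %s; aborting\n", what);
    std::abort();
  }
  return fn;
}

// Usage sketch inside a wrapper:
//   auto real = require_resolved(real_dlsym, "dlsym");
//   return real(handle, name);
```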
Given the level of complexity we'd be introducing here (dlsyming dlsym itself seems like a massive minefield) I wonder if there might not be an easier approach altogether. We were at one point making a push to ensure that
How about using nvprof as suggested above?
I think we should consider the two separate efforts. Yes, enabling
I'm fine with the nvprof solution too. That seems like the simplest and most direct approach for this particular problem. My mentioning of … Anyway, I don't want to derail this discussion too far. If the nvprof solution is sufficient, no need to try to also address the
That's kind of all it does (since
We would likely see more opaque errors.
For the purposes of testing, it seems like just running with
I couldn't get
An alternate approach to that tried in rapidsai#12361, here we just script GDB and check if we hit a breakpoint in cuInit. When RAPIDS_NO_INITIALIZE is set in the environment, merely importing cudf should not call into the CUDA runtime/driver (i.e. no cuInit should be called). Conversely, to check that we are scripting GDB properly, when we create a cudf object, we definitely _should_ hit cuInit.
Closing in favour of #12545
An alternate approach to that tried in #12361, here we just script GDB and check if we hit a breakpoint in cuInit. When RAPIDS_NO_INITIALIZE is set in the environment, merely importing cudf should not call into the CUDA runtime/driver (i.e. no cuInit should be called). Conversely, to check that we are scripting GDB properly, when we create a cudf object, we definitely _should_ hit cuInit. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Ashwin Srinath (https://github.com/shwina) - Vyas Ramasubramani (https://github.com/vyasr) URL: #12545