Identify feature/options necessary to support OpenCL and OpenMP #14858

Open
calderpg-tri opened this issue Apr 1, 2021 · 13 comments
Labels: component: distribution (Nightly binaries, monthly releases, docker, installation); priority: medium; type: feature request

Comments

@calderpg-tri
Contributor

In service of #14431 we would like to add support for OpenCL (cross-platform GPU acceleration) and OpenMP (directive-based parallelization). Both of these dependencies add runtime components, which may be more (OpenMP) or less (OpenCL) problematic for Drake users.

OpenCL

#14843 adds OpenCL support for Ubuntu and Mac, used by a test in external voxelized_geometry_tools. We expect in the future that OpenCL will be used as part of planning code moved to Drake and thus be shipped in some/all binary forms of Drake. Broadly, our OpenCL uses the Installable Client Driver mechanism, by which our code links to the ICD loader and at runtime enumerates the available OpenCL platforms and devices. If no OpenCL platform/device is available our code will fall back to a different implementation, and thus Drake will not require OpenCL execution to be available.
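As a rough illustration of the enumerate-then-fall-back pattern described above, here is a minimal sketch. It is not the code from #14843: the `EXAMPLE_HAS_OPENCL` build flag and the `CountOpenClDevices` / `ChooseBackend` names are hypothetical. Only `clGetPlatformIDs` and `clGetDeviceIDs` are real OpenCL API calls, and they are compiled only when the flag is set.

```cpp
#include <vector>

#ifdef EXAMPLE_HAS_OPENCL  // Hypothetical build flag, not a Drake option.
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif
#endif

// Returns the number of OpenCL devices visible through the ICD loader.
// Returns 0 when built without OpenCL or when no platform is installed.
int CountOpenClDevices() {
#ifdef EXAMPLE_HAS_OPENCL
  cl_uint num_platforms = 0;
  if (clGetPlatformIDs(0, nullptr, &num_platforms) != CL_SUCCESS ||
      num_platforms == 0) {
    return 0;
  }
  std::vector<cl_platform_id> platforms(num_platforms);
  if (clGetPlatformIDs(num_platforms, platforms.data(), nullptr) !=
      CL_SUCCESS) {
    return 0;
  }
  int total = 0;
  for (cl_platform_id platform : platforms) {
    cl_uint num_devices = 0;
    // A platform with no devices reports an error; it contributes 0.
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, nullptr,
                       &num_devices) == CL_SUCCESS) {
      total += static_cast<int>(num_devices);
    }
  }
  return total;
#else
  return 0;
#endif
}

enum class Backend { kOpenCl, kCpu };

// Picks the OpenCL path only when at least one device was found.
Backend ChooseBackend(int num_devices) {
  return num_devices > 0 ? Backend::kOpenCl : Backend::kCpu;
}
```

A caller would evaluate `ChooseBackend(CountOpenClDevices())` once at startup, so a machine with no OpenCL platform takes the CPU path without error.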

Concerns/risks:

  • We believe that the runtime element should be minimal in the case of code that doesn't use OpenCL and that it shouldn't conflict with other software users want to integrate with Drake, but have not confirmed this yet (and doing so will require some feedback from the community).

  • The OpenCL execution model means that kernels are not compiled until run by a specific platform. So long as planning tools and externals are tested outside of Drake against a number of OpenCL implementations, we will not necessarily require that Drake CI support OpenCL execution (i.e. we don't need instances with GPUs). This could change in the future.

  • Apple has officially deprecated both OpenGL and OpenCL; however support for both continues to be available on Big Sur. Should this change in the future, we will need to remove support for OpenCL on Mac and potentially add additional Mac-specific implementation(s) of our planning tools.

OpenMP

OpenMP requires both compiler support and a runtime component. On Ubuntu platforms this is quite easy to integrate with a set of compile and link flags (although these flags differ somewhat between GCC and Clang). However, Apple does not provide the runtime library and partially disables OpenMP support in their compiler. At the very least, it must be possible to opt out of OpenMP support; whether it should instead be opt-in is an open question.
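One way to keep OpenMP optional at the source level is to guard both the runtime header and the directives on the standard `_OPENMP` macro, which compilers define only when OpenMP is enabled (e.g. with `-fopenmp`). The sketch below assumes that approach; `MaxThreads` and `SumOfSquares` are hypothetical names, not Drake functions.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#endif

// Reports how many threads a parallel region may use; 1 without OpenMP.
int MaxThreads() {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}

// Sums the squares of the entries of x. The loop runs in parallel when the
// file is compiled with OpenMP enabled and serially otherwise.
double SumOfSquares(const std::vector<double>& x) {
  double total = 0.0;
  const long n = static_cast<long>(x.size());
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : total)
#endif
  for (long i = 0; i < n; ++i) {
    total += x[i] * x[i];
  }
  return total;
}
```

The same source then builds on a toolchain without an OpenMP runtime (such as Apple's default compiler setup) with no changes, which is what makes an opt-out build option cheap to support.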

Concerns/risks

  • OpenMP directives in our code interact with Eigen's own OpenMP integration. Conservatively, safe combination relies on the use of the EIGEN_DONT_PARALLELIZE define to disable Eigen's built-in uses.

  • OpenMP may interact or conflict with commercial solvers such as Gurobi and Mosek. We use OpenMP with Snopt internally and have patched interaction issues that arose, but have not extensively used it with the other commercial solvers. Mosek uses Cilk, which shouldn't directly conflict with OpenMP in terms of shared memory, but will definitely cause some sort of resource contention in the case someone puts a call to Mosek in the body of a #pragma omp parallel for loop.

  • If we want to add Mac support, doing so would require either a different compiler (i.e. GCC or upstream Clang from homebrew) or the use of the -Xclang option to Apple's compiler and a separately-provided release-specific version of the OpenMP runtime library.

CI and release implications

@jwnimmer-tri has enumerated some of the support matrix we'll need to consider, accounting for user channel and build options:

User channel:

  • Source build (nightly, monthly)
  • GitHub binary tarball (nightly, monthly)
  • Homebrew binary cask (monthly)
  • Docker binary image (nightly, monthly)
  • Debian PPA binary w/sources (monthly)
  • Colab notebooks, likely via Debian PPA (monthly)

Build configs:

  • Gurobi on/off -- must be off for first-party binaries
  • Mosek on/off -- must be off for first-party binaries
  • Snopt on/off -- n.b. our first-party binaries turn this on, shrouded
  • Debug / Release / Coverage / Dynamic Analysis
  • Clang / GCC
  • Bionic / Focal / Catalina / Big Sur
  • OpenMP on/off
  • OpenCL on/off
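To give a sense of scale, the sketch below multiplies out the number of choices on each build-config axis listed above. It is an upper bound only: it treats every axis as independent, although some combinations are ruled out (for example, Gurobi and Mosek must be off for first-party binaries). The names are hypothetical.

```cpp
// Choices per axis, in the order of the list above: Gurobi, Mosek, Snopt,
// build type (Debug / Release / Coverage / Dynamic Analysis), compiler,
// platform (Bionic / Focal / Catalina / Big Sur), OpenMP, OpenCL.
constexpr int kAxisSizes[] = {2, 2, 2, 4, 2, 4, 2, 2};

// Multiplies the axis sizes to get the number of unconstrained permutations.
constexpr int CountBuildPermutations() {
  int total = 1;
  for (int i = 0; i < 8; ++i) {
    total *= kAxisSizes[i];
  }
  return total;
}

static_assert(CountBuildPermutations() == 1024,
              "8 * 4 * 2 * 4 * 2 * 2 == 1024");
```

Even before the six user channels are considered, that is 1024 nominal build permutations, so CI can only cover a chosen subset.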

We need to decide which channels will support (or require) the various build option permutations and what coverage must exist in CI. I am putting together a survey to gather feedback on which combinations of channel/build should be supported and tested.

cc @ggould-tri @jwnimmer-tri @jamiesnape @sherm1

@calderpg-tri calderpg-tri added component: distribution Nightly binaries, monthly releases, docker, installation component: continuous integration Jenkins, CDash, mirroring of externals, website infrastructure component: build system Bazel, CMake, dependencies, memory checkers, linters labels Apr 1, 2021
@calderpg-tri calderpg-tri self-assigned this Apr 1, 2021
@EricCousineau-TRI
Contributor

Moved from PR:

[...] but a quick sanity check probably is still worthwhile. (If it does have downsides, we might need the option to disable it.)

Perhaps OpenCV should be part of the checklist? (From brief shallow investigations here, I think it enables OpenCL by default; dunno about static vs. dynamic linking: repro/.../opencv_cvtcolor_slow)

@calderpg-tri
Contributor Author

From a basic test program that uses OpenCV and OpenCL, I don't see any problems combining OpenCV-with-OpenCL-enabled and OpenCL.

@calderpg-tri
Contributor Author

> So long as planning tools and externals are tested outside of Drake against a number of OpenCL implementations

To elaborate, right now I manually test changes to the OpenCL implementations against Nvidia, AMD, and Intel platforms. Nvidia and AMD platforms are amenable to testing through AWS via something like G4ad (AMD) and G4dn (Nvidia) instances, but I'm not aware of any instances that use Intel GPUs.

@jamiesnape
Contributor

jamiesnape commented Apr 14, 2021

> > So long as planning tools and externals are tested outside of Drake against a number of OpenCL implementations
>
> To elaborate, right now I manually test changes to the OpenCL implementations against Nvidia, AMD, and Intel platforms. Nvidia and AMD platforms are amenable to testing through AWS via something like G4ad (AMD) and G4dn (Nvidia) instances, but I'm not aware of any instances that use Intel GPUs.

Factoring in hardware and software revisions, we are nearing an intractable number of implementations. If we can support two implementations of a given version of OpenCL we are probably doing well. Realistically, the Intel version would be at the bottom of my heap of versions to test. Budget is going to play into what we test too. We have some slack, but there are only so many G4 variant instances we could run in a weekly cycle (I would prioritize NVIDIA, not least because they own ARM).

@calderpg-tri
Contributor Author

> Factoring in hardware and software revisions, we are nearing an intractable number of implementations. If we can support two implementations of a given version of OpenCL we are probably doing well.

I don't think we need to plan around testing on a range of (hardware x software) revisions - the most important part of testing on multiple platforms is to confirm that the OpenCL kernels build and something platform/implementation-specific doesn't sneak in. I think we can achieve that fine with a single example each of AMD and Nvidia.

> Realistically, the Intel version would be at the bottom of my heap of versions to test.

Unfortunately, it's quite possible that this is the most-used implementation due to laptops. That said, I've only run into an Intel implementation-specific issue once (an ambiguous call to sqrt), so I think for now we could require that the rare changes to OpenCL kernels get manually tested on Intel instead.

@jamiesnape
Contributor

> I don't think we need to plan around testing on a range of (hardware x software) revisions...

Yes, I just wouldn't want anyone to get a false sense of security from a given AWS instance type. OpenGL is hard enough, let alone OpenCL.

> Unfortunately, it's quite possible that this is the most-used implementation due to laptops.

True, but they probably have the least to gain from using OpenCL?

@calderpg-tri
Contributor Author

> > Unfortunately, it's quite possible that this is the most-used implementation due to laptops.
>
> True, but they probably have the least to gain from using OpenCL?

I have seen pretty solid speedups on NUCs and laptops for pointcloud voxelization and roadmap updating with OpenCL, especially on machines with fewer cores.

@jamiesnape
Contributor

Cool, nice to be proven wrong. Are there good gains with both the GPU and CPU implementations?

@calderpg-tri
Contributor Author

> Are there good gains with both the GPU and CPU implementations?

I haven't tried them against Intel's OpenCL-on-CPU implementation, only their two GPU implementations (older beignet and newer NEO/GCR) if that's what you're asking.

@jamiesnape
Contributor

Yes. FWIW, that may be a configuration we can handle on AWS.

@jwnimmer-tri jwnimmer-tri added priority: low and removed component: continuous integration Jenkins, CDash, mirroring of externals, website infrastructure component: build system Bazel, CMake, dependencies, memory checkers, linters labels Nov 10, 2021
@jwnimmer-tri
Collaborator

CC @xuchenhan-tri FYI, as this might relate to FEM simulations in the future as well.

@DamrongGuoy
Contributor

I can see this is a big change, but I believe it will open Drake to new fruitful opportunities. Cheers!

@jwnimmer-tri
Collaborator

jwnimmer-tri commented Feb 16, 2022

A few more notes from my digging...

For users who might use MKL's libblas (instead of Ubuntu libblas) at load-time, it seems like the obvious and good things will happen by default, and we can rely on OpenMP to sort out the details, per the MKL Developer Guide.


Mosek currently uses Cilk for the thread pool, but

> Mosek version 10 will no longer employ Cilk but most likely oneTBB. This will allow for a more fine grained control on threading.
> -- https://groups.google.com/g/mosek/c/x2pZnW0OJEo

For background docs and good tips, see:

When solving, possibly we should detect if we're within a parallel section (per omp_in_parallel) and then set MSK_IPAR_INTPNT_MULTI_THREAD to OFF automatically, or maybe we should just document the caveat and let users configure what they need. Maybe in MOSEK 10 it will be easier.
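A minimal sketch of the detection idea, assuming it is acceptable to query the OpenMP runtime at solve time. `InParallelRegion` and `SolverThreadLimit` are hypothetical helpers; no MOSEK or Gurobi API is called here, and the result would still need to be mapped onto each solver's own threading parameter.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// True when called from inside an active OpenMP parallel region.
// Always false when the file is compiled without OpenMP.
bool InParallelRegion() {
#ifdef _OPENMP
  return omp_in_parallel() != 0;
#else
  return false;
#endif
}

// Hypothetical policy: inside a parallel region, limit the solver to one
// thread to avoid oversubscription; otherwise honor the requested count.
int SolverThreadLimit(bool in_parallel_region, int requested_threads) {
  return in_parallel_region ? 1 : requested_threads;
}
```

A solver wrapper could evaluate `SolverThreadLimit(InParallelRegion(), requested)` just before solving and, for MOSEK, turn the multi-thread option off whenever the result is 1. The alternative raised above, documenting the caveat and leaving configuration to users, needs no code at all.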


Gurobi also consumes all threads on the machine by default:
https://www.gurobi.com/documentation/9.5/refman/threads.html

See also:
https://support.gurobi.com/hc/en-us/community/posts/360055837711-Solving-different-models-in-parallel-C-OpenMP-

I haven't yet found what kind of thread pool it's using.


5 participants