-
Notifications
You must be signed in to change notification settings - Fork 124
Move CUDA sources from intel/llvm to UR #896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move CUDA sources from intel/llvm to UR #896
Conversation
This moves the CUDA plugin implementation to Unified Runtime; and changes the pi_cuda plugin to use pi2ur to implement PI. The changes to the implementation have been kept to a minimum and should be functionally the same. Documentation and comments have been moved verbatim, other than changing PI references to UR. This PR is based on top of the Level Zero adapter (#8744) so will only be ready when that is merged. --------- Co-authored-by: Petr Vesely <petr.vesely@codeplay.com> Co-authored-by: Omar Ahmed <omar.ahmed@codeplay.com> Co-authored-by: Martin Morrison-Grant <martin.morrisongrant@codeplay.com> Co-authored-by: Aaron Greig <aaron.greig@codeplay.com>
Resolves the warnings as errors reported in [post merge](https://github.com/intel/llvm/actions/runs/5266121277/jobs/9519634360) as a result of merging #9512. Additionally move pre-processor guards to resolve unused global variables which would also fail in this build configuration (clang & SYCL_ENABLE_WERROR=ON).
…(#9938) * Call to `hipEventElapsedTime` return `hipErrorNotReady` when the timestamp has not yet been `recorded` on one or both events. Calling `hipEventSynchronize` block until the event is ready. * The issue showed itself when profiling sycl-blas benchmark. * Enable support for cuda / hip in event_profiling_info.cpp
…RSION (#9873) This should have been an obvious update of Unified Runtime tag to support UR_DEVICE_INFO_IP_VERSION, required in intel/llvm#9843 (just tag update, nothing else), but it also brought many API breaks caused by this patch: oneapi-src#536. So the current PR updates our codebase in accordance with changed UR API. --------- Signed-off-by: Dmitry Vodopyanov <dmitry.vodopyanov@intel.com> Co-authored-by: Callum Fare <callum@codeplay.com> Co-authored-by: Jaime Arteaga <jaime.a.arteaga.molina@intel.com>
This fixes a change in `hasBeenSynchronized` accidentally introduced during the UR port
…(#10055) * Reverts back the update to calculation of threads per block for 0th dimension when primary ranges are involved. That could cause out of range access.
Includes a large number of missing `pi_result` mappings, so previously reported errors should no longer map to just `PI_ERROR_UNKNOWN`. NFCI for the adapters.
These checks are causing issues for very large USM allocations because the `MAX_MEM_ALLOC_SIZE` reported is lower than what CUDA actually supports. We will follow up with an update on the reported `MAX_MEM_ALLOC_SIZE`, but it makes sense to remove the checks either way, as the CUDA allocation functions will return an error if they can't allocate the memory.
# Level Zero Backend Support for SYCL Graphs This is the second patch of a series that adds support for an [experimental command graph extension](intel/llvm#5626) A snapshot of the complete work can be seen in draft PR #9375 which has support all the specification defined ways of adding nodes and edges to the graph, including both Explicit and Record & Replay graph construction. The two types of nodes currently implemented are kernel execution and memcpy commands. See https://github.com/reble/llvm#implementation-status for the status of our total work. ## Scope This second patch focuses on the required PI/UR support for the experimental command-buffer feature in the Level Zero adapter: * PI stubs for all adapters to enable compilation, no functionality. * Command-buffer implementation for the Level Zero UR adapter. * Stubs for the CUDA UR adapter to enable compilation, no functionality. ## Following Split PRs Future follow-up PRs with the remainder of our work on the extension will include: * Hooking up backend to graphs runtime, bugfixes and other feature additions, will add symbols but not break the ABI. (3/4) * Add end-to-end tests for SYCL Graph extension. (4/4) * NFC changes - Design doc and codeowner update. ## Authors Co-authored-by: Pablo Reble <pablo.reble@intel.com> Co-authored-by: Julian Miller <julian.miller@intel.com> Co-authored-by: Ben Tracy <ben.tracy@codeplay.com> Co-authored-by: Ewan Crawford <ewan@codeplay.com> Co-authored-by: Maxime France-Pillois <maxime.francepillois@codeplay.com> --------- Co-authored-by: Ewan Crawford <ewan@codeplay.com> Co-authored-by: Maxime France-Pillois <maxime.francepillois@codeplay.com>
The destructor should be calling `cuDevicePrimaryCtxRelease`
Fixes error found in [post-commit CI](https://github.com/intel/llvm/actions/runs/5454766342/jobs/9925392005) after the merge of intel/llvm#9992 ``` /__w/llvm/llvm/src/sycl/plugins/hip/pi_hip.cpp:5635:24: error: unused parameter 'sync_point' [-Werror,-Wunused-parameter] 5635 | pi_ext_sync_point *sync_point) { | ^ /__w/llvm/llvm/src/sycl/plugins/hip/pi_hip.cpp:5691:12: error: unused parameter 'dst_row_pitch' [-Werror,-Wunused-parameter] 5691 | size_t dst_row_pitch, size_t dst_slice_pitch, | ^ /__w/llvm/llvm/src/sycl/plugins/hip/pi_hip.cpp:5691:34: error: unused parameter 'dst_slice_pitch' [-Werror,-Wunused-parameter] 5691 | size_t dst_row_pitch, size_t dst_slice_pitch, | ^ 3 errors generated. ```
This patch re-introduces the fix from intel/llvm#8765 Which seems to have been accidentally dropped by the UR port.
… classes (#10104) This PR fixes an issue with overhead when calling the same kernel multiple times in a loop. Right now, some calls to CUDA API happen every time the kernel is invoked to query the same information. Those calls were moved to `device` and `kernel` constructors and the info was cached as private members of those classes.
These checks are already performed by the UR validation layer
This implements the current extension doc from intel/llvm#6104 in the CUDA backend only. Fixes intel/llvm#7543. Fixes intel/llvm#6749. --------- Signed-off-by: JackAKirk <jack.kirk@codeplay.com> Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com> Co-authored-by: JackAKirk <chezjakirk@gmail.com> Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>
Currently if `cuInit` fails while getting platforms it returns UR_RESULT_SUCCESS. Instead we use `UR_CHECK_ERROR` to return the correct error code.
This was preventing out-of-tree build of the adapter for standalone use with unified runtime. Signed-off-by: Piotr Balcer <piotr.balcer@intel.com>
…s images extension (#10112) # Experimental Implementation of SYCL Bindless Images Extension This commit stands as the second commit of four to make code review easier, implementing revision 4 of the [bindless images extension proposal](intel/llvm#9842). ## Scope This PR covers changes made to the PI and the UR. This includes - Extending PI with extension functions - Updating UR FetchContent commit and implementing [UR bindless images experimental features](https://oneapi-src.github.io/unified-runtime/core/EXP-BINDLESS-IMAGES.html) on the CUDA adaptor ## Following Split PRs - [3/4] Implement the user-facing SYCL extension - [4/4] Add tests ## Authors Co-authored-by: Isaac Ault <isaac.ault@codeplay.com> Co-authored-by: Hugh Bird <hugh.bird@codeplay.com> Co-authored-by: Duncan Brawley <duncan.brawley@codeplay.com> Co-authored-by: Przemek Malon <przemek.malon@codeplay.com> Co-authored-by: Chedy Najjar <chedy.najjar@codeplay.com> Co-authored-by: Sean Stirling <sean.stirling@codeplay.com> Co-authored-by: Peter Zuzek <peter@codeplay.com>
In CUDA objects are represented as integers rather than opaque handles. This patch fixes a segmentation fault when creating a device handle from a native handle by avoiding dereferencing a pointer which should be treated as an integer.
- Fixed compiler errors/warnings related to unused and uninitialized variables and parameters. Post-commit fix for PR: intel/llvm#10112
After the recent device partition changes in the UR spec (i.e. ur_device_partition_property_t), level_zero, cuda and hip adapters are returning incorrect values and types for UR_DEVICE_INFO_SUPPORTED_PARTITIONS and UR_DEVICE_INFO_PARTITION_TYPE. This PR fixes this issues with the adapters and updates pi2ur to correctly convert between ur_device_partition_properties_t and pi_device_partition_property.
…#9294) This change adds a SYCL interface to the Level Zero APIs zexDriverImportExternalPointer and zexDriverReleaseImportedPointer. These functions are used for importing host memory into USM for the duration of data transfer to increase bandwidth.
Cache the max local mem size so that we can call less CUDA driver entry points at `urEnqueueKernelLaunch`. Also allows us to query the value set for `SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE` using `device.get_info<sycl::info::device::local_mem_size>()`
During the port to UR the CUDA and HIP PI plugin ABI's were unintentionally changed. There does not appear to be symbol checks for these plugins, unlike the [Level Zero symbol check](https://github.com/intel/llvm/blob/sycl/sycl/test/abi/pi_level_zero_symbol_check.dump) and [OpenCL symbol check](https://github.com/intel/llvm/blob/sycl/sycl/test/abi/pi_opencl_symbol_check.dump). As such, the ABI change went unnoticed until intel/llvm#10490 was opened using the same approach for the OpenCL port, which [failed](https://github.com/intel/llvm/actions/runs/5610646255/job/15200624025?pr=10490) the OpenCL symbol check. This PR restores the expected ABI for the CUDA and HIP plugins and introduces new CUDA and HIP symbol check tests.
Bump the Unified Runtime commit, and make adapter changes needed for the newly added adapter handles (see oneapi-src#715 for details) This fixes #10066 by providing an implementation of `piPluginGetLastError` in pi2ur.
The `UR_RESULT_ADAPTER_SPECIFIC_ERROR` was not returning an error to the SYCL RT which meant all errors were treated as warnings and ignored unless `SYCL_RT_WARNING_LEVEL` is set to geq 2. This changes things so the adapter specific error is now reported as such, meaning all uses `UR_RESULT_ADAPTER_SPECIFIC_ERROR` meant as warnings are now caught as errors. --------- Co-authored-by: Hugh Delaney <hughd94@gmail.com>
In the CUDA/HIP adapters `urKernelSetArgValue` was being used to implement both `urKernelSetArgValue` & `urKernelSetArgLocal`. However, if the validation layer is enabled in UR then the path to set local arg is never taken since it includes a check that `pArgValue` is not null. This PR: * Implements `urKernelSetArgLocal` for CUDA/HIP adapters * Changes `pi2ur` to call `urKernelSetArgLocal` when `arg_value` is `nullptr` * Implements `urKernelSetArgLocal` for L0 adapter - this just calls back into `urKernelSetArgValue`.
Fix the license headers at the top of each source file in the unified runtime directory. --------- Co-authored-by: Alexey Bader <alexey.bader@intel.com>
This PR adds missing functions in the hip backend to allow for interoperability in programs that create sycl objects from native hip objects. The new function implementations are: - `make_device` - `make_queue` - `make_event` Note that it would really make sense for intel/llvm#10491 to be merged first because this PR makes the same code change in pi2ur, for a fix that is attributed to #10491. --------- Signed-off-by: Jack Kirk <jack.kirk@codeplay.com>
intel/llvm#10691 missed these extra cases. This PR adds the needed ifdefs
… (#11023) Resolve same issue fixed in [PR](intel/llvm#10034) Co-authored-by: Omar Ahmed <omar.ahmed@codeplay.com>
Lots of hip/cu driver API calls were wrapped in `ur::assertion(res == CU_SUCCESS)` etc which: - Means that any native error messages returned from the affected driver api calls were lost. - Since these APIs report errors asynchronously, such that they are thrown from the last API call rather than the call which led to the error, previous asynchronous error messages from different APIs to the ones wrapped by the `ur::assertion` could also be lost depending on user code. These problems are fixed by swapping these assertions with `UR_CHECK_ERROR`. Note that in the future UR may want to adjust `UR_CHECK_ERROR` so that it throws `UR_RESULT_ERROR_ADAPTER_SPECIFIC` etc instead of using `std::cerr` etc to report the error etc. But I think it makes sense to still use `UR_CHECK_ERROR` to wrap driver API calls because it means that the __LINE__, __FUNCTION__ etc info can be correctly passed to native error reporting. --------- Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
- Support was added for the following image channel types: - `unorm_int8` - `unorm_int16` - `snorm_int8` - `snorm_int16` - Reading these types through `read_image` returns the denormalized floating point data. - A test was added for these new types. - Support for the following packed normalized image channel types was removed from the proposal: - `unorm_short_565` - `unorm_short_555` - `unorm_int_101010` - This was done due to lack of device support. If the need for these types arises in the future, we can revisit support for these types.
Fetch the latest revision of unified runtime: Notable changes * Some command buffer entry-points have been renamed, also 2 additions of membufferFill and USMFill * UrInit/urTearDown have been removed - replaced with loader only versions.
This change adds a new aspect for esimd, `ext_intel_esimd`, and annotates the two fundamental esimd classes, `simd_obj_impl` and `simd_view_impl` with the `uses_aspect` attribute. `simd_obj_impl` is the base class of `simd` and `simd_mask` which are the fundamental user-facing classes. `simd_view_impl` is the base of only `simd_view`. `simd_obj_impl` is not a base of `simd_view_impl`, but every `simd_view_impl` requires a reference to a `simd` or `simd_mask` at construction time, so I am not sure if we truly need to annotate `simd_view_impl`, but I added it to be safe. It also adds a new PI device info query, `PI_EXT_INTEL_DEVICE_INFO_ESIMD_SUPPORT` that is used to query at runtime if a device supports ESIMD. For UR-based plugins, we map that to `UR_DEVICE_INFO_ESIMD_SUPPOR`. The implementation simply returns false for cuda, hip and native_cpu. For l0 and opencl, we check that the device is an intel gpu by querying the device type is gpu and the vendor id is `0x8086`. For ESIMD emulator we simply return true. I would appreciate careful review on the plugin changes in particular, as I am not an expert. This change also updates the esimd spec to document the new aspect. In a future change, I plan to use the new aspect to remove the requirement for the `SYCL_ESIMD_FUNCTION` and `SYCL_ESIMD_KERNEL` macros that set function attributes, but I am not doing that as part of this PR as it requires more investigation. --------- Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Reverts intel/llvm#11155. I'm almost sure this is what broke our CI for the past few days.
- Return correct error code in urContextCreateWithNativeHandle - Add UR_DEVICE_INFO_VIRTUAL_MEMORY_SUPPORT to urDeviceGetInfo - Move asserts from urEnqueueMemBufferFill to UR validation layers (commit 2c533e6 on UR repository) - Make behaviour of urEventSetCallback consistent with other unsupported entrypoints. - urProgramGetInfo and urQueueGetInfo now return UR_RESULT_ERROR_UNSUPPORTED_ENUMERATION for unsupported cases. - Add entrypoints for urSamplerGetNativeHandle and urSamplerCreateWithNativeHandle - Fix UR_USM_ALLOC_INFO_BASE_PTR and UR_USM_ALLOC_INFO_SIZE implementation.
`UR_CHECK_ERROR` was designed to return `ur_result_t`, however in practice it was guaranteed to only ever return `UR_RESULT_SUCCESS`, as other paths would either terminate, abort or throw. This in turns leads to poor quality/error prone code, as the codebase was littered with: * statements not checking the return value - depending on the compiler generating a warning, * extra check on the return which was only ever going to be true. Some care was required, as the codebase has a habit of accumulating err codes across branches, so depending on the use case the initial value of `ur_result_t Result`s had to be set accordingly (now that `UR_CHECK_ERROR` does not return).
8879199 to
245afd0
Compare
245afd0 to
3accba7
Compare
| endif() | ||
|
|
||
| if(UMF_ENABLE_POOL_TRACKING) | ||
| target_compile_definitions("ur_adapter_cuda" PRIVATE UMF_ENABLE_POOL_TRACKING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add this option in the readme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I know the UMF team are reworking stuff so that this option isn't needed, hopefully its days are numbered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already an option in UR. It was just not being used in the adapter
| @@ -0,0 +1,89 @@ | |||
| //===--------- adapter.cpp - CUDA Adapter ---------------------------------===// | |||
| // | |||
| // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These licenses will need updating too, but a follow up PR is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add this in a follow up PR
a5c8d9e to
9d35c99
Compare
e48c9b8 to
e04e2f9
Compare
This PR moves the CUDA adapter for the intel/llvm repository to the unified runtime repository. This was done using git subtrees to preserve the commit history.
A follow-up PR on intel/llvm will remove the sources from that repo.