[CUDA][HIP] Use device to get native context #425
Rbiessy merged 1 commit into uxlfoundation:develop from
Conversation
Reading your changes, I have a question. For example: `auto cudaDevice = sycl::get_native<sycl::backend::ext_oneapi_cuda>(queue.get_device());` Is the type of cudaDevice `CUdevice`?

Hi @jinz2014, yes you are correct!
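For completeness, a minimal sketch of the interop call discussed above — assuming a DPC++ build with the CUDA backend enabled and a CUDA GPU present; the queue and variable names are illustrative, not part of this PR:

```cpp
#include <sycl/sycl.hpp>
#include <cuda.h>

int main() {
    // Select a GPU device (assumes a CUDA-backed device is available).
    sycl::queue queue{sycl::gpu_selector_v};

    // get_native on a SYCL *device* with the ext_oneapi_cuda backend
    // yields a CUdevice, which can then be passed to CUDA driver API
    // calls such as cuDevicePrimaryCtxRetain.
    CUdevice cudaDevice =
        sycl::get_native<sycl::backend::ext_oneapi_cuda>(queue.get_device());
    (void)cudaDevice;
    return 0;
}
```

Note that this requires hardware and a CUDA-enabled DPC++ toolchain to actually run.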
cufft_run.txt
AMD tests for LAPACK and BLAS are all passing. 8 NVIDIA LAPACK tests are failing on a GTX 1050, but these tests are also failing without this change; NVIDIA BLAS tests are passing.
I see all the buffer tests failing. Logs:
The failures are not caused by the changes in this PR, but rather by a recent change in the compiler. All these tests are expected to pass once oneapi-src/unified-runtime#1226 and intel/llvm#12297 are merged.
Rbiessy left a comment:
Would you be able to attach test logs again? Also, does this change compile with the 2024.0 icpx release?
// Getting the primary context also sets it as the active context
CUDA_ERROR_FUNC(cuDevicePrimaryCtxRetain, err, &desired, cudaDevice);
Should we expect a performance cost from this change? From what I understand, cuCtxSetCurrent was previously expected to be called only once, assuming the active context was not changed outside of oneMKL.
This constructor is called once before each call to a BLAS function, so I am wary that the cost may add up.
The cost of cuDevicePrimaryCtxRetain is minimal once the primary context is no longer being initialized for the first time, which should be the case here. A simple benchmark like this:

for (int i = 0; i < NUM_ITERATIONS; i++) {
    CHECK(cuDevicePrimaryCtxRetain(&context, device));
    CHECK(cuDevicePrimaryCtxRelease(device));
}

gives 32 ns per loop iteration, so calls to these functions are almost free.
Using this setup:
$ nvidia-smi
Mon Mar 25 16:27:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 Ti On | 00000000:01:00.0 Off | N/A |
| 31% 24C P8 N/A / 75W | 14MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
hjabird left a comment:
Generally LGTM. I've tested the rocFFT and cuFFT backends with DPC++ 2024.0's icpx.
It would be good to see logs again after the rebase as Rbiessy suggests.
SYCL contexts have a many-to-one mapping to native contexts. It is therefore necessary to get the desired native context from a SYCL device, as SYCL devices have a one-to-one mapping to native contexts.
Some test results: CUDAgtx1050.txt HIPgfx90a_oneMKL_test.txt
I am not sure how to build/run the FFT tests. Are there some build/test instructions that I can follow?
Fixed. I can successfully build this branch with icpx 2024.0.2 for CUDA.
Thanks a lot @hdelan! The instructions are here but need to be improved. The short answer is that you should just need to add
Thanks @Rbiessy! Building rocFFT is broken for me, but this PR does not touch that code. Building with cuFFT is OK. Here are updated test results for all of oneMKL for CUDA, including cuBLAS, cuFFT, cuRAND, and cuSOLVER:
Thanks! LGTM
Since oneapi-src/unified-runtime#999 it is no longer valid to get the native context from the SYCL context on a multi-GPU system. The get_native function for contexts has been deprecated for this reason. See intel/llvm#10975.
Similar ticket: uxlfoundation/oneDNN#1765
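The overall change can be sketched as follows — a hedged illustration, not the exact PR code. The deprecated path queried the native context from the SYCL context, while the new path goes through the SYCL device and the CUDA primary context (assumes a DPC++ build with the CUDA backend; the helper name is hypothetical):

```cpp
#include <sycl/sycl.hpp>
#include <cuda.h>

// Illustrative sketch of obtaining the native CUDA context for a queue.
CUcontext get_native_context(sycl::queue &queue) {
    // Deprecated approach: ambiguous on multi-GPU systems, since a SYCL
    // context may map to more than one native context:
    //   auto ctx = sycl::get_native<sycl::backend::ext_oneapi_cuda>(
    //       queue.get_context());

    // Current approach: a SYCL device maps to exactly one native context,
    // so get the CUdevice and retain its primary context. Retaining the
    // primary context also sets it as the active context, so no separate
    // cuCtxSetCurrent call is needed.
    CUdevice cudaDevice =
        sycl::get_native<sycl::backend::ext_oneapi_cuda>(queue.get_device());
    CUcontext desired;
    cuDevicePrimaryCtxRetain(&desired, cudaDevice);
    return desired;
}
```

Each cuDevicePrimaryCtxRetain should eventually be matched by a cuDevicePrimaryCtxRelease, as in the benchmark loop earlier in the thread.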