Skip to content

[BLAS::SYCL-BLAS backend] New BLAS backend: SYCL-BLAS#262

Merged
mkrainiuk merged 20 commits intouxlfoundation:developfrom
codeplaysoftware:sycl-blas-backend-staging
May 10, 2023
Merged

[BLAS::SYCL-BLAS backend] New BLAS backend: SYCL-BLAS#262
mkrainiuk merged 20 commits intouxlfoundation:developfrom
codeplaysoftware:sycl-blas-backend-staging

Conversation

@hjabird
Copy link
Contributor

@hjabird hjabird commented Dec 30, 2022

Description

  • Add SYCL-BLAS as a backend for OneMKL.
  • SYCL-BLAS is a pure-SYCL, open-source, BLAS implementation.

Checklist

All Submissions

  • [ x ] Tests: Logs attached below
    • syclblas_ct.txt
    • syclblas_rt.txt
    • The failures for double-precision functions on an iGPU that does not support double precision. In these cases, the backend throws mkl::unsupported_device, which results in a test failure.
    • The tests were run with --gtest_filter=-ImatcopyBatchStrideTestSuite* because this test suite segfaults. This function is not implemented in SYCL-BLAS.
  • [ x ] Have you formatted the code using clang-format?

* Adds SYCL-BLAS backend to OneMKL
  * SYCL-BLAS is an pure-SYCL opensource BLAS implementation
* SYCL-BLAS is added as a header-only library

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
Co-authored-by: Muhammad Tanvir <muhammad.tanvir@codeplay.com>
Co-authored-by: Kumudha Narasimhan <kumudha.narasimhan@codeplay.com>
Co-authored-by: Fabio Mestre <fabio.mestre@codeplay.com>
Co-authored-by: Paolo Gorlani <paolo.gorlani@codeplay.com>
Co-authored-by: Mehdi Goli <mehdi.goli@codeplay.com>
@andrewtbarker
Copy link
Contributor

Thank you for the PR! Can you provide more details on the segfaults? There are several other routines that are not implemented in SYCL-BLAS but they do not segfault (omatcopy, omatadd).

@hjabird
Copy link
Contributor Author

hjabird commented Jan 4, 2023

@andrewtbarker

Thank you for the PR! Can you provide more details on the segfaults? There are several other routines that are not implemented in SYCL-BLAS but they do not segfault (omatcopy, omatadd).

The only segfaulting test is ImatcopyBatchStrideTestSuite, for complex single precision with row-major. I'm compiling DPC++ with the 2022-06-10 nighty on Ubuntu. double free or corruption and then Aborted (core dumped) are encountered after the test catches the unimplemented exception, and after the test<elem_t> function has executed:

__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff6bd2859 in __GI_abort () at abort.c:79
#2  0x00007ffff6c3d26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff6d67298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007ffff6c452fc in malloc_printerr (str=str@entry=0x7ffff6d69690 "double free or corruption (!prev)") at malloc.c:5347
#4  0x00007ffff6c46fac in _int_free (av=0x7ffff6d9cb80 <main_arena>, p=0x13b2d70, have_lock=<optimised out>) at malloc.c:4317
#5  0x00000000009e5095 in (anonymous namespace)::ImatcopyBatchStrideTests_ComplexSinglePrecision_Test::TestBody() ()
#6  0x00007ffff776b518 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
   from /home/hugh/dev/onemkl/build/lib/libgtest.so
#7  0x00007ffff774894f in testing::Test::Run() () from /home/hugh/dev/onemkl/build/lib/libgtest.so
#8  0x00007ffff774a04d in testing::TestInfo::Run() () from /home/hugh/dev/onemkl/build/lib/libgtest.so
#9  0x00007ffff774a800 in testing::TestSuite::Run() () from /home/hugh/dev/onemkl/build/lib/libgtest.so
#10 0x00007ffff77581f3 in testing::internal::UnitTestImpl::RunAllTests() () from /home/hugh/dev/onemkl/build/lib/libgtest.so
#11 0x00007ffff776c3f8 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) () from /home/hugh/dev/onemkl/build/lib/libgtest.so
#12 0x00007ffff7757df9 in testing::UnitTest::Run() () from /home/hugh/dev/onemkl/build/lib/libgtest.so
#13 0x0000000000ab4e16 in main ()

If I comment out the call to imatcopy_ref, there is no segfault, so I think imatcopy_ref is corrupting the heap. It isn't clear to me why this seems to occur for complex-single-precision row-major and not for real matrices.

If I was guessing as to the location of the bug, I'd look into

    lda = std::max(m, n);
    ldb = std::max(m, n);

which I don't understand at the moment - the imatcopy docs and imatcopy batched strided docs make this parameter look rather more complicated.

Copy link
Contributor

@andrewtbarker andrewtbarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still working my way through the code, so there could be more comments, but here are a few to start:

  1. This could be a very nice addition to the project!
  2. Does SYCL-BLAS not support USM memory at all? I am not sure of the long-term future of the buffer interfaces so this could be a concern.
  3. I cannot reproduce the segfault for imatcopy batch stride.
  4. I am seeing quite a few failures in my local testing (integrated Intel GPU). I have not investigated these closely, but I'm attaching the log just in case the list of failures means anything.

test.txt

@hjabird
Copy link
Contributor Author

hjabird commented Jan 5, 2023

I'm still working my way through the code, so there could be more comments, but here are a few to start:

1. This could be a very nice addition to the project!

2. Does SYCL-BLAS not support USM memory at all? I am not sure of the long-term future of the buffer interfaces so this could be a concern.

3. I cannot reproduce the segfault for imatcopy batch stride.

4. I am seeing quite a few failures in my local testing (integrated Intel GPU). I have not investigated these closely, but I'm attaching the log just in case the list of failures means anything.

test.txt

  1. Thank you!
  2. SYCL-BLAS doesn't support USM memory at the moment, but it will support USM in the future. We will add support USM in the backend when SYCL-BLAS's implementation is ready.
  3. Interesting. I don't think this is an issue introduced in this PR, but something to keep track of. If I can find time, I'll investigate more with other backends and create an issue for this.
  4. Your failures are all of the functions that SYCL-BLAS implements. If you could provide more details on how you compiled this, and the drivers you're using, that would help us reproduce the issue.

Fix compile time dispatch testing when using sycl-blas
@andrewtbarker
Copy link
Contributor

andrewtbarker commented Jan 5, 2023

Typical error message from my failures:

1: [ RUN      ] Nrm2TestSuite/Nrm2Tests.RealSinglePrecision/Column_Major_Intel_R__Iris_R__Pro_Graphics_580__0x193b_
1: Native API failed. Native API returns: -997 (Command failed to enqueue/execute) -997 (Command failed to enqueue/execute)
1: Enqueue process failed. -59 (CL_INVALID_OPERATION)
1: Enqueue process failed. -59 (CL_INVALID_OPERATION)
1: unknown file: Failure
1: C++ exception with description "Enqueue process failed. -59 (CL_INVALID_OPERATION)" thrown in the test body.
1: [  FAILED  ] Nrm2TestSuite/Nrm2Tests.RealSinglePrecision/Column_Major_Intel_R__Iris_R__Pro_Graphics_580__0x193b_, where GetParam() = (0x3162320, 1-byte object <00>) (222 ms)

Driver version: 22.08.22549

I'm using the icx compiler from the 2022.2 base toolkit, and building only the SYCL-BLAS backend.

This is quite possibly a bad configuration locally, I'm investigating more closely.

@andrewtbarker
Copy link
Contributor

I have the tests passing on a different machine, so never mind the failures I saw. Overall I think this PR is looking good.

@mmeterel
Copy link
Contributor

mmeterel commented Jan 6, 2023

I have the tests passing on a different machine, so never mind the failures I saw. Overall I think this PR is looking good.

@andrewtbarker Do we have any plans in place for testing this backend in terms of machines, compilers as a part of CI? If not, I think, we should make that plan and validate the PR on all configurations (if we have not done it yet).

@andrewtbarker
Copy link
Contributor

@mmeterel The PR in its current form is only supported with dpc++ compiler and Intel GPU. There is some discussion about the eventual scope for this, we'll see. I will try to document the scripts for the install and test process so we can automate it in the future.

Enable multiple devices type with the SYCL-BLAS backend.

Co-authored-by: Hugh Bird <hugh.bird@codeplay.com>
@Rbiessy
Copy link
Contributor

Rbiessy commented Jan 26, 2023

Logs of the SYCL-BLAS backend on different devices:
sycl-blas-tests-intel.log
sycl-blas-tests-nvidia_gpu.log
sycl-blas-tests-amd_gpu.log

Copy link
Contributor

@andrewtbarker andrewtbarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generally looking pretty good. I have some comments below about the documentation (which I think could generally be clearer for a non-expert user) and I'm running some tests now.

@andrewtbarker
Copy link
Contributor

Is this intended to skip all row_major tests?

@Rbiessy
Copy link
Contributor

Rbiessy commented Jan 30, 2023

Is this intended to skip all row_major tests?

Yes, SYCL-BLAS does not support row major operations currently. We are planning to support this in the future.

@andrewtbarker
Copy link
Contributor

On nVidia hardware I'm seeing lots of errors of this form:

949: Note: Google Test filter = Nrm2TestSuite/Nrm2Tests.RealDoublePrecision/Column_Major_NVIDIA_TITAN_RTX
949: [==========] Running 1 test from 1 test suite.
949: [----------] Global test environment set-up.
949: [----------] 1 test from Nrm2TestSuite/Nrm2Tests
949: [ RUN      ] Nrm2TestSuite/Nrm2Tests.RealDoublePrecision/Column_Major_NVIDIA_TITAN_RTX
949: Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY) -42 (PI_ERROR_INVALID_BINARY)
949: Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)
949: Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)
949: unknown file: Failure
949: C++ exception with description "Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)" thrown in the test body.
949: [  FAILED  ] Nrm2TestSuite/Nrm2Tests.RealDoublePrecision/Column_Major_NVIDIA_TITAN_RTX, where GetParam() = (0x1b6bf90, 1-byte object <00>) (77 ms)
949: [----------] 1 test from Nrm2TestSuite/Nrm2Tests (77 ms total)
949: 
949: [----------] Global test environment tear-down
949: [==========] 1 test from 1 test suite ran. (77 ms total)
949: [  PASSED  ] 0 tests.
949: [  FAILED  ] 1 test, listed below:
949: [  FAILED  ] Nrm2TestSuite/Nrm2Tests.RealDoublePrecision/Column_Major_NVIDIA_TITAN_RTX, where GetParam() = (0x1b6bf90, 1-byte object <00>)
949: 
949:  1 FAILED TEST

I'll dig in a little more and let you know what I find.

@Rbiessy
Copy link
Contributor

Rbiessy commented Jan 31, 2023

@andrewtbarker this looks like an error with the wrong target triple?
For reference I tested this on an RTX 2060 with the CMake command line cmake -Bbuild -GNinja -DENABLE_SYCLBLAS_BACKEND=ON -DREF_BLAS_ROOT=/path/to/lapack/install -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DTARGET_DOMAINS=blas -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_MKLGPU_BACKEND=OFF -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda" and a build of intel/llvm with CUDA support.

@andrewtbarker
Copy link
Contributor

@Rbiessy that's exactly it, thanks. Do you think we can modify the way FindCompiler.cmake sets the -fsycl-targets argument so it's similar with SYCL-BLAS and the other backends (see here) ?

@Rbiessy
Copy link
Contributor

Rbiessy commented Jan 31, 2023

@andrewtbarker I am not keen on adding more CMake flags specific to SYCL-BLAS. One difference with the variables in FindCompiler.cmake is that we can enable multiple targets for SYCL-BLAS which is an important feature. I think it would be confusing for multiple targets. This requires more work to automate this on the CMake level so could be worth doing in a separate MR if we decide this is needed?

@andrewtbarker
Copy link
Contributor

Would it be better if we add the -fsycl-targets configuration in backends/syclblas/CMakeLists.txt? Then we can do it based on the already existing ENABLE_SYCLBLAS_BACKEND_* cmake variables.

At the very least this needs to be documented much better. I did not know the explicit CMAKE_CXX_FLAGS would be needed until you told me today - how will users know how to configure this backend?

@Rbiessy
Copy link
Contributor

Rbiessy commented Feb 1, 2023

It's not possible to set -fsycl-targets in backends/syclblas/CMakeLists.txt. ENABLE_SYCLBLAS_BACKEND_* are internal variables, they will be overwritten if set on the command line (and are not visible when listing the available options).

We are documenting CMAKE_CXX_FLAGS in docs/building_the_project.rst. They are meant as advanced options for now but we could improve this later if needed.

@andrewtbarker
Copy link
Contributor

I made a PR into this branch to show what I have in mind. It works fine on our nvidia GPU, have not tried it with AMD.

For documentation, do I understand the following correctly:

  1. Intel GPU is supported without -fsycl-targets option;
  2. nVidia and AMD GPU require an explicit -fsycl-targets option (this is my experience);
  3. Enabling multiple GPU types will also require explicit -fsycl-targets;
  4. SYCL-BLAS can only tune for a single GPU;
  5. Tuning is strongly recommended in the SYCL-BLAS docs for performance.

If I'm right on these points (or even if I'm wrong), some of this needs to be clarified in building_the_project.rst.

@Rbiessy
Copy link
Contributor

Rbiessy commented Feb 2, 2023

I have updated the documentation according https://github.com/codeplaysoftware/oneMKL/pull/16 will be merged.
I am trying to keep it generic. For instance the default sycl-target is spir64 so that should run on a wide range of devices if they have the right drivers (including Nvidia or AMD).

@andrewtbarker
Copy link
Contributor

This is much better, thank you. Sorry for the churn and confusion, I just want to make sure that if we claim to support multiple HW it's clear to users how to use them. I'm ready to approve this once the above PR (or something similar) is merged.

[BLAS] Automate -fsycl-targets argument for SYCL-BLAS backend
Copy link
Contributor

@andrewtbarker andrewtbarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your wok on this!

@andrewtbarker
Copy link
Contributor

@mkrainiuk Can you or someone from engine team take a look at this when you get a chance?

@muhammad-tanvir-1211
Copy link
Contributor

I'm still working my way through the code, so there could be more comments, but here are a few to start:

3. I cannot reproduce the segfault for imatcopy batch stride.

Hi @andrewtbarker, please take a look at #281 which explains how to reproduce the imatcopy seg faults and a potential fix for these seg faults.

@muhammad-tanvir-1211
Copy link
Contributor

Hi @andrewtbarker @mkrainiuk is there anything else that you would like us to add in this PR?

@andrewtbarker
Copy link
Contributor

Hi @muhammad-tanvir-1211 , from my perspective this is ready to go but we need a second review to merge. Our team is right in the middle of a busy process right now, sorry for the delay.

Copy link
Contributor

@mkrainiuk mkrainiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! I have several questions to the implementation.
I also want to clarify the long-term plan for this backend. As it's header based opensource solution I think it could improve the usability if it can be fully integrated to the oneMKL project. In this case we can skip the argument conversion because of different API and reduce the execution time.

netlib,
rocblas,
rocrand,
syclblas,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we build new backend with the name of what device arch it is targeted for? E.g., syclblas_intelgpu, syclblas_nvidiagpu? With the current implementation it's not possible to have syclblas backends built for different targets at the same time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we had a different idea to tuned for multiple targets. We were planning to keep only one backend syclblas and improve SYCL-BLAS to allow to tune for multiple targets at the same time. Maybe we could also make sure the translation unit are split and SYCL-BLAS is included multiple times with different macro definitions for the tuning.
I don't see the benefit of adding multiple syclblas backends currently. In any case this needs more work to be supported, is it important to support for the first PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, I'm just trying to clarify the long-term support. If the dispatching between different HW will happen in the backend itself it's fine to me to have only syclblas as the backend library name. But in this case I guess all dispatching logic implemented on oneMKL Interfaces side will be duplicated inside syclblas backend, so we probably need to make sure we will remove this duplication in the future.

Copy link
Contributor

@mkrainiuk mkrainiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a minor request for documentation to make sure CPU support performance limitations are covered. Except this the initial implementation looks fine to me. Thank you!

When the SYCL BLAS project will be mature enough I think we need to revisit the integration for this backend: removing device detection duplication in oneMKL interface layer/SYCL BLAS and also considering option to make SYCL BLAS part of oneMKL Interfaces project.

@mmeterel mmeterel changed the title [BLAS] New BLAS backend: SYCL-BLAS [BLAS::SYCL-BLAS backend] New BLAS backend: SYCL-BLAS Apr 12, 2023
@mkrainiuk mkrainiuk merged commit d11dac5 into uxlfoundation:develop May 10, 2023
normallytangent pushed a commit to normallytangent/oneMKL that referenced this pull request Aug 6, 2024
)

* [BLAS] New BLAS backend: SYCL-BLAS

* Adds SYCL-BLAS backend to OneMKL
  * SYCL-BLAS is an pure-SYCL opensource BLAS implementation
* SYCL-BLAS is added as a header-only library

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
Co-authored-by: Muhammad Tanvir <muhammad.tanvir@codeplay.com>
Co-authored-by: Kumudha Narasimhan <kumudha.narasimhan@codeplay.com>
Co-authored-by: Fabio Mestre <fabio.mestre@codeplay.com>
Co-authored-by: Paolo Gorlani <paolo.gorlani@codeplay.com>
Co-authored-by: Mehdi Goli <mehdi.goli@codeplay.com>

* Remove duplicate unsupported backend in backend map

* Fix compile time dispatch testing when using sycl-blas

* Enable rotmg

* Fix issue in previous commit

* Re-enable Trsm

* Change wording in docs and exception; fix typo;

* Sycl blas backend staging device support (uxlfoundation#13)

Enable multiple devices type with the SYCL-BLAS backend.

Co-authored-by: Hugh Bird <hugh.bird@codeplay.com>

* Revert change to disable OpenCL gpu devices

* Update documentation

* [BLAS] Automate -fsycl-targets argument for SYCL-BLAS backend

* Update documentation

* Fix for CMake 3.16

* Add checks for compiler

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
Co-authored-by: Muhammad Tanvir <muhammad.tanvir@codeplay.com>
Co-authored-by: Kumudha Narasimhan <kumudha.narasimhan@codeplay.com>
Co-authored-by: Fabio Mestre <fabio.mestre@codeplay.com>
Co-authored-by: Paolo Gorlani <paolo.gorlani@codeplay.com>
Co-authored-by: Mehdi Goli <mehdi.goli@codeplay.com>
Co-authored-by: Andrew T. Barker <andrew1.barker@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants