
Feature : Add SYCL runtime support #747

Open
wants to merge 69 commits into base: sycl-refactor
Conversation

abhilash1910
Copy link
Contributor

As part of the plan to support an extensive ecosystem spanning Hugging Face and the LoRA/LLaMA families, this addition would enable this great framework to run seamlessly on our GPU cards. The discussion started from the merged PR artidoro/qlora#219, and this PR would eventually enable low-precision quantization. (The PR is in an initial draft state.)
Tagging @TimDettmers . Thanks for creating this repository.

@abhilash1910
Copy link
Contributor Author

@Titus-von-Koeller based on issue #894, there are some abstractions specific to certain Intel devices that are to be selectively implemented in IPEX, but for the majority of ops/kernels to be supported across all SYCL devices this integration is needed. This is a WIP, as it requires some dependencies on Intel LLVM to be resolved to make it runnable across SYCL devices. In general, this is a broader approach catering to the SYCL runtime. @jianan-gu

@abhilash1910 abhilash1910 changed the title Feature : Add Intel GPU/SYCL runtime support Feature : Add SYCL runtime support Dec 13, 2023
@Titus-von-Koeller
Copy link
Collaborator

Dear @abhilash1910,

Thanks for submitting this PR and flagging the connection with #894, really appreciated.

Would you be so kind as to give a bit more background and describe in detail what your PR is aiming to solve (once it's complete) and what roadblocks you see before getting there?

Be aware that, contrary to Tim, my systems programming knowledge is not that strong yet, but once I understand the above better, I'm happy to coordinate and help along this contribution.

Could you please elaborate specifically where you see overlap + hurdles to resolve relative to #894?

@abhilash1910
Copy link
Contributor Author

Hi @Titus-von-Koeller
Thank you.
Yes, sure. The main aim of this PR is to provide a common runtime for SYCL kernels: this includes device-agnostic kernels that can be made to run on the SYCL runtime. I had started this initially for our Intel GPU devices, which would now be partly covered by the IPEX module (specified in #894). For users and developers who are not using the IPEX runtime entirely and want bare-metal performance on (nearly all) devices supporting the SYCL runtime, this PR is needed.
Although the IPEX pathway would provide some intrinsic optimizations through IPEX kernels (and some specific ISAs that will benefit certain sets of our GPU lineup), this is a more generic standalone implementation that is decoupled from IPEX and can be compiled and run separately. The entirety of this port relies on Intel LLVM's clang, which is also being modified to provide comparable bare-metal APIs. SYCL is a runtime built on top of OpenCL (the de facto runtime supported by the majority of CPUs) and can act as a middleware layer compatible with NVIDIA, AMD, and other devices supporting the SYCL runtime (from Khronos).
The initial motivation was to provide support for our devices specifically (GPU sets), but with the growing adoption of oneAPI and SYCL on all device backends, this would provide another option for users who want to rely on pure compiler runtimes for performance.
As of yet, there are not many compatibility hurdles compared to my colleague's PR (#894), as much of the IPEX work would live in the IPEX framework (including quant/dequant). But when it comes to certain functionality issues, our compiler/SYCL backend and LLVM are being updated with more robust APIs for translation and analysis, which is why the PR is in draft mode. I am working on the compiler support as well, and since most of the translations are in place, the remaining part will be done soon.
Let me know your thoughts and appreciate your help and responses. :)

@TimDettmers
Copy link
Collaborator

This is great, thank you so much for your contribution!

Can you run me through how a user would use/compile this for their Intel device? A step-by-step procedure would be great to give users an easy overview.

@TimDettmers TimDettmers added medium priority (will be worked on after all high priority issues) Medium risk Risk of bugs in transformers and other libraries labels Jan 2, 2024
@abhilash1910
Copy link
Contributor Author

Hi @TimDettmers, @Titus-von-Koeller, as my fellow colleagues are working on the common device abstraction properties, this is an alternative addition to support the SYCL backend across vendors. Since here we are only abstracting the SYCL kernels, there is a need for a dedicated CMake setup for building only the kernels. Compilation would require a standalone CMake configuration with the following options:

  • Intel oneAPI Base Toolkit (latest release)
  • IntelSYCL and IntelLLVM (present within the Base Toolkit)
  • the C++17 standard, as SYCL has a deep linkage to it
  • the icpx and icx compilers
  • -fsycl targets for SPIR-V (Intel), PTX (NVIDIA), and amdgcn (AMD); a configure sketch follows below
  • certain compiler-level optimizations (narrowing/lowering, etc.)

This work will be a parallel, separate effort from my colleagues' existing effort, meeting the requirements of customers who are interested in bare SYCL performance, and gradually more SYCL organizations will be contributing kernels to this PR. I will be sending out a detailed description of our SYCL backend support. Also, to get a better idea of this, @Titus-von-Koeller @TimDettmers, would you be available to discuss the strategies via a meeting: firstly the common device strategy that my colleagues are working on, and especially the SYCL kernel path, which is this PR.
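For concreteness, here is a minimal sketch of how such a standalone configure step might be driven; icpx, -fsycl, and -fsycl-targets are real DPC++ options, but the project layout, cache variables, and the exact set of target triples are illustrative assumptions rather than this PR's actual build files.

```python
import subprocess

# Hypothetical standalone configure + build for the SYCL kernels only.
# The -fsycl-targets triples request SPIR-V (Intel), PTX (NVIDIA), and
# amdgcn (AMD) device code from a single compilation.
subprocess.run(
    [
        "cmake", "-B", "build", "-S", ".",
        "-DCMAKE_CXX_COMPILER=icpx",
        "-DCMAKE_CXX_STANDARD=17",
        "-DCMAKE_CXX_FLAGS="
        "-fsycl -fsycl-targets=spir64,nvptx64-nvidia-cuda,amdgcn-amd-amdhsa",
    ],
    check=True,
)
subprocess.run(["cmake", "--build", "build", "-j"], check=True)
```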

@matthewdouglas
Copy link
Member

@abhilash1910 I have a few questions.

  1. What would be the advantage of using the PTX (Nvidia) target on the SYCL implementation vs. existing CUDA backend? I can understand the appeal to have a single abstraction across hardware vendors, but I just can't see shifting away from the CUDA reference implementation. Further, as far as I understand, there's no way to target Apple Silicon.
  2. Does this require the plugin from Codeplay to target AMD when using the Intel compiler? I notice that the Codeplay AMD plugin is still labeled as being "beta."
  3. Is it possible to build with AdaptiveCpp (aka hipSYCL / Open SYCL)?

@abhilash1910
Copy link
Contributor Author

@abhilash1910 I have a few questions.

  1. What would be the advantage of using the PTX (Nvidia) target on the SYCL implementation vs. existing CUDA backend? I can understand the appeal to have a single abstraction across hardware vendors, but I just can't see shifting away from the CUDA reference implementation. Further, as far as I understand, there's no way to target Apple Silicon.
  2. Does this require the plugin from Codeplay to target AMD when using the Intel compiler? I notice that the Codeplay AMD plugin is still labeled as being "beta."
  3. Is it possible to build with AdaptiveCpp (aka hipSYCL / Open SYCL)?

@matthewdouglas Thanks for the questions. I will try to answer them in the best way possible:

  1. Currently we are ramping up SYCL performance on different clang compilation targets, and the intent is to benchmark kernel performance against CUDA. It is not only about a common abstraction at the compiler level: the findings here allow us to optimize LLVM to a great extent and in turn percolate optimizations to SYCL and the "xpu" logic as a whole (since xpu is built on top of the SYCL abstraction). Regarding adoption, there is an overlap around running bare (SYCL) kernels on different Intel GPU variations (which will probably come up for b&b once the Pythonic development is completed in the other PR) and customer demands. In terms of community, we have been getting a lot of requests; for CUDA SYCL, it is for internal and external benchmarking purposes. As of yet, SYCL does not target Apple.
  2. I am working with Codeplay on the internal adoption to support amdgcn, and currently the PR should help optimize the backend for AMD SYCL as well. We have been targeting SYCL across GPU vendors to see the stability of the common kernel code.
  3. AdaptiveCpp is something we are discussing internally, specifically for AMD, but once we have established a SYCL interface, we will go ahead with that discussion and strategy.

On another note, I would like to add that SYCL support specifically for kernels would eventually be one of the avenues for the ongoing Intel GPU upstream effort. And having a SYCL backend opens up not only Intel devices but other devices as well for competitive benchmarking.

@abhilash1910
Copy link
Contributor Author

abhilash1910 commented Mar 30, 2024

Seeing the development update in #898 on a separate branch, I agree that branch separation of backends will be helpful. While the common device abstraction is the probable goal to merge, I think this PR would benefit from a standalone SYCL branch. Since we plan to integrate full SYCL support across all kernel adaptations, this would be helpful to maintain as a separate branch. Hence I am proposing a separate branch: SYCL.
Would appreciate your thoughts, @TimDettmers @Titus-von-Koeller @matthewdouglas .
cc @jgong5 @yao-matrix
PS: The PR will still take some time to complete and will eventually be assisted by Codeplay.

@abhilash1910
Copy link
Contributor Author

Yes, I think these are the prospective answers:

  1. This is something that really requires a lot of collaborative thought. SYCL, being a language that supports hardware semantics, can be leveraged (through different device-specific CMake args) to build for separate runtimes and devices. When the torch extensions arising from these separate CMake builds are used to build the .so files, we can plug into the torch dispatcher as we normally would in any application. The only catch is that there would be separate binaries for separate devices, which need wrapping with torch. I would also ping @JackAKirk @npmiller @Ruyk @AerialMantis for their thoughts on this.

  2. Yes, this is very interesting performance-wise, specifically for GEMM kernels: how large the perf gap is compared to the pure cuBLAS implementation. It would be the SYCL runtime compiled for PTX targets vs. native CUDA runtime performance. This would give a good idea of where to improve SYCL standards in general.

  3. SYCL, being a language in its entirety, does need extensions to be pluggable with torch. The extensions would come after we compile the binaries (from SYCL) and link the device-specific binaries with torch extensions. For instance, we have DPCPPExtension for SYCL for SPIR-V (Intel targets) and standard CUDA extensions for CUDA (or, in some cases, HIP). The latter case requires the binaries to be compatible with PTX/CUDA or amdgcn targets. (A setup sketch follows after this list.)

  4. There are a couple of considerations here, since we have an open PR to torch for Intel GPU support (which contains SYCL kernels). It is actually a separate, non-interfering branch that requires some thought. (I am keeping this open for now until I settle on the best way, if intersection is possible.)
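As a rough illustration of the extension route in point 3, a setup script might look like the sketch below. DPCPPExtension/DpcppBuildExtension are the IPEX-provided SYCL build helpers named above, but the import path, module name, and source file here are assumptions for illustration, not this PR's actual packaging.

```python
from setuptools import setup

# Assumed import path for IPEX's SYCL (DPC++) build helpers.
from intel_extension_for_pytorch.xpu.cpp_extension import (
    DPCPPExtension,
    DpcppBuildExtension,
)

# Hypothetical extension module and source names, for illustration only.
setup(
    name="bnb_sycl_ext",
    ext_modules=[
        DPCPPExtension(
            "bnb_sycl_ext",
            sources=["bnb_sycl_kernels.cpp"],
            extra_compile_args=["-fsycl"],
        )
    ],
    cmdclass={"build_ext": DpcppBuildExtension},
)
```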

@abhilash1910
Copy link
Contributor Author

@Titus-von-Koeller A detail on this PR: I am nearing completion of the kernels. Since the LOC is quite big for this PR, I think the PyTorch aspect of linking the binaries can be extended in a separate PR. I am open to suggestions and could add it here as well, but that would make the review process very hectic.
Currently the kernels are almost finalised (barring some small changes); it would be great if the review process could also be started.
I have a separate collection of kernel-wise test scripts (in native C++, without invoking the PyTorch backend) which I can add as a benchmark test suite for SYCL. I'd appreciate thoughts from other reviewers as well. Thanks

@abhilash1910 abhilash1910 marked this pull request as ready for review June 7, 2024 13:44
@matthewdouglas
Copy link
Member

One point to note is that bitsandbytes isn't built as a PyTorch extension currently for any of the backends. Instead the shared lib exports a C API that is invoked with Python's ctypes module.

This SYCL backend also does not need to link to libtorch or even python, and since it exposes the same API (see pythonInterface.cpp), it should be possible to just build multiple targets (e.g. spirv, ptx, amdgcn) and load the correct .so at runtime; no CUDAExtension or DPCPPExtension needed.
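For illustration, a minimal sketch of that runtime selection, assuming hypothetical per-target library names and an invented entry point; the real bitsandbytes loader logic differs in its details.

```python
import ctypes
import os

# Hypothetical per-target binaries built from the same SYCL sources.
_LIBS = {
    "spirv": "libbitsandbytes_sycl_spirv.so",
    "ptx": "libbitsandbytes_sycl_ptx.so",
    "amdgcn": "libbitsandbytes_sycl_amdgcn.so",
}

def load_backend(target: str) -> ctypes.CDLL:
    """Load the shared library matching the detected device target."""
    path = os.path.join(os.path.dirname(__file__), _LIBS[target])
    lib = ctypes.CDLL(path)
    # The library exports a plain C API; an (invented) entry point would
    # be declared before use, e.g.:
    # lib.cdequantize_blockwise.restype = None
    return lib
```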

@abhilash1910
Copy link
Contributor Author

One point to note is that bitsandbytes isn't built as a PyTorch extension currently for any of the backends. Instead the shared lib exports a C API that is invoked with Python's ctypes module.

This SYCL backend also does not need to link to libtorch or even python, and since it exposes the same API (see pythonInterface.cpp), it should be possible to just build multiple targets (e.g. spirv, ptx, amdgcn) and load the correct .so at runtime; no CUDAExtension or DPCPPExtension needed.

Yes, this is something I noticed, and in that case the kernels can be built for multiple targets. At runtime the kernels can be offloaded without linking to the respective PyTorch bindings for the devices.
Thanks for the details @matthewdouglas

@abhilash1910
Copy link
Contributor Author

abhilash1910 commented Jul 29, 2024

Hi @Titus-von-Koeller, pinging for a review. The code is now updated with the latest compiler. After some internal discussion, this PR will consist of the SYCL kernels only, so that they can be abstracted across vendors. Since the kernels are completed, I would request that the review process be started; there is also a test suite that has been created for SYCL, and I wanted to discuss how to add that. The NVIDIA build feature is also added. There is also a separate repository containing iGPU/CUDA builds of this fork with Python bindings: https://github.com/abhilash1910/bitsandbytes_sycl_samples

FYI: There is another PR from my colleague with XPU support, #1257, which uses IPEX for XPU (Intel GPU) support only. We are discussing better ways to formulate it on the multi-backend refactor branch.

@abhilash1910
Copy link
Contributor Author

abhilash1910 commented Aug 7, 2024

Pinging @Titus-von-Koeller @matthewdouglas with latest updated code.
cc @JackAKirk , @npmiller
(@zhuhong61 @Liangliang-Ma for awareness)

@Titus-von-Koeller
Copy link
Collaborator

Titus-von-Koeller commented Aug 8, 2024

Dear @abhilash1910,

Thanks for this valuable contribution. I see you've made major progress, impressive!

We (the bitsandbytes team) are under very tight bandwidth constraints at the moment, so we won't have time to deep dive into it just yet. Currently we're both trying to get out of a maintenance impasse and preparing for the Intel CPU+XPU and AMD ROCm alpha release. Additionally, we're working on infra, packaging, and testing improvements. Also, the multi-backend work still needs a major refactor, with everyone waiting for us to finalize the backend abstraction via a move to the torch.library API for registering custom ops and dispatching through the PyTorch dispatcher.

These things currently take priority.

We'll however try to take a look as soon as we can and with @matthewdouglas having just been added to the team full-time, we're in a better position going forward.
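Regarding the torch.library move mentioned above, here is a minimal sketch of that registration pattern, with a hypothetical namespace, op name, schema, and placeholder math; it shows the mechanism only, not the planned bitsandbytes ops.

```python
import torch

# Define a custom op schema in a hypothetical "bnb" namespace.
lib = torch.library.Library("bnb", "DEF")
lib.define("dequantize_blockwise(Tensor a, Tensor absmax) -> Tensor")

def _dequantize_blockwise_cpu(a, absmax):
    # Placeholder math, not the real dequantization algorithm.
    return a.float() * absmax

# Register a per-backend implementation; the PyTorch dispatcher then
# routes torch.ops.bnb.dequantize_blockwise by the tensors' device.
lib.impl("dequantize_blockwise", _dequantize_blockwise_cpu, "CPU")

out = torch.ops.bnb.dequantize_blockwise(torch.ones(4), torch.tensor(0.5))
```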

@Titus-von-Koeller
Copy link
Collaborator

There is also a test suite that has been created for SYCL, and I wanted to discuss how to add that

@abhilash1910 are these the tests you were referring to?

@abhilash1910
Copy link
Contributor Author

abhilash1910 commented Aug 8, 2024

Hi @Titus-von-Koeller
Thanks for the details. Yes, it is in fact just the public version of it; there is an internal testing suite as well. The plan is to add that to my personal repository until we see how to add it to sycl-refactor.
For the XPU logic addition, I believe some of these kernels would eventually get used for #1257. I am aware of the major refactoring going on for the multi-backend work, and for XPU specifically we are currently prototyping the full integration.
Some context on this work: currently we are examining kernel-by-kernel performance, primarily on A100 / iGPU (datacenter), for public competitive benchmarking purposes. I believe that down the line this branch will evolve into a SYCL-runtime-only version of bitsandbytes for compiler-level performance and analysis. In the meantime, I would appreciate triggering the CI.
I appreciate the work you guys are doing, and I am also looking into the XPU upstreaming logic internally for the multi-backend refactor.
