[RFC] Cross-Platform Refactor: CPU-only implementation #1021

Open
3 tasks
rickardp opened this issue Feb 3, 2024 · 2 comments

@rickardp
Contributor

rickardp commented Feb 3, 2024

Motivation

As we want to make this library portable, the first step would be to make 100% of it run correctly on CPU only (i.e., not requiring CUDA for any part of the functionality). This would serve two purposes:

  • Provide a baseline reference implementation for contributors porting the library to new platforms
  • Provide a fallback for hardware platforms that are only partially implemented

Proposed solution

  • Implement all the CUDA kernels in "normal" C++ (see the sketch after this list)
  • Make sure the unit tests all run on the CPU as well
  • Make sure unit test coverage is satisfactory
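
As a rough illustration of the first point, below is a minimal sketch of what a CUDA-style kernel could look like when ported to plain C++: a hypothetical blockwise absmax int8 quantization routine written as ordinary loops. The function name and signature are illustrative only and not taken from the existing codebase.

```cpp
// Hypothetical CPU port of a blockwise absmax int8 quantization kernel.
// All names here are illustrative; they do not mirror the real codebase.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize each block of `block_size` floats to int8 using the block's absmax.
void quantize_blockwise_cpu(const float* in, std::int8_t* out, float* absmax,
                            std::size_t n, std::size_t block_size) {
    const std::size_t num_blocks = (n + block_size - 1) / block_size;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t start = b * block_size;
        const std::size_t end = std::min(start + block_size, n);

        // Reduction that a CUDA kernel would perform per thread block.
        float m = 0.0f;
        for (std::size_t i = start; i < end; ++i)
            m = std::max(m, std::fabs(in[i]));
        absmax[b] = m;

        // Elementwise scaling that a CUDA kernel would perform per thread.
        const float scale = (m > 0.0f) ? 127.0f / m : 0.0f;
        for (std::size_t i = start; i < end; ++i)
            out[i] = static_cast<std::int8_t>(std::lround(in[i] * scale));
    }
}

int main() {
    std::vector<float> x = {0.5f, -1.0f, 0.25f, 2.0f, -0.125f, 1.5f};
    std::vector<std::int8_t> q(x.size());
    std::vector<float> absmax((x.size() + 3) / 4);
    quantize_blockwise_cpu(x.data(), q.data(), absmax.data(), x.size(), 4);
    for (std::int8_t v : q) std::printf("%d ", static_cast<int>(v));
    std::printf("\n");
}
```

A reference version like this is slow but easy to check against the CUDA kernels in the unit tests, which is all a baseline needs to do.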

Open questions

  • Which CPU architectures do we support (x86_64 and arm64 are givens, but are there any others)?
  • How do we deal with SIMD intrinsics? Build separate libraries for each SIMD architecture, or select at run time based on CPU features? (A run-time dispatch sketch follows this list.)
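
On the second question, one possible shape for run-time selection is sketched below, assuming GCC/Clang on x86_64 and their `__builtin_cpu_supports` builtin; the function names are hypothetical. A scalar fallback is always available, and a SIMD variant is picked through a function pointer at startup:

```cpp
// Sketch of run-time SIMD dispatch; names are hypothetical.
#include <cstddef>

// Scalar fallback, always available on every architecture.
void vector_add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
// AVX2 variant: the target attribute lets the compiler emit AVX2 code for this
// one function even if the rest of the library is built for baseline x86_64.
__attribute__((target("avx2")))
void vector_add_avx2(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
#endif

using add_fn = void (*)(const float*, const float*, float*, std::size_t);

// Pick an implementation once, based on CPU features detected at run time.
add_fn select_vector_add() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2")) return vector_add_avx2;
#endif
    return vector_add_scalar;
}
```

The alternative from the same bullet, shipping one library per SIMD level and choosing at load time, avoids per-function dispatch but multiplies the build matrix; which is better probably depends on how many kernels end up needing intrinsics at all.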

@Titus-von-Koeller Feel free to edit this issue as you see fit, for example if you want a different structure for it.

@simepy

simepy commented Sep 6, 2024

@rickardp Where are we on this feature? Is some part of it already working, or are there other threads discussing it? There are not many comments here.

I'm especially interested in arm64 CPU-only support.

@rickardp
Contributor Author

Hi @simepy, sorry, not much to add here still. I am still up for contributing towards this when 1) I have time to do so and 2) the dependencies that I do not have time to contribute myself are ready to use. More specifically, the idea is to take a gradual approach and use the reference implementation where MPS acceleration is not yet implemented. Currently, large parts of this codebase require CUDA, which does not run on Apple silicon, making a partial implementation virtually unusable.
