
Cleanup CUDA implementation a bit #199

Closed
wants to merge 6 commits

Conversation

gonzalobg
Contributor

  • Refactors all kernels into a generic "parallel for" algorithm
    • Supports grid-stride and block-stride loops, configurable with a model flag
    • Handles devices of different sizes via the occupancy APIs
  • Refactors the memory allocation APIs
  • Prints more GPU details, in particular the theoretical peak bandwidth of the current device in GB/s, obtained via the NVML library (which ships with the CUDA Toolkit and is therefore always available)
  • Fixes two bugs:
    • Prints the "order" used to run the benchmarks (e.g. classic vs. isolated)
    • Fixes a division-by-zero bug in the solution checking
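The "parallel for" refactor described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the identifiers `parallel_for` and `parallel_for_kernel` are hypothetical. A grid-stride loop combined with `cudaOccupancyMaxPotentialBlockSize` is one common way to make a single kernel adapt to devices of different sizes:

```cuda
#include <cstddef>

// Hypothetical sketch: one generic kernel that applies a callable f
// to every index in [0, n), regardless of launch configuration.
template <typename F>
__global__ void parallel_for_kernel(std::size_t n, F f) {
  // Grid-stride loop: each thread starts at its global index and
  // strides by the total number of threads in the grid, so any grid
  // size covers any n.
  const std::size_t stride = (std::size_t)gridDim.x * blockDim.x;
  for (std::size_t i = (std::size_t)blockIdx.x * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    f(i);
  }
}

// Host-side wrapper: size the launch with the occupancy API so the
// grid saturates whichever device is current.
template <typename F>
void parallel_for(std::size_t n, F f) {
  int min_grid_size = 0, block_size = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                     parallel_for_kernel<F>);
  parallel_for_kernel<<<min_grid_size, block_size>>>(n, f);
}
```

A block-stride variant would instead have each thread process a contiguous chunk per block iteration. The theoretical peak bandwidth mentioned above is conventionally computed as 2 × memory clock × bus width in bytes (the factor of 2 accounting for DDR), whether the clock and width are queried through NVML or through `cudaDeviceProp`.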

@gonzalobg
Contributor Author

CI was passing earlier. This and other PRs now seem to be failing spuriously due to a cache issue @tom91136 @tomdeakin

@gonzalobg
Copy link
Contributor Author

Closing in favor of #202

@gonzalobg gonzalobg closed this Jun 5, 2024