GPU CI #138
Comments
This sounds good to me 👍 Do we also want to include this functionality in |
I think we will want this for both. There are serialization functions, UCX-Py (as you mentioned), etc. in Distributed that would be good to test |
Also, thanks for raising this issue and coordinating with the rest of the RAPIDS team. Let me know if there's anything I can do to help out with this effort |
It'd be good if we could put something like this in place for dask-image, too. Ad hoc local testing isn't going very well.
Seems like you guys have a solution already, but FWIW I'll share my two cents. I created a service for problems like these, which basically runs custom machines (including GPUs) in GitHub Actions: https://cirun.io/ We are about to use it in the sgkit project here: https://github.com/pystatgen/sgkit/pull/567/checks?check_run_id=2618833216 It is fairly simple to set up: all you need is a cloud account (AWS or GCP) and a simple yaml file describing what kind of machines you need, and Cirun will spin up ephemeral machines on your cloud for GitHub Actions to run. It's native to the GitHub ecosystem, which means you can see logs and trigger runs in GitHub's interface itself, just like any GitHub Actions run.
Bumping this as we had a similar issue happen resulting in rapidsai/dask-cuda#634; do folks have a preference between gpuCI and Cirun? |
I think it would be good to experiment with Cirun. gpuCI is going to require us to get some time from NVIDIA ops folks, which we don't have quite yet.
Great! @aktech would you be willing to help set this up on Distributed? We should probably sync offline to discuss account / cloud provider set up |
@charlesbluca Sure we can catch up offline for account. Meanwhile I can get it working with a personal AWS account. |
Here is a run of the Dask Distributed CI (for Python 3.7) on GPU via Cirun.io: Here is the branch |
Should I go ahead and create a PR for the full matrix on distributed? |
I would suggest we stick with one version of Python (maybe 3.8) and the latest available of everything else (cupy/cuda/etc).
After some internal conversation with NVIDIA ops folks, we can confirm that gpuCI does have the capacity to run tests for Dask and Distributed - the tests would be triggered both on commits to the main branch, as well as on PRs opened by a set of approved users (this would probably start out as the members of the Dask org, and could be expanded later on). Currently, we are working on getting this set up on my own forks of Dask/Distributed; here are PRs adding the relevant gpuCI scripts:
Once testing is working on these forks, we can manage the required permissions/webhooks and merge these branches into upstream Dask/Distributed. With this option available, we probably don't need to use Cirun for the time being, though it still seems like a good option if we intend to expand GPU testing far beyond the current repos in question.
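As a rough illustration of the trigger policy described above (tests run on commits to the main branch and on PRs opened by approved users), here is a minimal Python sketch. It is not the actual gpuCI implementation: the `Event` type, the `should_run_gpu_tests` function, and the user names in the allow-list are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Hypothetical allow-list. The thread only says this would probably start
# out as the members of the Dask org and could be expanded later.
APPROVED_USERS = frozenset({"dask-member-a", "dask-member-b"})


@dataclass
class Event:
    """A simplified, hypothetical view of a repository event."""
    kind: str          # "push" or "pull_request"
    branch: str = ""   # branch that received the commit (push events)
    author: str = ""   # user who opened the PR (pull_request events)


def should_run_gpu_tests(event: Event, approved=APPROVED_USERS) -> bool:
    """Return True if the event should trigger GPU tests automatically."""
    if event.kind == "push":
        # Commits to the main branch always trigger a run.
        return event.branch == "main"
    if event.kind == "pull_request":
        # PRs trigger a run only when opened by an approved user.
        return event.author in approved
    # Any other event type does not trigger a run in this sketch.
    return False
```

Under this sketch, a PR from a user outside the allow-list simply does not start a run on its own; how such PRs get approved is a separate question.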
@charlesbluca That's some excellent news! Looking forward to it. I'll close the Cirun PR, feel free to let me know if I can help with anything or need setting up Cirun for any other project. |
Thanks @charlesbluca @aktech for your continued work on this. @charlesbluca FWIW I don't have a strong opinion on gpuCI vs. Cirun. I suspect you and folks around you have the most context/expertise to make that decision
Is it possible to also trigger CI runs on PRs from non-Dask org members through some other mechanism like having |
Yes, we will have a mechanism to test PRs from non-Dask org members. We'll see the comment "can one of the admin please verify the patch" and an admin would then respond with In order to enable gpuCI, please do the following to each repo:
I'm planning on making these changes later this afternoon unless there are objections |
Thanks for the update @quasiben -- that sounds good to me |
Note we are using a docker image with pre-installed dependencies here: |
The gpuCI seems to have been running pretty well on the main dask repository recently - perhaps it's a good time to open a conversation about whether we could do the same on the dask-image repository? cc @jakirkham & @quasiben |
Thanks for the ping @GenevieveBuckley -- we can bring this up with ops folks, see what their capacity is, and try to find out about dask/distributed usage. Can you characterize the activity on dask-image? My impression is that PRs are in the weekly to monthly range.
Thanks Ben!
This impression is relatively accurate, it's pretty low traffic. Here's the code frequency graph: https://github.com/dask/dask-image/graphs/code-frequency |
Bumping this issue to say that we've also been considering setting up gpuCI for dask-ml, as there has been recent work getting cuML integrated there (dask/dask-ml#862); this was discussed earlier in the Dask monthly meeting. cc @TomAugspurger hoping that we can continue that conversation here? |
SGTM. Just let me know if there's anything I need to enable on the project settings side of things. |
Sure! In general, #138 (comment) breaks down the admin tasks that need to happen on the repo, but that can wait until we:
Could you give an idea of the PR frequency on the repo? Trying to gauge if we would prefer having gpuCI run on all PRs, or just those where a trigger phrase is commented (i.e. |
Should we leave this open or close it now that we have GPU CI up and running? |
I think this should be good to close. At this point, we've generally handled maintenance in separate issues, and requests to add gpuCI to additional repos can be opened in follow-up issues here or in the relevant repos.
We've been chatting with folks from the ops teams within RAPIDS about getting access to the gpuCI infrastructure. gpuCI is the GPU-based CI platform used for testing throughout the RAPIDS ecosystem. We've been asking for access for a couple reasons:
Distributed only and the testing occurs in an out-of-band manner. That is, we test GPU and UCX bits of Distributed in ucx-py and dask-cuda. This is better than no testing; however, it is limited to Distributed and only happens when developers push changes to dask-cuda/ucx-py.
Gaining access to gpuCI resolves both of these problems and will allow us to test incoming PRs to Dask, ensuring GPU support is maintained without breakages and undue burdens.
While we are talking with ops folks, we've suggested that the testing matrix is a single row:
This service will start off as something maintainers can ping if they think a PR might need GPU testing. This might include changes to array/dataframe functions or new functionality. While this is not the ideal solution, it is a step towards getting better GPU testing for Dask without much effort on the part of the maintainers.
For this to work, a bot, gputester from gpuCI, will need to have at least “triage” rights to monitor comments and respond with pass/fail notifications to the PR in question.
cc @pentschev @jrbourbeau