Add an option to pin to gpu for all estimators #3526
Conversation
Signed-off-by: TJ <tix@uber.com>
Unit Test Results (with flaky tests): 905 files (−24), 905 suites (−24), 9h 54m 52s ⏱️ (+7m 34s). Results for commit 13c2b04. ± Comparison against base commit 94cd856. ♻️ This comment has been updated with latest results.
A generic param of the Spark Estimator can be put in EstimatorParams, with a getter and setter added there as well.
https://github.com/horovod/horovod/blob/master/horovod/spark/common/params.py#L56
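To make the suggestion concrete, here is a minimal hedged sketch of what such a shared param with getter and setter could look like in the pyspark.ml Params style that params.py follows; the class name, docstring, and default value are illustrative assumptions, not the merged code:

```python
# Hypothetical sketch only -- shows the pyspark.ml Params pattern used by
# EstimatorParams; names and the default value are assumptions.
from pyspark.ml.param import Param, Params, TypeConverters


class GpuPinningParamsSketch(Params):
    use_gpu = Param(Params._dummy(), 'use_gpu',
                    'whether each training process should be pinned to a GPU',
                    typeConverter=TypeConverters.toBoolean)

    def __init__(self):
        super(GpuPinningParamsSketch, self).__init__()
        self._setDefault(use_gpu=True)  # assumed default

    def setUseGpu(self, value):
        return self._set(use_gpu=value)

    def getUseGpu(self):
        return self.getOrDefault(self.use_gpu)
```

Usage would then mirror the other params in that file, e.g. `est.setUseGpu(False)` or passing `use_gpu=False` through the estimator constructor.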
Signed-off-by: TJ <tix@uber.com>
LGTM
@EnricoMi Can you help do a sanity check as well?
I have set this to draft; press "Ready for review" when you are happy to see this merged.
The fix for the Horovod incompatibility on Ray Tune's side was checked in just now. I'm running the integration test and will merge this once that's done.
Some minor improvements are possible.
@@ -194,7 +195,10 @@ def on_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -
          f"Val rows: {val_rows}, Val batch size: {val_batch_size}, Val_steps_per_epoch: {_val_steps_per_epoch}\n"
          f"Checkpoint file: {remote_store.checkpoint_path}, Logs dir: {remote_store.logs_path}\n")

+     cuda_available = torch.cuda.is_available()
+     if not should_pin_gpu and verbose:
+         print("Skip pinning current process to the GPU.")
Why not use the logger? Why is there a verbose flag when there is a logger?
The logger doesn't write to stdout and stderr properly in this function since it runs in a Ray executor. The train_logger is for passing specialized loggers (ones that don't write to stdout and stderr directly) to PyTorch Lightning. I have tried using some generic logger modules here; they either failed to serialize or produced no output.
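As a rough illustration of that split (not the estimator's actual wiring; the TensorBoardLogger and its save_dir are assumptions): a picklable, file-backed logger goes to the Lightning Trainer, while plain print() is used for progress messages because it reliably reaches the Ray worker's stdout.

```python
# Illustration only: a specialized logger for Lightning, print() for progress.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

train_logger = TensorBoardLogger(save_dir="/tmp/lightning_logs")  # assumed path
trainer = Trainer(logger=train_logger, max_epochs=1)
print("Trainer configured with a specialized train_logger.")
```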
Signed-off-by: TJ <tix@uber.com>
rename pin_gpu to use_gpu
Signed-off-by: TJ <tix@uber.com>
Some builds are failing.
@EnricoMi Do you know if this is because NVIDIA's key expired, or do we need to refresh something on our side?
The integration test passed on my side. Changing this back to the ready state.
@Tixxx This is something NVIDIA has to fix on their side. @romerojosh @nvcastet NVIDIA's CUDA Docker image cannot run apt-get update because of the expired repository key.
Is this something you can escalate?
LGTM! Please wait to merge until CI is fixed.
Signed-off-by: TJ <tix@uber.com>
Looks like NVIDIA rolled out a key-rotation mechanism. I will try to fix it by downloading the corresponding keys.
@Tixxx this is how @tgaddair fixed it for Ludwig: ludwig-ai/ludwig@1a4f679
@Tixxx looks like your fix works!
Yeah, I tried that first, but we also need to support 18.04, which doesn't have wget installed by default. Installing it requires apt-get update, which fails with the invalid key. My fix is the alternative approach from NVIDIA, which hopefully works for both 18.04 and 20.04.
Signed-off-by: TJ <tix@uber.com>
Signed-off-by: TJ <tix@uber.com>
Checklist before submitting
Description
Sometimes the estimator is created on a GPU host, which initializes the model weights on one device. Pinning to a GPU again in the remote trainer can then produce strange behaviors, such as a "Variable resource not found" error in TensorFlow, because TF thinks the model weights were initialized on a different device.
Fixes #3524
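As a hedged usage sketch of the new option on a Spark estimator (the keyword follows the pin_gpu → use_gpu rename above; the model, columns, and hyperparameters are placeholders, and the final argument spelling should be checked against the merged API):

```python
# Hypothetical usage sketch -- values are placeholders, not a tested config.
import torch
import horovod.spark.torch as hvd_spark_torch

model = torch.nn.Linear(10, 1)                      # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.functional.mse_loss

est = hvd_spark_torch.TorchEstimator(
    num_proc=4,
    model=model,
    optimizer=optimizer,
    loss=loss_fn,
    feature_cols=['features'],
    label_cols=['label'],
    batch_size=32,
    epochs=1,
    use_gpu=False,   # new option from this PR: skip pinning remote workers to GPUs
)
# In practice a store= (and usually backend=) argument is also needed before
# calling est.fit(train_df) on a Spark DataFrame prepared elsewhere.
```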
Review process to land