
Add an option to pin to gpu for all estimators #3526

Merged: 6 commits merged into master on Apr 30, 2022
Conversation

Tixxx
Collaborator

@Tixxx Tixxx commented Apr 25, 2022

Signed-off-by: TJ <tix@uber.com>

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Sometimes the estimator is created on a GPU host, which initializes the model weights on one device. Pinning to a GPU again in the remote trainer can then cause strange behaviors, such as a "Variable resource not found" error in TensorFlow, since TF thinks the model weights were initialized on a different device.
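For context, here is a minimal sketch of what "pinning to GPU" means in the remote trainer and how an opt-out flag avoids re-pinning a model whose weights already live on a device. Names such as _remote_train_fn and should_use_gpu are illustrative assumptions, not the exact Horovod code.

    import horovod.torch as hvd
    import torch

    def _remote_train_fn(model, should_use_gpu=True):
        hvd.init()
        if should_use_gpu and torch.cuda.is_available():
            # Pin this worker process to the GPU matching its Horovod local rank.
            torch.cuda.set_device(hvd.local_rank())
            model.cuda()
        # If the estimator was already created on a GPU host, skipping this
        # step avoids moving weights the framework believes live elsewhere.
        # ... training loop elided ...
        return model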

Fixes #3524

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: TJ <tix@uber.com>
@Tixxx Tixxx requested review from irasit and chongxiaoc April 25, 2022 04:41
@github-actions

github-actions bot commented Apr 25, 2022

Unit Test Results

821 files (-28)    821 suites (-28)    9h 35m 21s ⏱️ (+11m 28s)
768 tests (±0):    725 ✔️ (±0)    43 💤 (±0)    0 (±0)
18 950 runs (-738):    13 655 ✔️ (-468)    5 295 💤 (-270)    0 (±0)

Results for commit 13c2b04. ± Comparison against base commit 94cd856.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Apr 25, 2022

Unit Test Results (with flaky tests)

905 files (-24)    905 suites (-24)    9h 54m 52s ⏱️ (+7m 34s)
768 tests (±0):    725 ✔️ (±0)    43 💤 (±0)    0 (±0)
21 164 runs (-526):    14 999 ✔️ (-292)    6 165 💤 (-234)    0 (±0)

Results for commit 13c2b04. ± Comparison against base commit 94cd856.

♻️ This comment has been updated with latest results.

Collaborator

@chongxiaoc chongxiaoc left a comment


Generic params of the Spark Estimator can be put in EstimatorParams, and the getter and setter can be added there as well.
https://github.com/horovod/horovod/blob/master/horovod/spark/common/params.py#L56
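For illustration, a hedged sketch (not the exact code in this PR) of how such a generic param with getter and setter could look in EstimatorParams, following the pyspark.ml Params pattern used in that file. The use_gpu name matches the rename later in this PR; the doc string and default are assumptions.

    from pyspark.ml.param import Param, Params

    class EstimatorParams(Params):
        # Sketch only: the real params.py defines many more params.
        use_gpu = Param(Params._dummy(), 'use_gpu',
                        'whether the remote trainer should pin to and train on a GPU')

        def __init__(self):
            super(EstimatorParams, self).__init__()
            self._setDefault(use_gpu=True)

        def setUseGpu(self, value):
            return self._set(use_gpu=value)

        def getUseGpu(self):
            return self.getOrDefault(self.use_gpu)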

Signed-off-by: TJ <tix@uber.com>
@Tixxx Tixxx requested a review from chongxiaoc April 25, 2022 18:12
Collaborator

@chongxiaoc chongxiaoc left a comment


LGTM

@chongxiaoc chongxiaoc requested a review from EnricoMi April 27, 2022 17:24
@chongxiaoc
Collaborator

@EnricoMi Can you help do a sanity check as well?

@chongxiaoc
Collaborator

@EnricoMi Please hold off on merging this PR for now.
I just synced with @Tixxx; we need to wait for Ray Tune to run an integration test against this PR.

@EnricoMi EnricoMi marked this pull request as draft April 27, 2022 18:59
@EnricoMi
Collaborator

Please hold off on merging this PR for now.

I have set this to draft; press "Ready for review" when you are happy to see this merged.

@Tixxx
Collaborator Author

Tixxx commented Apr 27, 2022

The fix for the Horovod incompatibility on Ray Tune's side was just checked in. I'm running the integration test now and will merge once that's done.

Collaborator

@EnricoMi EnricoMi left a comment


Some minor improvements possible.

horovod/spark/common/params.py (review thread outdated; resolved)
horovod/spark/keras/remote.py (review thread outdated; resolved)
horovod/spark/keras/remote.py (review thread outdated; resolved)
horovod/spark/lightning/remote.py (review thread outdated; resolved)
@@ -194,7 +195,10 @@ def on_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -
f"Val rows: {val_rows}, Val batch size: {val_batch_size}, Val_steps_per_epoch: {_val_steps_per_epoch}\n"
f"Checkpoint file: {remote_store.checkpoint_path}, Logs dir: {remote_store.logs_path}\n")

cuda_available = torch.cuda.is_available()
if not should_pin_gpu and verbose:
    print("Skip pinning current process to the GPU.")
Collaborator


Why not use the logger? Why is there a verbose flag when there is a logger?

Collaborator Author


The logger doesn't write to stdout and stderr properly in this function since it runs in a Ray executor. The train_logger is for passing specialized loggers (ones that don't write to stdout and stderr directly) to PyTorch Lightning. I have tried using some generic logger modules here; they either failed to serialize or produced no output.

Signed-off-by: TJ <tix@uber.com>
horovod/spark/torch/remote.py (review thread outdated; resolved)
horovod/spark/torch/remote.py (review thread outdated; resolved)
horovod/spark/common/params.py (review thread outdated; resolved)
rename pin_gpu to use_gpu

Signed-off-by: TJ <tix@uber.com>
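A hypothetical usage sketch following the pin_gpu to use_gpu rename above. Constructor arguments are abbreviated and illustrative; only the use_gpu option comes from this PR.

    import torch
    import torch.nn.functional as F
    from horovod.spark.torch import TorchEstimator

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    estimator = TorchEstimator(
        model=model,
        optimizer=optimizer,
        loss=F.mse_loss,
        feature_cols=['features'],
        label_cols=['label'],
        batch_size=32,
        epochs=1,
        use_gpu=False)  # skip pinning the remote trainer process to a GPU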
@Tixxx
Collaborator Author

Tixxx commented Apr 28, 2022

Some builds are failing with:

#6 4.879 W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
  #6 4.879 E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease' is not signed.

@EnricoMi Do you know if this is because NVIDIA's key expired, or do we need to refresh something on our side?

@Tixxx Tixxx marked this pull request as ready for review April 28, 2022 18:13
@Tixxx
Collaborator Author

Tixxx commented Apr 28, 2022

The integration test passed on my side. Changing this back to the ready state.

@EnricoMi
Collaborator

@EnricoMi Do you know if this is because NVIDIA's key expired, or do we need to refresh something on our side?

@Tixxx This is something NVidia has to fix on their side.

@romerojosh @nvcastet NVidia's CUDA Docker image cannot run apt-get update because the key for one of NVidia's repos is invalid:

docker run --rm -it nvidia/cuda:10.0-devel-ubuntu18.04 apt-get update
...
Err:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease                        
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC

Something you can escalate?

Collaborator

@EnricoMi EnricoMi left a comment


LGTM! Please wait to merge until CI is fixed.

Signed-off-by: TJ <tix@uber.com>
@Tixxx
Collaborator Author

Tixxx commented Apr 29, 2022

Looks like NVIDIA rolled out a key rotation mechanism. I will try to fix it by downloading the corresponding keys.

@EnricoMi
Collaborator

@Tixxx this is how @tgaddair fixed it for Ludwig: ludwig-ai/ludwig@1a4f679

@EnricoMi
Collaborator

@Tixxx looks like your fix works!

@Tixxx
Collaborator Author

Tixxx commented Apr 29, 2022

@Tixxx this is how @tgaddair fixed it for Ludwig: ludwig-ai/ludwig@1a4f679

Yeah, I tried that first, but we also need to support 18.04, which doesn't have wget installed by default; installing it requires apt-get update, which fails with the invalid key. My fix uses the alternative approach from NVIDIA, which hopefully works for both 18.04 and 20.04.

Signed-off-by: TJ <tix@uber.com>
@Tixxx Tixxx merged commit 3926a01 into master Apr 30, 2022
@Tixxx Tixxx added this to the v0.25.0 milestone May 2, 2022