Add an option to pin to gpu for all estimators #3526
Conversation
Signed-off-by: TJ <tix@uber.com>
Unit Test Results (with flaky tests): 905 files (−24), 905 suites (−24), 9h 54m 52s ⏱️ (+7m 34s). Results for commit 13c2b04. ± Comparison against base commit 94cd856. ♻️ This comment has been updated with latest results.
A generic param of the Spark Estimator can be put in EstimatorParams, with a getter and setter added there as well.
https://github.com/horovod/horovod/blob/master/horovod/spark/common/params.py#L56
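To make the suggestion concrete, here is a minimal hedged sketch of what such a shared param with getter and setter could look like in the pyspark.ml Params style that params.py follows; the class name, docstring, and default value are illustrative assumptions, not the merged code:

```python
# Hypothetical sketch only -- shows the pyspark.ml Params pattern used by
# EstimatorParams; names and the default value are assumptions.
from pyspark.ml.param import Param, Params, TypeConverters


class GpuPinningParamsSketch(Params):
    use_gpu = Param(Params._dummy(), 'use_gpu',
                    'whether each training process should be pinned to a GPU',
                    typeConverter=TypeConverters.toBoolean)

    def __init__(self):
        super(GpuPinningParamsSketch, self).__init__()
        self._setDefault(use_gpu=True)  # assumed default

    def setUseGpu(self, value):
        return self._set(use_gpu=value)

    def getUseGpu(self):
        return self.getOrDefault(self.use_gpu)
```

Usage would then mirror the other params in that file, e.g. `est.setUseGpu(False)` or passing `use_gpu=False` through the estimator constructor.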
Signed-off-by: TJ <tix@uber.com>
LGTM
@EnricoMi Can you help do a sanity check as well?
I have set this to draft; press "Ready for review" when you are happy to see this merged.
The fix for the Horovod incompatibility on Ray Tune's side was checked in just now. I'm running the integration test and will merge this once that's done.
Some minor improvements are possible.
@@ -194,7 +195,10 @@ def on_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -
          f"Val rows: {val_rows}, Val batch size: {val_batch_size}, Val_steps_per_epoch: {_val_steps_per_epoch}\n"
          f"Checkpoint file: {remote_store.checkpoint_path}, Logs dir: {remote_store.logs_path}\n")

+     cuda_available = torch.cuda.is_available()
+     if not should_pin_gpu and verbose:
+         print("Skip pinning current process to the GPU.")
Why not use the logger? Why is there a verbose flag when there is a logger?
The logger doesn't write to stdout and stderr properly in this function since it runs in a Ray executor. The train_logger is for passing specialized loggers (ones that don't write to stdout and stderr directly) to PyTorch Lightning. I have tried using some generic logger modules here; they either failed to serialize or produced no output.
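As a rough illustration of that split (not the estimator's actual wiring; the TensorBoardLogger and its save_dir are assumptions): a picklable, file-backed logger goes to the Lightning Trainer, while plain print() is used for progress messages because it reliably reaches the Ray worker's stdout.

```python
# Illustration only: a specialized logger for Lightning, print() for progress.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

train_logger = TensorBoardLogger(save_dir="/tmp/lightning_logs")  # assumed path
trainer = Trainer(logger=train_logger, max_epochs=1)
print("Trainer configured with a specialized train_logger.")
```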
Signed-off-by: TJ <tix@uber.com>
rename pin_gpu to use_gpu
Signed-off-by: TJ <tix@uber.com>
Some builds are failing.
@EnricoMi Do you know if this is because NVIDIA's key expired, or do we need to refresh something on our side?
The integration test passed on my side. Changing this back to the ready state.
@Tixxx This is something NVIDIA has to fix on their side. @romerojosh @nvcastet NVIDIA's CUDA Docker image cannot run apt-get update because of the expired repository key.
Is this something you can escalate?
LGTM! Please wait to merge until CI is fixed.
Signed-off-by: TJ <tix@uber.com>
Looks like NVIDIA rolled out a key-rotation mechanism. I will try to fix it by downloading the corresponding keys.
@Tixxx this is how @tgaddair fixed it for Ludwig: ludwig-ai/ludwig@1a4f679
@Tixxx looks like your fix works!
Yeah, I tried that first, but we also need to support 18.04, which doesn't have wget installed by default. Installing it requires apt-get update, which fails with the invalid key. My fix is the alternative approach from NVIDIA, which hopefully works for both 18.04 and 20.04.
Signed-off-by: TJ <tix@uber.com>
Signed-off-by: TJ <tix@uber.com>
Checklist before submitting
Description
Sometimes the estimator is created on a GPU host, which initializes the model weights on one device. Pinning to a GPU again in the remote trainer can then produce strange behaviors, such as a "Variable resource not found" error in TensorFlow, because TF thinks the model weights were initialized on a different device.
Fixes #3524
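As a hedged usage sketch of the new option on a Spark estimator (the keyword follows the pin_gpu → use_gpu rename above; the model, columns, and hyperparameters are placeholders, and the final argument spelling should be checked against the merged API):

```python
# Hypothetical usage sketch -- values are placeholders, not a tested config.
import torch
import horovod.spark.torch as hvd_spark_torch

model = torch.nn.Linear(10, 1)                      # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.functional.mse_loss

est = hvd_spark_torch.TorchEstimator(
    num_proc=4,
    model=model,
    optimizer=optimizer,
    loss=loss_fn,
    feature_cols=['features'],
    label_cols=['label'],
    batch_size=32,
    epochs=1,
    use_gpu=False,   # new option from this PR: skip pinning remote workers to GPUs
)
# In practice a store= (and usually backend=) argument is also needed before
# calling est.fit(train_df) on a Spark DataFrame prepared elsewhere.
```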
Review process to land