[AIR] Bump start timeout for Horovod Release Tests (#33083)
Horovod on Ray release tests have been timing out during the initial rendezvous. This PR bumps the timeout configuration from 30 seconds to 120 seconds.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
amogkam authored Mar 7, 2023
1 parent c0f6068 commit e0fda4a
Showing 2 changed files with 2 additions and 1 deletion.
1 change: 1 addition & 0 deletions release/ml_user_tests/horovod/horovod_user_test.py
@@ -22,6 +22,7 @@
     num_workers=6,
     use_gpu=True,
     placement_group_timeout_s=2000,
+    timeout_s=120,
     kwargs={"num_epochs": 20},
 )

2 changes: 1 addition & 1 deletion release/nightly_tests/dataset/pipelined_training.py
@@ -327,7 +327,7 @@ def consume(split, rank=None, batch_size=None):
         ray.get(tasks)
     else:
         print("Create Ray executor")
-        settings = RayExecutor.create_settings(timeout_s=30)
+        settings = RayExecutor.create_settings(timeout_s=120)
         executor = RayExecutor(settings, num_workers=args.num_workers, use_gpu=True)
         executor.start()
         executor.run(train_main, args=[args, splits])
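
Note for reviewers: the sketch below illustrates what a start timeout like `timeout_s` guards against, namely a coordinator that gives up if not all workers have checked in by a deadline. It is a standalone simulation under stated assumptions, not Horovod's or Ray's actual rendezvous code; `wait_for_workers` and `SlowCluster` are hypothetical names introduced only for this illustration.

```python
import time


def wait_for_workers(count_ready, num_workers, timeout_s=30.0, poll_interval_s=0.1):
    """Poll count_ready() until num_workers are ready, or raise TimeoutError.

    Hypothetical helper for illustration; not part of Horovod or Ray.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        ready = count_ready()
        if ready >= num_workers:
            return ready
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {ready}/{num_workers} workers ready after {timeout_s}s"
            )
        time.sleep(poll_interval_s)


class SlowCluster:
    """Simulated cluster in which one more worker becomes ready on each poll."""

    def __init__(self):
        self.ready = 0

    def count_ready(self):
        self.ready += 1
        return self.ready
```

With a larger `timeout_s`, a slow-starting cluster gets more polls before the coordinator gives up, which is the effect this commit is after when it raises the value from 30 to 120.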
