
[Ray 2.3 release] air_benchmark_xgboost_cpu_10 fails #32068

Closed · cadedaniel opened this issue Jan 30, 2023 · 3 comments · Fixed by #32023

Labels: release-blocker (P0 Issue that blocks the release), P0 (Issues that should be fixed in short order)

Comments

@cadedaniel (Member) commented on Jan 30, 2023

https://buildkite.com/ray-project/release-tests-branch/builds/1314#0185f566-9ead-46d9-a99b-364205ea7520

First failure: RuntimeError: Batch prediction on XGBoost is taking 675.7125317330001 seconds, which is longer than expected (450 seconds).
Second failure: RuntimeError: Batch prediction on XGBoost is taking 675.956159504 seconds, which is longer than expected (450 seconds).

Both logs had spilling, which might indicate an issue.
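
For reference, the failure is the release test's own wall-clock guard tripping. A rough, hypothetical reconstruction of that kind of check follows (the 450-second threshold comes from the error message above; the function and constant names are made up and do not mirror the actual release-test code):

```python
import time

# Hypothetical reconstruction of the kind of timing guard that raises the
# RuntimeError quoted above; the actual release-test code may differ.
BATCH_PREDICTION_TIME_LIMIT_S = 450  # threshold taken from the error message

def timed_batch_prediction(run_batch_prediction):
    """Run the XGBoost batch-prediction step and fail if it blows the budget."""
    start = time.perf_counter()
    run_batch_prediction()
    elapsed = time.perf_counter() - start
    if elapsed > BATCH_PREDICTION_TIME_LIMIT_S:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {elapsed} seconds, "
            f"which is longer than expected ({BATCH_PREDICTION_TIME_LIMIT_S} seconds)."
        )
    return elapsed
```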

@cadedaniel added the release-blocker (P0 Issue that blocks the release) and P0 (Issues that should be fixed in short order) labels on Jan 30, 2023
@amogkam (Contributor) commented on Jan 30, 2023

Some spilling is expected for this test. It looks like the problem is that the Datasets ActorPool is not scaling up to more than 1 actor.

@clarkzinzow (Contributor) commented

Without max_tasks_in_flight capping (which was broken until this fix was merged), as soon as the first actor becomes ready, we’ll submit all queued tasks to that actor.
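
For context, a minimal sketch of what per-actor capping is supposed to do (illustrative only, not the actual Ray Data scheduler code; the names dispatch_bundles, max_tasks_in_flight_per_actor, and submit are assumptions):

```python
# Illustrative only -- not the actual Ray Data scheduler code.
def dispatch_bundles(bundle_queue, ready_actors, in_flight, max_tasks_in_flight_per_actor):
    """Hand queued bundles to ready actors without exceeding the per-actor cap."""
    for actor in ready_actors:
        # The cap keeps the first ready actor from draining the whole queue,
        # leaving work for actors that are still starting up.
        while bundle_queue and in_flight[actor] < max_tasks_in_flight_per_actor:
            actor.submit(bundle_queue.pop(0))  # hypothetical submission call
            in_flight[actor] += 1
```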

Here’s what might be happening (a toy simulation of this sequence follows the list):

  1. We add all inputs to the operator (these queue up in op._bundle_queue) and call op.inputs_done(), setting the op._inputs_done = True flag (code).
  2. When the first actor finishes starting, without max_tasks_in_flight capping, we submit every input bundle in the queue to that actor (code); op._bundle_queue is now empty, so we try to trigger scale-down (code).
  3. With op._inputs_done set and the bundle queue empty, when the last input bundle is submitted to that actor and scale-down is triggered, we’ll actually end up killing all inactive actors (code1, code2), including the other 9 pending actors (code).
  4. Finally, the first map task completes, and this is when the actor pool progress string first pops up (code).
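
To make the failure mode concrete, here is a self-contained toy simulation of the sequence above (nothing in it is actual Ray Data code; all names and work-item counts are illustrative). Without the cap, a single actor drains the queue and the remaining 9 pending actors are killed; with a cap, the queue stays non-empty and no premature scale-down happens:

```python
from collections import defaultdict

def simulate(num_bundles=100, num_actors=10, max_tasks_in_flight=None):
    bundle_queue = list(range(num_bundles))    # step 1: all inputs queued up front
    inputs_done = True                         # op.inputs_done() has been called
    pending_actors = [f"actor_{i}" for i in range(num_actors)]
    in_flight = defaultdict(int)

    # Step 2: the first actor finishes starting and is handed work from the queue.
    first_actor = pending_actors.pop(0)
    cap = max_tasks_in_flight if max_tasks_in_flight is not None else num_bundles
    while bundle_queue and in_flight[first_actor] < cap:
        bundle_queue.pop(0)
        in_flight[first_actor] += 1

    # Step 3: scale-down sees an empty queue with inputs done, and kills every
    # actor that has nothing in flight -- i.e. all actors still pending startup.
    killed = []
    if inputs_done and not bundle_queue:
        killed = [a for a in pending_actors if in_flight[a] == 0]

    return in_flight[first_actor], len(killed)

print(simulate())                        # uncapped: (100, 9) -- one actor gets all the work, 9 are killed
print(simulate(max_tasks_in_flight=4))   # capped: (4, 0) -- queue stays non-empty, no premature kill
```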

We should rerun this release test on latest master to confirm that the linked PR fixes this bug.

@zhe-thoughts linked a pull request on Jan 30, 2023 that will close this issue
@amogkam (Contributor) commented on Jan 31, 2023

Thanks @clarkzinzow! Confirmed this is passing on master: https://buildkite.com/ray-project/release-tests-branch/builds/1322#018604d8-3d6e-4bd8-9991-04e201d1a1cf

@amogkam closed this as completed on Jan 31, 2023