
[Ray 2.3 release] air_benchmark_xgboost_cpu_10 fails #32068

Closed · cadedaniel opened this issue Jan 30, 2023 · 3 comments · Fixed by #32023

Labels: release-blocker (P0 Issue that blocks the release), P0 (Issues that should be fixed in short order)

Comments

@cadedaniel (Member) commented on Jan 30, 2023

https://buildkite.com/ray-project/release-tests-branch/builds/1314#0185f566-9ead-46d9-a99b-364205ea7520

First failure: RuntimeError: Batch prediction on XGBoost is taking 675.7125317330001 seconds, which is longer than expected (450 seconds).
Second failure: RuntimeError: Batch prediction on XGBoost is taking 675.956159504 seconds, which is longer than expected (450 seconds).

Both logs had spilling, which might indicate an issue.
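
For reference, the failure is the release test's own wall-clock guard tripping. A rough, hypothetical reconstruction of that kind of check follows (the 450-second threshold comes from the error message above; the function and constant names are made up and do not mirror the actual release-test code):

```python
import time

# Hypothetical reconstruction of the kind of timing guard that raises the
# RuntimeError quoted above; the actual release-test code may differ.
BATCH_PREDICTION_TIME_LIMIT_S = 450  # threshold taken from the error message

def timed_batch_prediction(run_batch_prediction):
    """Run the XGBoost batch-prediction step and fail if it blows the budget."""
    start = time.perf_counter()
    run_batch_prediction()
    elapsed = time.perf_counter() - start
    if elapsed > BATCH_PREDICTION_TIME_LIMIT_S:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {elapsed} seconds, "
            f"which is longer than expected ({BATCH_PREDICTION_TIME_LIMIT_S} seconds)."
        )
    return elapsed
```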

@cadedaniel added the release-blocker (P0 Issue that blocks the release) and P0 (Issues that should be fixed in short order) labels on Jan 30, 2023
@amogkam (Contributor) commented on Jan 30, 2023

Some spilling is expected for this test. It looks like the problem is that the Datasets ActorPool is not scaling up to more than 1 actor.

@clarkzinzow (Contributor) commented

Without max_tasks_in_flight capping (which was broken until this fix was merged), as soon as the first actor becomes ready, we’ll submit all queued tasks to that actor.
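
For context, a minimal sketch of what per-actor capping is supposed to do (illustrative only, not the actual Ray Data scheduler code; the names dispatch_bundles, max_tasks_in_flight_per_actor, and submit are assumptions):

```python
# Illustrative only -- not the actual Ray Data scheduler code.
def dispatch_bundles(bundle_queue, ready_actors, in_flight, max_tasks_in_flight_per_actor):
    """Hand queued bundles to ready actors without exceeding the per-actor cap."""
    for actor in ready_actors:
        # The cap keeps the first ready actor from draining the whole queue,
        # leaving work for actors that are still starting up.
        while bundle_queue and in_flight[actor] < max_tasks_in_flight_per_actor:
            actor.submit(bundle_queue.pop(0))  # hypothetical submission call
            in_flight[actor] += 1
```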

Here’s what might be happening (a toy simulation of this sequence follows the list):

  1. We add all inputs to the operator (these queue up in op._bundle_queue) and call op.inputs_done(), setting the op._inputs_done = True flag (code).
  2. When the first actor finishes starting, without max_tasks_in_flight capping, we submit every input bundle in the queue to that actor (code); op._bundle_queue is now empty, so we try to trigger scale-down (code).
  3. With op._inputs_done set and the bundle queue empty, when the last input bundle is submitted to that actor and scale-down is triggered, we’ll actually end up killing all inactive actors (code1, code2), including the other 9 pending actors (code).
  4. Finally, the first map task completes, and this is when the actor pool progress string first pops up (code).
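
To make the failure mode concrete, here is a self-contained toy simulation of the sequence above (nothing in it is actual Ray Data code; all names and work-item counts are illustrative). Without the cap, a single actor drains the queue and the remaining 9 pending actors are killed; with a cap, the queue stays non-empty and no premature scale-down happens:

```python
from collections import defaultdict

def simulate(num_bundles=100, num_actors=10, max_tasks_in_flight=None):
    bundle_queue = list(range(num_bundles))    # step 1: all inputs queued up front
    inputs_done = True                         # op.inputs_done() has been called
    pending_actors = [f"actor_{i}" for i in range(num_actors)]
    in_flight = defaultdict(int)

    # Step 2: the first actor finishes starting and is handed work from the queue.
    first_actor = pending_actors.pop(0)
    cap = max_tasks_in_flight if max_tasks_in_flight is not None else num_bundles
    while bundle_queue and in_flight[first_actor] < cap:
        bundle_queue.pop(0)
        in_flight[first_actor] += 1

    # Step 3: scale-down sees an empty queue with inputs done, and kills every
    # actor that has nothing in flight -- i.e. all actors still pending startup.
    killed = []
    if inputs_done and not bundle_queue:
        killed = [a for a in pending_actors if in_flight[a] == 0]

    return in_flight[first_actor], len(killed)

print(simulate())                        # uncapped: (100, 9) -- one actor gets all the work, 9 are killed
print(simulate(max_tasks_in_flight=4))   # capped: (4, 0) -- queue stays non-empty, no premature kill
```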

We should rerun this release test on latest master to confirm that the linked PR fixes this bug.

@zhe-thoughts linked a pull request on Jan 30, 2023 that will close this issue
@amogkam (Contributor) commented on Jan 31, 2023

Thanks @clarkzinzow! Confirmed this is passing on master: https://buildkite.com/ray-project/release-tests-branch/builds/1322#018604d8-3d6e-4bd8-9991-04e201d1a1cf

@amogkam closed this as completed on Jan 31, 2023