First failure: RuntimeError: Batch prediction on XGBoost is taking 675.7125317330001 seconds, which is longer than expected (450 seconds).
Second failure: RuntimeError: Batch prediction on XGBoost is taking 675.956159504 seconds, which is longer than expected (450 seconds).
Both logs had spilling, which might indicate an issue.
Without max_tasks_in_flight capping (which was broken until this fix was merged), as soon as the first actor becomes ready, we’ll submit all queued tasks to that actor.
Here’s what might be happening:
We add all inputs to the operator (these queue up in op._bundle_queue) and call op.inputs_done(), which sets the op._inputs_done flag to True (code).
When the first actor finishes starting, without max_tasks_in_flight capping, we submit every input bundle in the queue to that actor (code); op._bundle_queue is now empty, so we try to trigger scale-down (code).
Because op._inputs_done is set and the bundle queue is empty once the last input bundle has been submitted to that actor, the triggered scale-down ends up killing all inactive actors (code1, code2), including the other 9 pending actors (code).
Finally, the first map task completes, and this is when the actor pool progress string first pops up (code).
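The sequence above can be reproduced with a toy model. This is only a sketch of the suspected behavior, not Ray's actual operator code: ToyActorPool, dispatch, and simulate are hypothetical names, the count of 40 bundles is made up, and tasks never complete in the model. It shows how an uncapped first actor drains the whole queue, which then lets scale-down kill the 9 pending actors.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyActorPool:
    """Toy stand-in for an actor pool; hypothetical, not Ray's implementation."""
    num_pending: int                               # actors still starting up
    max_tasks_in_flight: float = float("inf")      # per-actor cap; inf = no cap
    in_flight: dict = field(default_factory=dict)  # actor id -> submitted tasks
    killed_pending: int = 0                        # pending actors killed so far

    def actor_ready(self):
        # Move one actor from "pending" to "running".
        actor_id = len(self.in_flight)
        self.in_flight[actor_id] = 0
        self.num_pending -= 1
        return actor_id

    def pick_actor(self):
        # Least-loaded running actor that is still below the cap, if any.
        free = [a for a, n in self.in_flight.items() if n < self.max_tasks_in_flight]
        return min(free, key=self.in_flight.get) if free else None

    def kill_pending(self):
        # Scale-down: drop every actor that has not started yet.
        self.killed_pending += self.num_pending
        self.num_pending = 0


def dispatch(pool, queue, inputs_done):
    # Submit queued bundles while some actor has capacity.
    while queue:
        actor = pool.pick_actor()
        if actor is None:
            break
        queue.popleft()
        pool.in_flight[actor] += 1
    # Inputs are done and nothing is queued: trigger scale-down.
    if inputs_done and not queue:
        pool.kill_pending()


def simulate(num_actors, num_bundles, cap):
    pool = ToyActorPool(num_pending=num_actors, max_tasks_in_flight=cap)
    queue = deque(range(num_bundles))  # all inputs queued before any actor is ready
    while pool.num_pending > 0:
        pool.actor_ready()
        dispatch(pool, queue, inputs_done=True)
    return pool


uncapped = simulate(num_actors=10, num_bundles=40, cap=float("inf"))
capped = simulate(num_actors=10, num_bundles=40, cap=4)
print(uncapped.in_flight, uncapped.killed_pending)
print(capped.in_flight, capped.killed_pending)
```

In the uncapped run, actor 0 receives all 40 bundles and the other 9 actors are killed while still pending, so the work is serialized on one actor. With a cap of 4, the queue is never empty until the last actor starts, so all 10 actors end up with work.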
We should rerun this release test on latest master to confirm that the linked PR fixes this bug.
https://buildkite.com/ray-project/release-tests-branch/builds/1314#0185f566-9ead-46d9-a99b-364205ea7520