[Ray 2.3 Release] Common failure in a few release tests failed to connect to all addresses; last error: UNKNOWN
#32213
Comments
Looks like something related to jobs... hmm
Thanks for catching this, @cadedaniel. Since when have we been seeing the errors?
Will begin bisecting tomorrow!
Thanks @cadedaniel! Assigning to you for now.
@cadedaniel I think it's probably #31046 because it modifies @iycheng The error message says
Same error as #32367 in all three tests:
So it's the same issue. Will close that issue so we can consolidate the tracking here.
The only way I can see the check failing is if there are two jobs in the job table with the same
Local reproduction:
Each job in the job table corresponds to a driver, so more than one of them can come from the same Ray Job API job submission, as in this example.
Indeed, (For
Quick question: should we consider reverting #31046?
…GetAllJobInfo endpoint (#32388) The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen. This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR. Related issue number Closes #32213
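To illustrate the failure mode described in the commit message, here is a minimal pure-Python sketch. It is not the GCS code (the real check is a RAY_CHECK in the GetAllJobInfo endpoint); `JobTableEntry`, `check_one_job_per_submission`, and `group_by_submission_id` are hypothetical names invented for this example, and the IDs are made up. It only models the invariant "number of jobs equals number of submission_ids seen" and one way to tolerate several drivers sharing a submission_id.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class JobTableEntry:
    """Hypothetical stand-in for one job table row (one driver)."""
    job_id: str
    # None models a driver that was not started through the Ray Job API.
    submission_id: Optional[str]


def check_one_job_per_submission(jobs: List[JobTableEntry]) -> int:
    """Models the broken assumption: every submitted job has a unique submission_id."""
    submitted = [j for j in jobs if j.submission_id is not None]
    seen = {j.submission_id for j in submitted}
    # Analogous in spirit to the failing check: the counts must match.
    if len(seen) != len(submitted):
        raise AssertionError(
            f"{len(submitted)} jobs but only {len(seen)} distinct submission_ids"
        )
    return len(seen)


def group_by_submission_id(jobs: List[JobTableEntry]) -> Dict[str, List[str]]:
    """Tolerant alternative: map each submission_id to all of its job_ids."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for job in jobs:
        if job.submission_id is not None:
            grouped[job.submission_id].append(job.job_id)
    return dict(grouped)


# Two drivers from one submission, e.g. an entrypoint that calls ray.init() twice.
jobs = [
    JobTableEntry(job_id="01000000", submission_id="raysubmit_example"),
    JobTableEntry(job_id="02000000", submission_id="raysubmit_example"),
    JobTableEntry(job_id="03000000", submission_id=None),
]
```

With this input, `check_one_job_per_submission(jobs)` raises because two jobs share one submission_id, while `group_by_submission_id(jobs)` returns a mapping from the single submission_id to both job IDs.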
Great, it's merged into master! Keeping this open until the cherry-pick.
…GetAllJobInfo endpoint (#32388) (#32426) The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen. This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR. Related issue number Closes #32213 Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
The following stacktrace appears consistently in the following tests:
BuildKite: https://buildkite.com/ray-project/release-tests-branch/builds/1351#018618c9-8c0d-4102-9420-c5cd8eb29b3d
EDIT(archit): here's the last line pretty-printed by chatgpt for readability