[Ray 2.3 Release] Common failure in a few release tests: "failed to connect to all addresses; last error: UNKNOWN" #32213

Closed
cadedaniel opened this issue Feb 3, 2023 · 13 comments · Fixed by #32388
Labels: bug (Something that is supposed to be working; but isn't) · P0 (Issues that should be fixed in short order) · release-blocker (P0 Issue that blocks the release)

cadedaniel commented Feb 3, 2023

The following stack trace appears consistently in these tests:

  • air_benchmark_xgboost_cpu_10
  • lightgbm_distributed_api_test
  • xgboost_distributed_api_test

BuildKite: https://buildkite.com/ray-project/release-tests-branch/builds/1351#018618c9-8c0d-4102-9420-c5cd8eb29b3d

Traceback (most recent call last):
  File "ray_release/scripts/run_release_test.py", line 168, in <module>
    main()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "ray_release/scripts/run_release_test.py", line 153, in main
    no_terminate=no_terminate,
  File "/tmp/release-B1APUcfhbQ/release/ray_release/glue.py", line 404, in run_release_test
    raise pipeline_exception
  File "/tmp/release-B1APUcfhbQ/release/ray_release/glue.py", line 311, in run_release_test
    command, env=command_env, timeout=command_timeout
  File "/tmp/release-B1APUcfhbQ/release/ray_release/command_runner/job_runner.py", line 115, in run_command
    full_command, full_env, working_dir=".", timeout=int(timeout)
  File "/tmp/release-B1APUcfhbQ/release/ray_release/job_manager.py", line 113, in run_and_wait
    return self._wait_job(cid, timeout)
  File "/tmp/release-B1APUcfhbQ/release/ray_release/job_manager.py", line 92, in _wait_job
    status = self._get_job_status_with_retry(command_id)
  File "/tmp/release-B1APUcfhbQ/release/ray_release/job_manager.py", line 69, in _get_job_status_with_retry
    max_retries=3,
  File "/tmp/release-B1APUcfhbQ/release/ray_release/util.py", line 119, in exponential_backoff_retry
    return f()
  File "/tmp/release-B1APUcfhbQ/release/ray_release/job_manager.py", line 66, in <lambda>
    lambda: job_client.get_job_status(self.job_id_pool[command_id]),
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/sdk.py", line 396, in get_job_status
    return self.get_job_info(job_id).status
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/sdk.py", line 331, in get_job_info
    self._raise_error(r)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 262, in _raise_error
    f"Request failed with status code {r.status_code}: {r.text}."
RuntimeError: Request failed with status code 500: {"result": false, "msg": "Traceback (most recent call last):\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/optional_utils.py\", line 95, in _handler_route\n    return await handler(bind_info.instance, req)\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/job_head.py\", line 391, in get_job_info\n    job_or_submission_id,\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/utils.py\", line 209, in find_job_by_ids\n    driver_jobs, submission_job_drivers = await get_driver_jobs(gcs_aio_client)\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/utils.py\", line 154, in get_driver_jobs\n    reply = await gcs_aio_client.get_all_job_info()\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/gcs_utils.py\", line 164, in wrapper\n    return await f(self, *args, **kwargs)\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/gcs_utils.py\", line 556, in get_all_job_info\n    reply = await self._job_info_stub.GetAllJobInfo(req, timeout=timeout)\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py\", line 291, in __await__\n    self._cython_call._status)\ngrpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses; last error: UNKNOWN: ipv4:172.31.254.44:9031: Failed to connect to remote host: Connection refused\"\n\tdebug_error_string = \"UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:172.31.254.44:9031: Failed to connect to remote host: Connection refused {created_time:\"2023-02-03T12:04:49.142935439-08:00\", grpc_status:14}\"\n>\n", "data": {}}.

EDIT(archit): here's the last line pretty-printed by chatgpt for readability

{
    "result": false,
    "msg": "Traceback (most recent call last):
  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/optional_utils.py\", line 95, in _handler_route
    return await handler(bind_info.instance, req)
  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/job_head.py\", line 391, in get_job_info
    job_or_submission_id,
  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/utils.py\", line 209, in find_job_by_ids
    driver_jobs, submission_job_drivers = await get_driver_jobs(gcs_aio_client)
  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/utils.py\", line 154, in get_driver_jobs
    reply = await gcs_aio_client.get_all_job_info()
  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/gcs_utils.py\", line 164, in wrapper
    return await f(self, *args, **kwargs)
  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/gcs_utils.py\", line 556, in get_all_job_info
    reply = await self._job_info_stub.GetAllJobInfo(req, timeout=timeout)
  File \"/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py\", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = \"failed to connect to all addresses; last error: UNKNOWN: ipv4:172.31.254.44:9031: Failed to connect to remote host: Connection refused\"
    debug_error_string = \"UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:172.31.254.44:9031: Failed to connect to remote host: Connection refused {created_time:\"2023-02-03T12:04:49.142935439-08:00\", grpc_status:14}\"
>
",
    "data": {}
}
cadedaniel added the release-blocker and P0 labels on Feb 3, 2023
scv119 commented Feb 3, 2023

looks like something related to jobs... hmm


@zhe-thoughts

Thanks for catching this, @cadedaniel. Since when have we been seeing these errors?

@cadedaniel

Will begin bisecting tomorrow!

@zhe-thoughts

Thanks @cadedaniel! Assigning to you for now.

architkulkarni commented Feb 6, 2023

@cadedaniel I think it's probably #31046, since it modifies GetAllJobInfo; if that's the case, feel free to unassign yourself.

@iycheng The error message says StatusCode.UNAVAILABLE for reply = await self._job_info_stub.GetAllJobInfo(req, timeout=timeout). In the PR, the change was to call InternalKV inside GetAllJobInfo (previously it wasn't called). Any ideas what might be causing UNAVAILABLE? I know we call InternalKV in many other places but don't see this kind of error.

@architkulkarni

Same error as #32367 in all three tests:

[2023-02-03 12:04:40,254 C 172 172] (gcs_server) gcs_job_manager.cc:191:  Check failed: job_api_data_keys.size() == job_data_key_to_index.size() 
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99aa6a) [0x56494a9c3a6a] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99c542) [0x56494a9c5542] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99c857) [0x56494a9c5857] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x316058) [0x56494a33f058] ray::gcs::GcsJobManager::HandleGetAllJobInfo()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x2a8168) [0x56494a2d1168] ray::gcs::GcsTable<>::GetAll()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x343fa0) [0x56494a36cfa0] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x56cad6) [0x56494a595ad6] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x51ae9e) [0x56494a543e9e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x51b3f6) [0x56494a5443f6] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8cbeb) [0x56494aab5beb] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8de81) [0x56494aab6e81] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8e0f0) [0x56494aab70f0] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x16782e) [0x56494a19082e] main
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f683fe30083] __libc_start_main
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x1ade47) [0x56494a1d6e47]

So it's the same issue. I'll close that issue so we can consolidate the tracking here.

zhe-thoughts added the bug label on Feb 9, 2023
@architkulkarni

The only way I can see the check failing is if there are two jobs in the job table with the same submission_id. I'm not sure how that's possible, and I haven't been able to reproduce it locally. Rerunning the release tests with more debug logs here should shed more light on the issue:
https://buildkite.com/ray-project/release-tests-pr/builds/27963#_
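
For context, here is a rough Python analogue of the invariant encoded by the failing RAY_CHECK in gcs_job_manager.cc (illustrative only, not the actual GCS code; the job-table values below are made up):

# Each job-table entry is a driver; submission_id is set when the driver
# was started through the Ray Job API.
job_table = [
    {"job_id": "01000000", "submission_id": "raysubmit_xyz"},  # first driver
    {"job_id": "02000000", "submission_id": "raysubmit_xyz"},  # second driver, same submission
]

# Rough analogue of the two containers whose sizes the RAY_CHECK compares.
job_api_data_keys = [job["submission_id"] for job in job_table if job["submission_id"]]
job_data_key_to_index = {job["submission_id"]: i for i, job in enumerate(job_table) if job["submission_id"]}

# With a duplicated submission_id the dict deduplicates, so the sizes differ
# (2 vs 1) and the GCS equivalent of this equality check fails, crashing gcs_server.
print(len(job_api_data_keys), len(job_data_key_to_index))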

@architkulkarni

Local reproduction:

ray start --head
ray job submit -- python -c "import ray; ray.init(); ray.shutdown(); ray.init(); ray.shutdown();"

Each job in the job table corresponds to a driver, so it's possible that more than one of them could have come from the same Ray Job API job submission, as in this example.
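
(For reference, a sketch of the same repro driven from the Job SDK, assuming a local cluster started with ray start --head and the default dashboard address; on an affected build the GCS crashes while serving GetAllJobInfo, and the status poll surfaces the 500 / UNAVAILABLE error shown above.)

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# Two init/shutdown cycles in one entrypoint -> two drivers sharing one submission_id.
submission_id = client.submit_job(
    entrypoint='python -c "import ray; ray.init(); ray.shutdown(); ray.init(); ray.shutdown()"'
)

# On an affected build this eventually fails with
# "Request failed with status code 500 ... StatusCode.UNAVAILABLE".
print(client.get_job_status(submission_id))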

architkulkarni commented Feb 10, 2023

Indeed, air_benchmark_xgboost_cpu_10, lightgbm_distributed_api_test and xgboost_distributed_api_test contain ray.shutdown() and multiple ray.init() calls.

(For air_benchmark I only see ray.shutdown(), but I assume more than one ray.init() is being called internally by Ray AIR library functions.)

@zhe-thoughts

Quick question: should we consider reverting #31046?

@cadedaniel

Quick question: should we consider reverting #31046?

We're planning on fixing forward with #32388.

scv119 pushed a commit that referenced this issue Feb 10, 2023
…GetAllJobInfo endpoint (#32388)

The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen.

This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR.

Related issue number
Closes #32213
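
(A hedged sketch of the shape of that fix, in Python for illustration only; the actual change is in the C++ GCS and the helper below is hypothetical: instead of assuming a one-to-one mapping from submission_id to driver, group the job-table entries by submission_id and attach the submission data to every driver in the group.)

from collections import defaultdict

def attach_submission_info(job_table_entries, submission_info_by_id):
    # Hypothetical analogue: tolerate several drivers sharing one submission_id.
    drivers_by_submission_id = defaultdict(list)
    for entry in job_table_entries:
        sid = entry.get("submission_id")
        if sid:
            drivers_by_submission_id[sid].append(entry)

    for sid, drivers in drivers_by_submission_id.items():
        info = submission_info_by_id.get(sid)
        for driver in drivers:
            # No assertion that len(drivers) == 1; every driver spawned by the
            # same submission simply gets the same submission info.
            driver["job_info"] = info
    return job_table_entries
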
zhe-thoughts reopened this Feb 10, 2023
@zhe-thoughts

Great, it's merged into master! Keeping this open until the cherry-pick.

cadedaniel pushed a commit to cadedaniel/ray that referenced this issue Feb 10, 2023
…GetAllJobInfo endpoint (ray-project#32388)

cadedaniel added a commit that referenced this issue Feb 10, 2023
…GetAllJobInfo endpoint (#32388) (#32426)

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
edoakes pushed a commit to edoakes/ray that referenced this issue Mar 22, 2023
…GetAllJobInfo endpoint (ray-project#32388)
