[Jobs] Nightly test - submitting job results in GCS crash #32367

Closed
shomilj opened this issue Feb 9, 2023 · 7 comments
Labels: bug, P0, release-blocker

Comments

@shomilj
Contributor

shomilj commented Feb 9, 2023

What happened + What you expected to happen

I've been switching the execution mode of nightly tests over to Jobs, and I think I gave Jobs more test coverage :) It seems like shuffle_20gb_with_state_api failed right after cluster startup: the GCS crashed when we attempted to submit a job. Here's the stack trace:

*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99aa6a) [0x55d08bc4da6a] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99c542) [0x55d08bc4f542] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99c857) [0x55d08bc4f857] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x316058) [0x55d08b5c9058] ray::gcs::GcsJobManager::HandleGetAllJobInfo()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x2a8168) [0x55d08b55b168] ray::gcs::GcsTable<>::GetAll()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x343fa0) [0x55d08b5f6fa0] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x56cad6) [0x55d08b81fad6] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x51ae9e) [0x55d08b7cde9e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x51b3f6) [0x55d08b7ce3f6] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8cbeb) [0x55d08bd3fbeb] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8de81) [0x55d08bd40e81] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8e0f0) [0x55d08bd410f0] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x16782e) [0x55d08b41a82e] main
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f45e4487083] __libc_start_main
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x1ade47) [0x55d08b460e47]

Logs from head node:
head-10.0.4.212-i-0086e889d3a72c5b1.zip

Link to cluster: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_ksaufjuihy7h6ww7abh5gwlqjh/clusters/ses_yzx1g7qa3jrnnhysv8akxz6hgg

Versions / Dependencies

Nightly

Reproduction script

Run shuffle_20gb_with_state_api against this PR: #32204

Issue Severity

High: It blocks me from completing my task.

@shomilj added the bug and triage labels Feb 9, 2023
@architkulkarni self-assigned this Feb 9, 2023
@architkulkarni added the release-blocker and P0 labels and removed the triage label Feb 9, 2023
@scv119
Contributor

scv119 commented Feb 9, 2023

nvm

@scv119
Contributor

scv119 commented Feb 9, 2023

[2023-02-07 19:53:34,399 I 260 260] (gcs_server) gcs_job_manager.cc:149: Getting all job info.
[2023-02-07 19:53:34,458 C 260 260] (gcs_server) gcs_job_manager.cc:191:  Check failed: job_api_data_keys.size() == job_data_key_to_index.size() 
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99aa6a) [0x55d08bc4da6a] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99c542) [0x55d08bc4f542] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99c857) [0x55d08bc4f857] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x316058) [0x55d08b5c9058] ray::gcs::GcsJobManager::HandleGetAllJobInfo()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x2a8168) [0x55d08b55b168] ray::gcs::GcsTable<>::GetAll()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x343fa0) [0x55d08b5f6fa0] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x56cad6) [0x55d08b81fad6] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x51ae9e) [0x55d08b7cde9e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x51b3f6) [0x55d08b7ce3f6] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8cbeb) [0x55d08bd3fbeb] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8de81) [0x55d08bd40e81] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8e0f0) [0x55d08bd410f0] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x16782e) [0x55d08b41a82e] main
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f45e4487083] __libc_start_main
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x1ade47) [0x55d08b460e47]

@scv119
Contributor

scv119 commented Feb 9, 2023

so it's likely introduced by b2c5e63

@cadedaniel
Member

We suspect #32213 is caused by the same commit

@shomilj
Contributor Author

shomilj commented Feb 9, 2023

Also saw this in ray-data-bulk-ingest-file-size-benchmark - https://buildkite.com/ray-project/release-tests-pr/builds/27704#01862ef5-c730-4ca5-978c-13192ff794b4

[2023-02-07 19:30:45,637 I 260 260] (gcs_server) gcs_job_manager.cc:42: Adding job, job id = 03000000, driver pid = 5845
[2023-02-07 19:30:45,637 I 260 260] (gcs_server) gcs_job_manager.cc:57: Finished adding job, job id = 03000000, driver pid = 5845
[2023-02-07 19:30:45,986 I 260 260] (gcs_server) gcs_job_manager.cc:149: Getting all job info.
[2023-02-07 19:30:46,023 C 260 260] (gcs_server) gcs_job_manager.cc:191:  Check failed: job_api_data_keys.size() == job_data_key_to_index.size()
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99aa6a) [0x560b99afca6a] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99c542) [0x560b99afe542] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x99c857) [0x560b99afe857] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x316058) [0x560b99478058] ray::gcs::GcsJobManager::HandleGetAllJobInfo()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x2a8168) [0x560b9940a168] ray::gcs::GcsTable<>::GetAll()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x343fa0) [0x560b994a5fa0] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x56cad6) [0x560b996cead6] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x51ae9e) [0x560b9967ce9e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x51b3f6) [0x560b9967d3f6] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8cbeb) [0x560b99beebeb] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8de81) [0x560b99befe81] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xa8e0f0) [0x560b99bf00f0] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x16782e) [0x560b992c982e] main
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fb465ebc083] __libc_start_main
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x1ade47) [0x560b9930fe47]

@architkulkarni
Contributor

Thanks for the help with the investigation! If the check failure "Check failed: job_api_data_keys.size() == job_data_key_to_index.size()" is the root cause of both release blockers, the fix might be pretty straightforward. I'll continue to investigate.
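
For illustration only, here is a minimal sketch of one generic way a size check between a key list and a key-to-index map can fail: if the list contains a duplicate key, the map collapses the duplicates into a single entry and ends up smaller. This is an assumption about how such a check might trip, not the confirmed root cause in this thread. The helper names BuildKeyToIndex and SizesMatch and the sample keys are hypothetical and are not taken from gcs_job_manager.cc.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not Ray code): maps each key to its position in `keys`.
// If a key appears more than once, the later position overwrites the earlier
// one, so the map ends up with fewer entries than the vector.
std::unordered_map<std::string, std::size_t> BuildKeyToIndex(
    const std::vector<std::string> &keys) {
  std::unordered_map<std::string, std::size_t> key_to_index;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    key_to_index[keys[i]] = i;
  }
  return key_to_index;
}

// Hypothetical helper (not Ray code): returns true when every key in `keys`
// got its own map entry, i.e. when a size-equality check would pass.
bool SizesMatch(const std::vector<std::string> &keys) {
  return keys.size() == BuildKeyToIndex(keys).size();
}
```

With distinct keys, SizesMatch returns true. With a repeated key, the vector has two elements but the map has one, so a size-equality check like the one in the log above would fail.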

@architkulkarni
Copy link
Contributor

Same issue as #32213, moving discussion there


4 participants