Enable GRPC based resource reporting #19438

Closed
fishbone opened this issue Oct 18, 2021 · 9 comments · Fixed by #20119
@fishbone

Related PR: #16910

@fishbone fishbone added the P1 Issue that should be fixed within a few weeks label Oct 18, 2021
@fishbone fishbone added this to the GCS HA - Redis Removal milestone Oct 18, 2021
@fishbone fishbone self-assigned this Oct 18, 2021

fishbone commented Oct 18, 2021

[2021-10-18 21:20:50,244 E 106 106] event.cc:125: [ Event 8386b9dad797b9194ba76290d8fb4781d654 {"node_id":"080f8e7415b449d5efbb3366658c9528e677ef395a4cda1ab2d285b1"} ] src/ray/raylet/scheduling/cluster_task_manager.cc:1060: Check failed: node_info_opt Spilling back to a node manager, but no GCS info found for node 76476cfe25c32615b1aa46c3e7a01bcb31222370d77cdc3c2ec6596f
*** StackTrace Information ***
    ray::GetCallTrace()
    ray::SpdLogMessage::Flush()
    ray::SpdLogMessage::~SpdLogMessage()
    ray::RayLog::~RayLog()
    ray::raylet::ClusterTaskManager::Spillback()
    ray::raylet::ClusterTaskManager::SpillWaitingTasks()
    ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks()
    ray::raylet::NodeManager::UpdateResourceUsage()
    ray::raylet::NodeManager::HandleUpdateResourceUsage()
    ray::rpc::ServerCallImpl<>::HandleRequestImpl()
    ray::rpc::ServerCallImpl<>::HandleRequest()::{lambda()#1}::operator()()
    std::__invoke_impl<>()
    std::__invoke_r<>()
    std::_Function_handler<>::_M_invoke()
    std::function<>::operator()()
    boost::asio::asio_handler_invoke<>()
    boost_asio_handler_invoke_helpers::invoke<>()
    boost::asio::detail::handler_work<>::complete<>()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler_operation::complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main


fishbone commented Oct 18, 2021

[2021-10-18 02:12:19,711 I 166428 166645] service_based_accessor.cc:611: Received notification for node id = bd7459ba715e7054c4ed5bca59ccc86f6fed568867eeb903880be701, IsAlive = 1
[2021-10-18 02:12:20,694 I 166428 166645] core_worker.cc:783: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=IDLE_EXIT
[2021-10-18 02:12:20,694 I 166428 166428] core_worker.cc:333: Removed worker e24f8ea18a6759f1adf50cb29a1bda9cb7544318503c088b050e6f73
[2021-10-18 02:12:20,707 I 166428 166428] core_worker.cc:177: Destructing CoreWorkerProcess. pid: 166428
[2021-10-18 02:12:20,707 I 166428 166428] io_service_pool.cc:47: IOServicePool is stopped.
[2021-10-18 02:12:21,370 C 166428 166650] core_worker.cc:187:  Check failed: core_worker_process The core worker process is not initialized yet or already shutdown.
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::core::CoreWorkerProcess::EnsureInitialized()
    ray::core::CoreWorkerProcess::GetCoreWorker()
    __pyx_pw_3ray_7_raylet_10CoreWorker_23get_worker_id()
    method_vectorcall_NOARGS



[2021-10-18 02:12:21,370 E 166428 166650] event.cc:125: [ Event 7a56a28155a782c45508107380da88f53ed0 {"worker_id":"e24f8ea18a6759f1adf50cb29a1bda9cb7544318503c088b050e6f73"} ] src/ray/core_worker/core_worker.cc:187: Check failed: core_worker_process The core worker process is not initialized yet or already shutdown.
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::core::CoreWorkerProcess::EnsureInitialized()
    ray::core::CoreWorkerProcess::GetCoreWorker()
    __pyx_pw_3ray_7_raylet_10CoreWorker_23get_worker_id()
    method_vectorcall_NOARGS

@fishbone

It looks like the core worker is being reported to GCS as failed.

@fishbone

[2021-10-18 21:19:45,304 E 106 106] event.cc:125: [ Event 4e152813409d0dcbb197f4e9b111cce7633f {"node_id":"a53aa192fb1ac0f1447a4ae9cc6d0cac8becd5cff53d0689112025d6"} ] src/ray/object_manager/object_manager.cc:381:Object id:8ccc3ddbe84e2a54ffffffffffffffffffffffff0100000027000000's size mismatches our record. Expected data size: 5344521, expected metadata size: 6, actual data size: 0, actual metadata size: 2
*** StackTrace Information ***
    ray::GetCallTrace()
    ray::SpdLogMessage::Flush()
    ray::SpdLogMessage::~SpdLogMessage()
    ray::RayLog::~RayLog()
    ray::ObjectManager::PushLocalObject()
    ray::ObjectManager::Push()
    ray::ObjectManager::HandlePull()::{lambda()#1}::operator()()
    std::__invoke_impl<>()
    std::__invoke_r<>()
    std::_Function_handler<>::_M_invoke()
    std::function<>::operator()()
    boost::asio::asio_handler_invoke<>()
    boost_asio_handler_invoke_helpers::invoke<>()
    boost::asio::detail::handler_work<>::complete<>()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler_operation::complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main


fishbone commented Oct 19, 2021

There are several issues here:

  1. data corruption
  2. the spillback error is not handled well
  3. node liveness doesn't match between the local view and the broadcast (gRPC-based broadcasting amplifies this issue; see the sketch below)

2 & 3 are the key problems here.
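To make issue 3 concrete, here is a minimal standalone C++ sketch (not Ray's actual code; `node_cache`, `Spillback`, and the node id are illustrative) of how a resource broadcast can reference a node that the raylet's local GCS node cache has not learned about yet, so a hard check on the lookup aborts the process, matching the stack trace above.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <unordered_map>

struct NodeInfo {
  std::string address;
};

// View 1: populated from the GCS node table notifications.
std::unordered_map<std::string, NodeInfo> node_cache;

// View 2: populated from the (separate) resource usage broadcast.
std::unordered_map<std::string, double> available_cpus;

void Spillback(const std::string &node_id) {
  auto it = node_cache.find(node_id);
  // Pre-fix assumption: the node table is always at least as fresh as the
  // resource broadcast. With two independent channels that does not hold.
  assert(it != node_cache.end() && "Spilling back, but no GCS info found for node");
  std::cout << "spilling back to " << it->second.address << "\n";
}

int main() {
  // The resource broadcast for a brand-new node arrives first ...
  available_cpus["6bc618f7..."] = 8.0;
  // ... and the scheduler picks it as a spillback target before the node
  // table notification for the same node has been processed.
  for (const auto &[node_id, cpus] : available_cpus) {
    if (cpus > 0) Spillback(node_id);  // aborts here, mirroring the CHECK failure
  }
  return 0;
}
```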

@fishbone

[2021-10-18 21:21:19,389 E 144 144] gcs_node_manager.cc:164: [ Event c82933fbe8b3cf542c76d093ea8dbf9f92af {"ip":"172.31.31.153","node_id":"080f8e7415b449d5efbb3366658c9528e677ef395a4cda1ab2d285b1"} ] The node with node id: 080f8e7415b449d5efbb3366658c9528e677ef395a4cda1ab2d285b1 and ip: 172.31.31.153 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

@fishbone

A summary of what happened:
step 1
The driver failed to get the result because the owner died:

(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=11987, ip=172.31.7.126)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=54041, ip=172.31.31.153)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=54043, ip=172.31.31.153)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=55506, ip=172.31.31.153)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=54109, ip=172.31.31.153)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 202, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=63760, ip=172.31.22.145)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 202, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=63847, ip=172.31.22.145)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 188, in grow_tree_remote
(pid=2015)   File "decision_tree/cart_with_tree.py", line 128, in _best_split
(pid=2015)   File "decision_tree/cart_with_tree.py", line 326, in best_split
(pid=2015)   File "decision_tree/cart_with_tree.py", line 326, in <listcomp>
(pid=2015) ray.exceptions.RayTaskError: ray::best_split_for_idx_remote() (pid=33096, ip=172.31.17.56)
(pid=2015)   At least one of the input arguments for this task could not be computed:
(pid=2015) ray.exceptions.OwnerDiedError: Failed to retrieve object 88aec4031cbe00c6ffffffffffffffffffffffff010000000b000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(pid=2015) 
(pid=2015) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*bb90d816689ff8e9025a3fd5311001a2cf2f5065aec94b8f3a52ef31*` at IP address 172.31.22.145) for more information about the Python worker failure.

step 2

Then check the raylet log at this node: /tmp/ray/session_latest/logs/*bb90d816689ff8e9025a3fd5311001a2cf2f5065aec94b8f3a52ef31*

It crashed due to spillback:

[2021-10-18 21:23:19,484 C 106 106] cluster_task_manager.cc:1060:  Check failed: node_info_opt Spilling back to a node manager, but no GCS info found for node 6bc618f783deb19735aea6615d5a71540e5a83c531c81f2bfeb70ca4
*** StackTrace Information ***
    ray::GetCallTrace()
    ray::SpdLogMessage::Flush()
    ray::SpdLogMessage::~SpdLogMessage()
    ray::RayLog::~RayLog()
    ray::raylet::ClusterTaskManager::Spillback()
    ray::raylet::ClusterTaskManager::SpillWaitingTasks()
    ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks()
    ray::raylet::NodeManager::UpdateResourceUsage()
    ray::raylet::NodeManager::HandleUpdateResourceUsage()
    ray::rpc::ServerCallImpl<>::HandleRequestImpl()
    ray::rpc::ServerCallImpl<>::HandleRequest()::{lambda()#1}::operator()()
    std::__invoke_impl<>()
    std::__invoke_r<>()
    std::_Function_handler<>::_M_invoke()
    std::function<>::operator()()
    boost::asio::asio_handler_invoke<>()
    boost_asio_handler_invoke_helpers::invoke<>()
    boost::asio::detail::handler_work<>::complete<>()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler_operation::complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

step 3
This happened because 6bc618f783deb19735aea6615d5a71540e5a83c531c81f2bfeb70ca4 is a new node, and the resource broadcast contains more recent information than the local view of the node table.

6bc618f783deb19735aea6615d5a71540e5a83c531c81f2bfeb70ca4 was created because the old raylet crashed for the same reason. Tracing back to the first crash:

step 4

[2021-10-18 21:19:45,301 C 106 106] object_manager.cc:381: Object id:8ccc3ddbe84e2a54ffffffffffffffffffffffff0100000027000000's size mismatches our record. Expected data size: 5344521, expected metadata size: 6, actual data size: 0, actual metadata size: 2
*** StackTrace Information ***
    ray::GetCallTrace()
    ray::SpdLogMessage::Flush()
    ray::SpdLogMessage::~SpdLogMessage()
    ray::RayLog::~RayLog()
    ray::ObjectManager::PushLocalObject()
    ray::ObjectManager::Push()
    ray::ObjectManager::HandlePull()::{lambda()#1}::operator()()
    std::__invoke_impl<>()
    std::__invoke_r<>()
    std::_Function_handler<>::_M_invoke()
    std::function<>::operator()()
    boost::asio::asio_handler_invoke<>()
    boost_asio_handler_invoke_helpers::invoke<>()
    boost::asio::detail::handler_work<>::complete<>()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler_operation::complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

It looks like:

  • the ordering changed because some of the updates run over gRPC and some run over Redis, and we need to take care of this (see the sketch below)
  • there is data corruption somewhere in the middle
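A toy illustration (not Ray code) of the ordering point in the first bullet: each channel is FIFO on its own, but there is no ordering guarantee between the Redis-backed and gRPC-backed channels, so a consumer can observe a resource update for a node before it has seen that node being added.

```cpp
#include <iostream>
#include <queue>
#include <string>

int main() {
  std::queue<std::string> redis_channel;  // e.g. node table notifications
  std::queue<std::string> grpc_channel;   // e.g. resource usage broadcast

  // Publisher-side order: the node is added first, then its resources are reported.
  redis_channel.push("node X added");
  grpc_channel.push("node X has 8 CPUs");

  // Consumer-side order depends only on which channel is drained first; nothing
  // forces the node-added event to be observed before the resource update.
  while (!grpc_channel.empty()) {
    std::cout << "observed: " << grpc_channel.front() << "\n";
    grpc_channel.pop();
  }
  while (!redis_channel.empty()) {
    std::cout << "observed: " << redis_channel.front() << "\n";
    redis_channel.pop();
  }
  return 0;
}
```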

@fishbone

With regard to performance (20 actors x 250 nodes):

gRPC based: 32s
Redis based: 23s

So the gRPC path is ~40% slower ((32 - 23) / 23 ≈ 39%).

@fishbone

20 actors x 500 nodes test:
Total time: 60s
vs
Total time: 161s

fishbone added a commit that referenced this issue Oct 22, 2021
## Why are these changes needed?
When Ray spills back a task, it checks through GCS whether the target node exists, so there is a race condition and the raylet sometimes crashes because of it.

This PR filters out nodes that are not available when selecting the spillback node.

## Related issue number
#19438
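A hedged sketch of the mitigation described in this commit, assuming the selection logic simply skips candidates missing from the local node cache; `SelectSpillbackNode` and the node names are made up for illustration and are not Ray's actual API.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Pick a spillback target, skipping candidates the raylet does not yet know to
// be alive instead of CHECK-failing on them later.
std::optional<std::string> SelectSpillbackNode(
    const std::vector<std::string> &candidates,
    const std::unordered_set<std::string> &alive_nodes) {
  for (const auto &node_id : candidates) {
    // A candidate known only from the resource broadcast is skipped for now;
    // it can be retried once the node table catches up.
    if (alive_nodes.count(node_id) > 0) return node_id;
  }
  return std::nullopt;  // nothing schedulable yet; keep the task queued
}

int main() {
  const std::unordered_set<std::string> alive_nodes = {"node-a"};
  const std::vector<std::string> candidates = {"node-new-unknown", "node-a"};
  if (auto target = SelectSpillbackNode(candidates, alive_nodes)) {
    std::cout << "spill back to " << *target << "\n";  // prints "node-a"
  }
  return 0;
}
```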
fishbone added a commit that referenced this issue Oct 26, 2021
… scheduling (#19664)

## Why are these changes needed?
Previously, we did not send a request if there was already an in-flight request. This is actually bad, because it prevents the raylet from getting the latest information. For example, if a request takes 200ms to arrive at the raylet, the raylet will lose one update, and the next request will only arrive after 200 + 100 + (in-flight time) ms. So we should still send the request.

TODO:
- Push the snapshot to the raylet if a message is lost.
- Handle message loss in the raylet better.


## Related issue number
#19438
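A rough sketch of the behavior change described in this commit, under the assumption that the broadcaster previously skipped a send whenever a request was still pending; `ResourceBroadcaster` and `SendUpdate` are illustrative names, not Ray's real classes.

```cpp
#include <iostream>
#include <string>

class ResourceBroadcaster {
 public:
  void SendUpdate(const std::string &snapshot) {
    // Old behavior (removed): skip the send while a request is pending, which
    // means the raylet misses that snapshot and waits a full extra period:
    //
    //   if (in_flight_) return;
    //
    // New behavior: always send. A slow RPC now only adds latency to one
    // update instead of dropping it entirely.
    in_flight_ = true;
    std::cout << "sending resource snapshot: " << snapshot << "\n";
    // In the real system an async gRPC callback would clear in_flight_; it is
    // cleared immediately here just to keep the sketch self-contained.
    in_flight_ = false;
  }

 private:
  bool in_flight_ = false;
};

int main() {
  ResourceBroadcaster broadcaster;
  broadcaster.SendUpdate("{node-a: 8 CPUs}");
  broadcaster.SendUpdate("{node-a: 6 CPUs}");  // no longer skipped even if a prior call were pending
  return 0;
}
```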
fishbone added a commit that referenced this issue Oct 27, 2021
…ode (#19779)

## Why are these changes needed?
We are concerned that gRPC-based broadcasting might have a negative impact on placement-group-related workloads. This test is to ensure it runs well before merging.

## Related issue number
#19438
fishbone added a commit that referenced this issue Nov 2, 2021
## Why are these changes needed?
When GCS broadcasts a node resource change, the raylet uses it to update the local node as well, which leads to the local node instance and nodes_ becoming inconsistent:

1. the local node has used up some placement group resources
2. GCS broadcasts node resources
3. the local node now appears to have free resources
4. the scheduler picks the local node
5. the local node can't actually schedule the task
6. since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4 ==> hangs

## Related issue number
#19438
fishbone added a commit that referenced this issue Nov 3, 2021
#19996)

This reverts commit f1eedb1.

## Why are these changes needed?
The self node should avoid reading any node resource updates from GCS, since it maintains its own local view by itself.

## Related issue number
#19438
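A minimal sketch of the rule this commit states, assuming broadcast updates are applied per node id; `ResourceUpdate` and `ApplyBroadcast` are hypothetical names used only to illustrate the idea.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

struct ResourceUpdate {
  std::string node_id;
  double available_cpus;
};

void ApplyBroadcast(const ResourceUpdate &update,
                    const std::string &local_node_id,
                    std::unordered_map<std::string, double> &remote_view) {
  if (update.node_id == local_node_id) {
    // Skip: the raylet already tracks its own usage (e.g. placement group
    // resources); overwriting it with a stale broadcast would make the
    // scheduler believe resources are free and keep re-picking the local node.
    return;
  }
  remote_view[update.node_id] = update.available_cpus;
}

int main() {
  const std::string local_node_id = "node-local";
  std::unordered_map<std::string, double> remote_view;
  ApplyBroadcast({"node-local", 8.0}, local_node_id, remote_view);  // ignored
  ApplyBroadcast({"node-b", 4.0}, local_node_id, remote_view);      // applied
  std::cout << "tracked remote nodes: " << remote_view.size() << "\n";  // prints 1
  return 0;
}
```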
fishbone added a commit that referenced this issue Nov 3, 2021
## Why are these changes needed?
This is part of the Redis removal project. This PR enables gRPC-based broadcasting by default.

## Related issue number

#19438 
fishbone added a commit that referenced this issue Nov 4, 2021
…20048)

## Why are these changes needed?
In this test case, the following can happen:

1. actor creation first uses all of the resources on the local node, which is a GPU node
2. an actor that needs a GPU then cannot be scheduled, since we only have one GPU node

The fix is only a short-term one and just tries to connect to the head node with CPU resources.

## Related issue number
#19438