Enable GRPC based resource reporting #19438

Closed
fishbone opened this issue Oct 18, 2021 · 9 comments · Fixed by #20119
@fishbone

Related PR: #16910

@fishbone fishbone added the P1 Issue that should be fixed within a few weeks label Oct 18, 2021
@fishbone fishbone added this to the GCS HA - Redis Removal milestone Oct 18, 2021
@fishbone fishbone self-assigned this Oct 18, 2021

fishbone commented Oct 18, 2021

[2021-10-18 21:20:50,244 E 106 106] event.cc:125: [ Event 8386b9dad797b9194ba76290d8fb4781d654 {"node_id":"080f8e7415b449d5efbb3366658c9528e677ef395a4cda1ab2d285b1"} ] src/ray/raylet/scheduling/cluster_task_manager.cc:1060: Check failed: node_info_opt Spilling back to a node manager, but no GCS info found for node 76476cfe25c32615b1aa46c3e7a01bcb31222370d77cdc3c2ec6596f
*** StackTrace Information ***
    ray::GetCallTrace()
    ray::SpdLogMessage::Flush()
    ray::SpdLogMessage::~SpdLogMessage()
    ray::RayLog::~RayLog()
    ray::raylet::ClusterTaskManager::Spillback()
    ray::raylet::ClusterTaskManager::SpillWaitingTasks()
    ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks()
    ray::raylet::NodeManager::UpdateResourceUsage()
    ray::raylet::NodeManager::HandleUpdateResourceUsage()
    ray::rpc::ServerCallImpl<>::HandleRequestImpl()
    ray::rpc::ServerCallImpl<>::HandleRequest()::{lambda()#1}::operator()()
    std::__invoke_impl<>()
    std::__invoke_r<>()
    std::_Function_handler<>::_M_invoke()
    std::function<>::operator()()
    boost::asio::asio_handler_invoke<>()
    boost_asio_handler_invoke_helpers::invoke<>()
    boost::asio::detail::handler_work<>::complete<>()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler_operation::complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main


fishbone commented Oct 18, 2021

[2021-10-18 02:12:19,711 I 166428 166645] service_based_accessor.cc:611: Received notification for node id = bd7459ba715e7054c4ed5bca59ccc86f6fed568867eeb903880be701, IsAlive = 1
[2021-10-18 02:12:20,694 I 166428 166645] core_worker.cc:783: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=IDLE_EXIT
[2021-10-18 02:12:20,694 I 166428 166428] core_worker.cc:333: Removed worker e24f8ea18a6759f1adf50cb29a1bda9cb7544318503c088b050e6f73
[2021-10-18 02:12:20,707 I 166428 166428] core_worker.cc:177: Destructing CoreWorkerProcess. pid: 166428
[2021-10-18 02:12:20,707 I 166428 166428] io_service_pool.cc:47: IOServicePool is stopped.
[2021-10-18 02:12:21,370 C 166428 166650] core_worker.cc:187:  Check failed: core_worker_process The core worker process is not initialized yet or already shutdown.
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::core::CoreWorkerProcess::EnsureInitialized()
    ray::core::CoreWorkerProcess::GetCoreWorker()
    __pyx_pw_3ray_7_raylet_10CoreWorker_23get_worker_id()
    method_vectorcall_NOARGS



[2021-10-18 02:12:21,370 E 166428 166650] event.cc:125: [ Event 7a56a28155a782c45508107380da88f53ed0 {"worker_id":"e24f8ea18a6759f1adf50cb29a1bda9cb7544318503c088b050e6f73"} ] src/ray/core_worker/core_worker.cc:187: Check failed: core_worker_process The core worker process is not initialized yet or already shutdown.
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::core::CoreWorkerProcess::EnsureInitialized()
    ray::core::CoreWorkerProcess::GetCoreWorker()
    __pyx_pw_3ray_7_raylet_10CoreWorker_23get_worker_id()
    method_vectorcall_NOARGS

@fishbone

It looks like the core worker is being reported to GCS as failed.

@fishbone

[2021-10-18 21:19:45,304 E 106 106] event.cc:125: [ Event 4e152813409d0dcbb197f4e9b111cce7633f {"node_id":"a53aa192fb1ac0f1447a4ae9cc6d0cac8becd5cff53d0689112025d6"} ] src/ray/object_manager/object_manager.cc:381:Object id:8ccc3ddbe84e2a54ffffffffffffffffffffffff0100000027000000's size mismatches our record. Expected data size: 5344521, expected metadata size: 6, actual data size: 0, actual metadata size: 2
*** StackTrace Information ***
    ray::GetCallTrace()
    ray::SpdLogMessage::Flush()
    ray::SpdLogMessage::~SpdLogMessage()
    ray::RayLog::~RayLog()
    ray::ObjectManager::PushLocalObject()
    ray::ObjectManager::Push()
    ray::ObjectManager::HandlePull()::{lambda()#1}::operator()()
    std::__invoke_impl<>()
    std::__invoke_r<>()
    std::_Function_handler<>::_M_invoke()
    std::function<>::operator()()
    boost::asio::asio_handler_invoke<>()
    boost_asio_handler_invoke_helpers::invoke<>()
    boost::asio::detail::handler_work<>::complete<>()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler_operation::complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main


fishbone commented Oct 19, 2021

There are several issues here:

  1. data corruption
  2. the spillback error is not handled well
  3. node liveness doesn't match between the local view and the broadcast (gRPC-based broadcasting amplifies this issue; see the sketch below)

2 & 3 are the key problems here.
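To make issue 3 concrete, here is a minimal standalone C++ sketch (not Ray's actual code; `node_cache`, `Spillback`, and the node id are illustrative) of how a resource broadcast can reference a node that the raylet's local GCS node cache has not learned about yet, so a hard check on the lookup aborts the process, matching the stack trace above.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <unordered_map>

struct NodeInfo {
  std::string address;
};

// View 1: populated from the GCS node table notifications.
std::unordered_map<std::string, NodeInfo> node_cache;

// View 2: populated from the (separate) resource usage broadcast.
std::unordered_map<std::string, double> available_cpus;

void Spillback(const std::string &node_id) {
  auto it = node_cache.find(node_id);
  // Pre-fix assumption: the node table is always at least as fresh as the
  // resource broadcast. With two independent channels that does not hold.
  assert(it != node_cache.end() && "Spilling back, but no GCS info found for node");
  std::cout << "spilling back to " << it->second.address << "\n";
}

int main() {
  // The resource broadcast for a brand-new node arrives first ...
  available_cpus["6bc618f7..."] = 8.0;
  // ... and the scheduler picks it as a spillback target before the node
  // table notification for the same node has been processed.
  for (const auto &[node_id, cpus] : available_cpus) {
    if (cpus > 0) Spillback(node_id);  // aborts here, mirroring the CHECK failure
  }
  return 0;
}
```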

@fishbone

[2021-10-18 21:21:19,389 E 144 144] gcs_node_manager.cc:164: [ Event c82933fbe8b3cf542c76d093ea8dbf9f92af {"ip":"172.31.31.153","node_id":"080f8e7415b449d5efbb3366658c9528e677ef395a4cda1ab2d285b1"} ] The node with node id: 080f8e7415b449d5efbb3366658c9528e677ef395a4cda1ab2d285b1 and ip: 172.31.31.153 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

@fishbone

A summary of what happened:
step 1
The driver failed to get the result because the owner died:

(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=11987, ip=172.31.7.126)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=54041, ip=172.31.31.153)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=54043, ip=172.31.31.153)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=55506, ip=172.31.31.153)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 201, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=54109, ip=172.31.31.153)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 202, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=63760, ip=172.31.22.145)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 202, in grow_tree_remote
(pid=2015) ray.exceptions.RayTaskError: ray::grow_tree_remote() (pid=63847, ip=172.31.22.145)
(pid=2015)   File "decision_tree/cart_with_tree.py", line 188, in grow_tree_remote
(pid=2015)   File "decision_tree/cart_with_tree.py", line 128, in _best_split
(pid=2015)   File "decision_tree/cart_with_tree.py", line 326, in best_split
(pid=2015)   File "decision_tree/cart_with_tree.py", line 326, in <listcomp>
(pid=2015) ray.exceptions.RayTaskError: ray::best_split_for_idx_remote() (pid=33096, ip=172.31.17.56)
(pid=2015)   At least one of the input arguments for this task could not be computed:
(pid=2015) ray.exceptions.OwnerDiedError: Failed to retrieve object 88aec4031cbe00c6ffffffffffffffffffffffff010000000b000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(pid=2015) 
(pid=2015) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*bb90d816689ff8e9025a3fd5311001a2cf2f5065aec94b8f3a52ef31*` at IP address 172.31.22.145) for more information about the Python worker failure.

step 2

Then check the raylet log at this node: /tmp/ray/session_latest/logs/*bb90d816689ff8e9025a3fd5311001a2cf2f5065aec94b8f3a52ef31*

It crashed due to spillback:

[2021-10-18 21:23:19,484 C 106 106] cluster_task_manager.cc:1060:  Check failed: node_info_opt Spilling back to a node manager, but no GCS info found for node 6bc618f783deb19735aea6615d5a71540e5a83c531c81f2bfeb70ca4
*** StackTrace Information ***
    ray::GetCallTrace()
    ray::SpdLogMessage::Flush()
    ray::SpdLogMessage::~SpdLogMessage()
    ray::RayLog::~RayLog()
    ray::raylet::ClusterTaskManager::Spillback()
    ray::raylet::ClusterTaskManager::SpillWaitingTasks()
    ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks()
    ray::raylet::NodeManager::UpdateResourceUsage()
    ray::raylet::NodeManager::HandleUpdateResourceUsage()
    ray::rpc::ServerCallImpl<>::HandleRequestImpl()
    ray::rpc::ServerCallImpl<>::HandleRequest()::{lambda()#1}::operator()()
    std::__invoke_impl<>()
    std::__invoke_r<>()
    std::_Function_handler<>::_M_invoke()
    std::function<>::operator()()
    boost::asio::asio_handler_invoke<>()
    boost_asio_handler_invoke_helpers::invoke<>()
    boost::asio::detail::handler_work<>::complete<>()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler_operation::complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

step 3
This happened because 6bc618f783deb19735aea6615d5a71540e5a83c531c81f2bfeb70ca4 is a new node, and the resource broadcast contains more recent information than the local view of the node table.

6bc618f783deb19735aea6615d5a71540e5a83c531c81f2bfeb70ca4 was created because the old raylet crashed for the same reason. Tracing back to the first crash:

step 4

[2021-10-18 21:19:45,301 C 106 106] object_manager.cc:381: Object id:8ccc3ddbe84e2a54ffffffffffffffffffffffff0100000027000000's size mismatches our record. Expected data size: 5344521, expected metadata size: 6, actual data size: 0, actual metadata size: 2
*** StackTrace Information ***
    ray::GetCallTrace()
    ray::SpdLogMessage::Flush()
    ray::SpdLogMessage::~SpdLogMessage()
    ray::RayLog::~RayLog()
    ray::ObjectManager::PushLocalObject()
    ray::ObjectManager::Push()
    ray::ObjectManager::HandlePull()::{lambda()#1}::operator()()
    std::__invoke_impl<>()
    std::__invoke_r<>()
    std::_Function_handler<>::_M_invoke()
    std::function<>::operator()()
    boost::asio::asio_handler_invoke<>()
    boost_asio_handler_invoke_helpers::invoke<>()
    boost::asio::detail::handler_work<>::complete<>()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler_operation::complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

It looks like:

  • the ordering changed because some of the updates run over gRPC and some run over Redis, and we need to take care of this (see the sketch below)
  • there is data corruption somewhere in the middle
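A toy illustration (not Ray code) of the ordering point in the first bullet: each channel is FIFO on its own, but there is no ordering guarantee between the Redis-backed and gRPC-backed channels, so a consumer can observe a resource update for a node before it has seen that node being added.

```cpp
#include <iostream>
#include <queue>
#include <string>

int main() {
  std::queue<std::string> redis_channel;  // e.g. node table notifications
  std::queue<std::string> grpc_channel;   // e.g. resource usage broadcast

  // Publisher-side order: the node is added first, then its resources are reported.
  redis_channel.push("node X added");
  grpc_channel.push("node X has 8 CPUs");

  // Consumer-side order depends only on which channel is drained first; nothing
  // forces the node-added event to be observed before the resource update.
  while (!grpc_channel.empty()) {
    std::cout << "observed: " << grpc_channel.front() << "\n";
    grpc_channel.pop();
  }
  while (!redis_channel.empty()) {
    std::cout << "observed: " << redis_channel.front() << "\n";
    redis_channel.pop();
  }
  return 0;
}
```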

@fishbone

With regard to performance (20 actors x 250 nodes):

gRPC based: 32s
Redis based: 23s

So the gRPC path is ~40% slower ((32 - 23) / 23 ≈ 39%).

@fishbone

20 actors x 500 nodes test:
Total time: 60s
vs
Total time: 161s

fishbone added a commit that referenced this issue Oct 22, 2021
## Why are these changes needed?
When Ray spills back a task, it checks through GCS whether the target node exists, so there is a race condition and the raylet sometimes crashes because of it.

This PR filters out nodes that are not available when selecting the spillback node.

## Related issue number
#19438
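A hedged sketch of the mitigation described in this commit, assuming the selection logic simply skips candidates missing from the local node cache; `SelectSpillbackNode` and the node names are made up for illustration and are not Ray's actual API.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Pick a spillback target, skipping candidates the raylet does not yet know to
// be alive instead of CHECK-failing on them later.
std::optional<std::string> SelectSpillbackNode(
    const std::vector<std::string> &candidates,
    const std::unordered_set<std::string> &alive_nodes) {
  for (const auto &node_id : candidates) {
    // A candidate known only from the resource broadcast is skipped for now;
    // it can be retried once the node table catches up.
    if (alive_nodes.count(node_id) > 0) return node_id;
  }
  return std::nullopt;  // nothing schedulable yet; keep the task queued
}

int main() {
  const std::unordered_set<std::string> alive_nodes = {"node-a"};
  const std::vector<std::string> candidates = {"node-new-unknown", "node-a"};
  if (auto target = SelectSpillbackNode(candidates, alive_nodes)) {
    std::cout << "spill back to " << *target << "\n";  // prints "node-a"
  }
  return 0;
}
```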
fishbone added a commit that referenced this issue Oct 26, 2021
… scheduling (#19664)

## Why are these changes needed?
Previously, we did not send a request if there was already an in-flight request. This is actually bad, because it prevents the raylet from getting the latest information. For example, if a request takes 200ms to arrive at the raylet, the raylet will lose one update, and the next request will only arrive after 200 + 100 + (in-flight time) ms. So we should still send the request.

TODO:
- Push the snapshot to the raylet if a message is lost.
- Handle message loss in the raylet better.


## Related issue number
#19438
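A rough sketch of the behavior change described in this commit, under the assumption that the broadcaster previously skipped a send whenever a request was still pending; `ResourceBroadcaster` and `SendUpdate` are illustrative names, not Ray's real classes.

```cpp
#include <iostream>
#include <string>

class ResourceBroadcaster {
 public:
  void SendUpdate(const std::string &snapshot) {
    // Old behavior (removed): skip the send while a request is pending, which
    // means the raylet misses that snapshot and waits a full extra period:
    //
    //   if (in_flight_) return;
    //
    // New behavior: always send. A slow RPC now only adds latency to one
    // update instead of dropping it entirely.
    in_flight_ = true;
    std::cout << "sending resource snapshot: " << snapshot << "\n";
    // In the real system an async gRPC callback would clear in_flight_; it is
    // cleared immediately here just to keep the sketch self-contained.
    in_flight_ = false;
  }

 private:
  bool in_flight_ = false;
};

int main() {
  ResourceBroadcaster broadcaster;
  broadcaster.SendUpdate("{node-a: 8 CPUs}");
  broadcaster.SendUpdate("{node-a: 6 CPUs}");  // no longer skipped even if a prior call were pending
  return 0;
}
```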
fishbone added a commit that referenced this issue Oct 27, 2021
…ode (#19779)

## Why are these changes needed?
We are concerned that gRPC-based broadcasting might have a negative impact on placement-group-related workloads. This test is to ensure it runs well before merging.

## Related issue number
#19438
fishbone added a commit that referenced this issue Nov 2, 2021
## Why are these changes needed?
When GCS broadcasts a node resource change, the raylet uses it to update the local node as well, which leads to the local node instance and nodes_ becoming inconsistent:

1. the local node has used up some placement group resources
2. GCS broadcasts node resources
3. the local node now appears to have free resources
4. the scheduler picks the local node
5. the local node can't actually schedule the task
6. since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4 ==> hangs

## Related issue number
#19438
fishbone added a commit that referenced this issue Nov 3, 2021
#19996)

This reverts commit f1eedb1.

## Why are these changes needed?
The self node should avoid reading any node resource updates from GCS, since it maintains its own local view by itself.

## Related issue number
#19438
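A minimal sketch of the rule this commit states, assuming broadcast updates are applied per node id; `ResourceUpdate` and `ApplyBroadcast` are hypothetical names used only to illustrate the idea.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

struct ResourceUpdate {
  std::string node_id;
  double available_cpus;
};

void ApplyBroadcast(const ResourceUpdate &update,
                    const std::string &local_node_id,
                    std::unordered_map<std::string, double> &remote_view) {
  if (update.node_id == local_node_id) {
    // Skip: the raylet already tracks its own usage (e.g. placement group
    // resources); overwriting it with a stale broadcast would make the
    // scheduler believe resources are free and keep re-picking the local node.
    return;
  }
  remote_view[update.node_id] = update.available_cpus;
}

int main() {
  const std::string local_node_id = "node-local";
  std::unordered_map<std::string, double> remote_view;
  ApplyBroadcast({"node-local", 8.0}, local_node_id, remote_view);  // ignored
  ApplyBroadcast({"node-b", 4.0}, local_node_id, remote_view);      // applied
  std::cout << "tracked remote nodes: " << remote_view.size() << "\n";  // prints 1
  return 0;
}
```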
fishbone added a commit that referenced this issue Nov 3, 2021
## Why are these changes needed?
This is part of the Redis removal project. This PR enables gRPC-based broadcasting by default.

## Related issue number

#19438 
fishbone added a commit that referenced this issue Nov 4, 2021
…20048)

## Why are these changes needed?
In this test case, the following can happen:

1. actor creation first uses all of the resources on the local node, which is a GPU node
2. an actor that needs a GPU then cannot be scheduled, since we only have one GPU node

The fix is only a short-term one and just tries to connect to the head node with CPU resources.

## Related issue number
#19438