Enable GRPC based resource reporting #19438
It looks like the core worker is reported as failed to GCS.
There are some issues here; 2 and 3 are the key problems.
Some summary here:
Step 2: Then check the raylet at this node; it crashed due to spill back.
Step 3: 6bc618f783deb19735aea6615d5a71540e5a83c531c81f2bfeb70ca4 is created because the old one crashed for the same reason. Trace back to the first crashed one in step 4.
It looks like
With regard to performance (20 actors x 250 nodes), gRPC based: 32s, so the regression is ~40% slower.
20 actors x 500 nodes test.
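For context, a minimal sketch of how such an actor-scheduling timing could be collected; the actor count, resource shape, and names here are illustrative assumptions, not the actual benchmark script used above.

```python
import time
import ray

ray.init(address="auto")  # assumes an already running multi-node cluster

@ray.remote(num_cpus=1)
class Probe:
    def ping(self):
        return "ok"

NUM_ACTORS = 20 * 250  # e.g. 20 actors per node x 250 nodes in the quoted run

start = time.time()
actors = [Probe.remote() for _ in range(NUM_ACTORS)]
# Wait until every actor has actually been scheduled and is responsive.
ray.get([a.ping.remote() for a in actors])
print(f"actor creation latency: {time.time() - start:.1f}s")
```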
## Why are these changes needed?
When Ray spills a task back, it checks through GCS whether the target node still exists, so there is a race condition and the raylet sometimes crashes because of it. This PR filters out nodes that are no longer available when selecting a node.
## Related issue number
#19438
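A minimal Python sketch of the idea only (the real fix is in the C++ scheduler, and all names here are hypothetical): before spilling back, drop any candidate node that the local view of GCS no longer reports as alive, instead of crashing when a dead node is picked.

```python
def select_spillback_node(candidate_nodes, alive_node_ids):
    """Pick a spillback target, skipping nodes no longer known to be alive."""
    available = [n for n in candidate_nodes if n.node_id in alive_node_ids]
    if not available:
        # Nothing valid to spill to; keep the task local rather than failing.
        return None
    # Illustrative policy: prefer the surviving node with the most free CPUs.
    return max(available, key=lambda n: n.free_cpus)
```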
… scheduling (#19664)
## Why are these changes needed?
Previously, we did not send a request if there was already an in-flight request. This is actually bad, because it prevents the raylet from getting the latest information. For example, if a request needs 200ms to arrive at the raylet, the raylet will lose one update; the next request will then arrive after 200 + 100 + (in-flight time) ms. So we should still send the request.
TODO:
- Push the snapshot to the raylet if the message is lost.
- Handle message loss in the raylet better.
## Related issue number
#19438
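A hedged sketch of that reporting loop in Python (hypothetical names; the actual sender lives in C++): if sends were skipped while a previous RPC was in flight, a 200ms RPC plus the 100ms report period would push the next update out to 200 + 100 + (in-flight time) ms, so the loop below fires the report unconditionally every period.

```python
import asyncio

REPORT_INTERVAL_S = 0.1  # 100ms reporting period

async def report_resources_forever(send_report):
    """Send a resource report every period, even if one is still in flight."""
    while True:
        # Fire the report without waiting for the previous RPC to complete.
        asyncio.create_task(send_report())
        await asyncio.sleep(REPORT_INTERVAL_S)
```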
## Why are these changes needed?
When GCS broadcasts a node resource change, the raylet uses it to update the local node as well, which makes the local node instance and nodes_ inconsistent:
1. The local node has used up some placement group resources.
2. GCS broadcasts node resources.
3. The local node now appears to have resources again.
4. The scheduler picks the local node.
5. The local node can't schedule the task.
6. Since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4 ==> hangs.
## Related issue number
#19438
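A sketch of the fix's intent, with hypothetical names and data shapes: when applying a GCS resource broadcast, never overwrite the entry for the local node, whose usage (e.g. placement group allocations) is tracked authoritatively on the raylet itself.

```python
def apply_resource_broadcast(nodes, local_node_id, broadcast):
    """Merge a GCS broadcast into the node table, skipping the local node."""
    for node_id, resources in broadcast.items():
        if node_id == local_node_id:
            continue  # keep the locally tracked view; the GCS data may be stale
        nodes[node_id] = resources
```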
…20048)
## Why are these changes needed?
In this test case, the following could happen:
1. Actor creation first uses all resources on the local node, which is a GPU node.
2. The actor that needs a GPU then cannot be scheduled, since we only have one GPU node.
The fix is only a short-term one: it just tries to connect to the head node with CPU resources.
## Related issue number
#19438
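A rough sketch of the contention being worked around, under the assumption of a running cluster with a single GPU node; the resource shapes and names are illustrative. The point is that CPU-only work should not consume the lone GPU node's capacity, so the GPU actor can still be placed.

```python
import ray

ray.init(address="auto")  # assumes an existing cluster with exactly one GPU node

@ray.remote(num_cpus=1)          # CPU-only work; should land on CPU nodes...
def cpu_task():
    return "done"

@ray.remote(num_gpus=1)          # ...leaving the single GPU free for this actor
class GpuActor:
    def ready(self):
        return True

ray.get([cpu_task.remote() for _ in range(10)])
actor = GpuActor.remote()
assert ray.get(actor.ready.remote())
```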
Related PR: #16910