Improvements to node scheduler #213

Merged: 1 commit merged into master from node on Jul 21, 2023

Conversation

mwylde (Member) commented Jul 20, 2023

This addresses a few issues with the node scheduler:

  • Previously we sent the entire pipeline binary as a single gRPC message, but as our pipeline binaries have grown in size this can bump up against the max message size limit. This PR moves the transfer to a multi-part stream of 2 MB chunks (see the sketch after this list).
  • If a node dies, the jobs running on it go into recovery. The first phase of recovery tries to clean up the old cluster, which for the node scheduler means telling the node to shut down the worker. But the node itself is down, so it can't do that. We now no longer treat this as an error, so recovery can continue under the assumption that if the node isn't responding, the worker probably isn't running either.
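
As a rough sketch of the chunked approach, the transfer might look like the following. This is illustrative only: `BinaryChunk` and `start_worker` are assumed names, not necessarily the message and RPC defined in this PR; `NodeGrpcClient` is the generated tonic client.

```rust
// Illustrative sketch (not the exact API in this PR): stream the pipeline
// binary to the node in 2 MB chunks over a client-streaming gRPC call.
// `BinaryChunk` and `start_worker` are assumed names for the proto message
// and RPC; `NodeGrpcClient` is the generated tonic client.
use tokio_stream::iter;

const CHUNK_SIZE: usize = 2 * 1024 * 1024; // 2 MB per gRPC message

async fn send_pipeline_binary(
    client: &mut NodeGrpcClient<tonic::transport::Channel>,
    binary: &[u8],
) -> Result<(), tonic::Status> {
    let chunks: Vec<BinaryChunk> = binary
        .chunks(CHUNK_SIZE)
        .map(|c| BinaryChunk { data: c.to_vec() })
        .collect();

    // Generated client-streaming methods accept any `impl Stream` of request
    // messages via `IntoStreamingRequest`, so an iterator-backed stream works.
    client.start_worker(iter(chunks)).await?;

    Ok(())
}
```

Keeping each message at 2 MB stays comfortably under any per-message limit, while the stream as a whole can carry an arbitrarily large binary.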

jacksonrnewhouse (Contributor) commented:

We could probably bump the message size limits up to the point where that isn't an issue. Have a preference?

Inline review on this diff excerpt:

        force,
    ) {
        let Ok(mut client) = NodeGrpcClient::connect(format!("http://{}", node.addr)).await else {
            warn!("Failed to connect to worker to stop; this likely means it is dead");

jacksonrnewhouse (Contributor):

If we wanted to be careful about making sure a pipeline was actually dead, we'd coordinate this behavior with timeouts/heartbeats. Might not be worth it for what is likely not the main mode of deployment.

mwylde (Member, Author):

Yeah, I think there are ways of making this more robust, but waiting for a heartbeat does mean taking longer to recover failed pipelines.

There are basically two cases:

  • The node process has died, in which case the workers will also have died (generally) because they are child processes of the node
  • We have a network partition between the controller and the node

In case 1, we're doing the right thing. In case 2, the worker and the node will each die on their own as they fail to talk to the controller.
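
For case 2, a hypothetical sketch of that self-termination behavior (the client type, RPC, and timings below are all illustrative assumptions, not the actual worker code):

```rust
use std::time::Duration;

const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(5);
const MAX_MISSED: u32 = 6; // give up after ~30s of failed heartbeats

// Hypothetical heartbeat loop on the node/worker side; `ControllerGrpcClient`
// and `HeartbeatReq` stand in for whatever the real controller API is.
async fn heartbeat_loop(mut controller: ControllerGrpcClient<tonic::transport::Channel>) {
    let mut missed = 0;
    loop {
        tokio::time::sleep(HEARTBEAT_INTERVAL).await;
        match controller.heartbeat(HeartbeatReq::default()).await {
            Ok(_) => missed = 0,
            Err(_) => {
                missed += 1;
                if missed >= MAX_MISSED {
                    // Controller unreachable for too long: exit rather than
                    // keep running a pipeline that can no longer be managed.
                    std::process::exit(1);
                }
            }
        }
    }
}
```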

mwylde (Member, Author) commented Jul 21, 2023

> We could probably bump the message size limits up to the point where that isn't an issue. Have a preference?

We actually can't: there's a hard limit in tonic, and we've always had it configured as high as it can be set. It also looks like that limit has just been lowered to 4 MB in the newest version of tonic.
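
For context, the knob in question is the per-message decode cap on the generated tonic client/server wrappers. A sketch of where it is set, assuming the tonic 0.9+ generated-client API; the address and 64 MB value are illustrative:

```rust
// Where the limit lives in tonic 0.9+: the generated client exposes a
// per-message decode cap (the generated server wrapper has a matching
// setting for requests it receives, which is the one that matters for a
// large binary upload). Even raised, a single message still has to hold the
// entire binary, which is why this PR streams 2 MB chunks instead.
async fn connect_node(
    addr: &'static str,
) -> Result<NodeGrpcClient<tonic::transport::Channel>, tonic::transport::Error> {
    let channel = tonic::transport::Channel::from_static(addr).connect().await?;
    Ok(NodeGrpcClient::new(channel).max_decoding_message_size(64 * 1024 * 1024))
}
```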

mwylde merged commit b32e657 into master on Jul 21, 2023
mwylde deleted the node branch on July 21, 2023 at 00:23