Streamline restatectl connection to cluster #2592

Open · wants to merge 1 commit into base: main
Conversation

muhamadazmy (Contributor):

Streamline restatectl connection to cluster

Summary:
Make it easier and cleaner to run the restatectl command by accepting the Restate nodes' addresses via a unified `--address` argument.

  • Support passing multiple addresses (see the example after this list)
  • The command will try to fetch the nodes configuration from the first reachable node
  • All other subcommands can then use the connection info to use the nodes configuration directly or to connect to a specific node role
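For example, an invocation could look like this (a hypothetical usage sketch: hostnames and ports are illustrative, and repeating the flag for multiple addresses is an assumption about the new argument's shape):

```sh
# Pass several candidate addresses; restatectl fetches the nodes
# configuration from the first reachable node. The addresses shown
# are placeholders for your actual deployment.
restatectl status \
  --address http://node-1:5122 \
  --address http://node-2:5122
```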

muhamadazmy force-pushed the pr2592 branch 2 times, most recently from c68d2ad to 512163f on January 31, 2025 at 09:20
muhamadazmy marked this pull request as ready for review on January 31, 2025 at 09:22
muhamadazmy requested a review from pcholakov on January 31, 2025 at 09:30
tillrohrmann (Contributor) left a comment:

Really nice work @muhamadazmy. I like how the ConnectionInfo now contains the logic to retrieve the available nodes :-)

I left a few smaller comments. The one thing that is potentially a bit dangerous is how cluster-provision works right now. It would be great if we could ensure that it behaves idempotently when retried.

Comment on lines 59 to 61:

```rust
let Some((_, channel)) = connection_info.channels().next() else {
    anyhow::bail!("At least one node address is required");
};
```
tillrohrmann (Contributor):

What we have to make sure of for the cluster-provision command is that the call is idempotent. So if the user retries, it shouldn't result in another node being provisioned (unless they explicitly ask us to do so). Accidentally provisioning multiple nodes is a very tricky situation to be in. Moreover, if the user is using the embedded metadata store, they need to hit a node that has the metadata-server role. If they use an external metadata store, they can hit any node.

muhamadazmy (Contributor, Author):

@tillrohrmann I understand. This is why I am only taking the first address provided on the command line. This should be idempotent unless the user changes the parameters they pass on the command line.

An improvement might be to fail if multiple addresses are given and to ask the user to pass only one and stick to it?

Contributor:

I'm not quite following this discussion - this calls into NodeCtlSvcHandler::provision_cluster, from where we make a series of updates to cluster metadata using the metadata client configuration on the node on which we land. The updates themselves are conditional updates to versioned metadata values, and we appear to cater for retries. In particular, writing the initial nodes config is conditional on Precondition::DoesNotExist. What am I missing?
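To make the pattern described above concrete, the retry-safe write looks roughly like this (a minimal sketch only; the client type, key constant, and module paths are approximations, not the exact handler code):

```rust
// use restate_core::metadata_store::{MetadataStoreClient, Precondition}; // path approximate

// Illustrative sketch: the initial NodesConfiguration write is guarded by
// Precondition::DoesNotExist. A retried provision call re-attempts the same
// conditional write, which fails the precondition instead of provisioning a
// second cluster, so the operation stays idempotent.
async fn write_initial_nodes_config(
    client: &MetadataStoreClient,         // approximate client type
    initial_config: &NodesConfiguration,  // value stored on first provision
) -> anyhow::Result<()> {
    client
        .put(
            NODES_CONFIG_KEY.clone(),     // approximate key constant
            initial_config,
            Precondition::DoesNotExist,   // the write only succeeds once
        )
        .await?;
    Ok(())
}
```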

Comment on lines 48 to 53:

```rust
chain_table.add_row(vec![
    Cell::new(log_id),
    Cell::new(err.code()).fg(Color::DarkRed),
    Cell::new(err.message()).fg(Color::DarkRed),
]);
continue;
```
Contributor:

Is it intended to remove this logic if we cannot find the tail for a given log_id?

muhamadazmy (Contributor, Author):

It is intended. I can partially restore it.

Comment on lines 147 to 156:

```rust
let mut response = connection
    .try_each(None, |channel| async {
        let mut client = NodeCtlSvcClient::new(channel)
            .accept_compressed(CompressionEncoding::Gzip)
            .send_compressed(CompressionEncoding::Gzip);

        client.get_metadata(req).await
    })
    .await?
    .into_inner();
```
Contributor:

Can we make sure to get the latest Logs if we can obtain it from multiple nodes?
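One possible shape for that (an illustrative sketch, not the PR's code: the decode helper is hypothetical, though the Logs metadata being versioned is what this thread relies on):

```rust
// Query every reachable node instead of stopping at the first success,
// and keep the Logs response with the highest version number.
let mut newest: Option<Logs> = None;
for (_, channel) in connection_info.channels() {
    let mut client = NodeCtlSvcClient::new(channel.clone())
        .accept_compressed(CompressionEncoding::Gzip)
        .send_compressed(CompressionEncoding::Gzip);

    if let Ok(response) = client.get_metadata(req.clone()).await {
        // decode_logs is a hypothetical helper that deserializes the
        // returned metadata payload into a Logs value.
        if let Ok(logs) = decode_logs(response.into_inner()) {
            if newest.as_ref().map_or(true, |n| logs.version() > n.version()) {
                newest = Some(logs);
            }
        }
    }
}
```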

Comment on lines 60 to 67:

```rust
.try_each(None, |channel| async {
    let mut client = NodeCtlSvcClient::new(channel)
        .accept_compressed(CompressionEncoding::Gzip)
        .send_compressed(CompressionEncoding::Gzip);

    client.get_metadata(req).await
})
.await?
```
Contributor:

Same question here regarding what the latest Logs version is.

tools/restatectl/src/connection.rs (conversation resolved)
```rust
    None => Either::Right(nodes_config.iter()),
}
.sorted_by(|left, right| {
    // nodes that we already have channels open for, gets higher presence.
```
Contributor:

Suggested change:

```diff
- // nodes that we already have channels open for, gets higher presence.
+ // nodes that we already have channels open for get higher presence.
```


```rust
#[derive(Debug, thiserror::Error)]
pub enum ConnectionInfoError {
    #[error("Could not retrieve nodes configuration. Possible un provisioned cluster!")]
```
Contributor:

Suggested change:

```diff
- #[error("Could not retrieve nodes configuration. Possible un provisioned cluster!")]
+ #[error("Could not retrieve nodes configuration. Has the cluster been provisioned yet?")]
```

```rust
#[derive(Debug, thiserror::Error)]
pub enum ConnectionInfoError {
    #[error("Could not retrieve nodes configuration. Possible un provisioned cluster!")]
    MissingNodesConfiguration,
```
Contributor:

Follow-up: Wondering whether a CodedError with more details and a short description of how to provision the cluster could be helpful here.

```rust
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
    writeln!(
        f,
        "Encountered multiple errors trying to retrieve nodes configurations:"
```
Contributor:

Suggested change:

```diff
- "Encountered multiple errors trying to retrieve nodes configurations:"
+ "Encountered multiple errors trying to retrieve nodes' configuration:"
```

Contributor:

We seem to use NodesErrors also for the try_each call, where we are not necessarily trying to retrieve the NodesConfiguration. Maybe this needs an update of the Display implementation.

muhamadazmy (Contributor, Author):

Yes, makes sense.

pcholakov (Contributor) left a comment:

Nice improvement, @muhamadazmy! Some nitpicks but overall looks good to me.

One question I had was whether it's useful to report the node from which we obtained the response rendered by a particular invocation. The node sorting should make that somewhat repeatable, but it may help operators in case different nodes have diverged in their views.

```rust
    None => Either::Right(nodes_config.iter()),
}
.sorted_by(|left, right| {
    // nodes that we already have channels open for, gets higher presence.
```
Contributor:

Maybe?

Suggested change:

```diff
- // nodes that we already have channels open for, gets higher presence.
+ // nodes for which we already have open channels get higher precedence.
```

```
@@ -0,0 +1,267 @@
// Copyright (c) 2023 - 2025 Restate Software, Inc., Restate GmbH.
```
Contributor:

Suggested change:

```diff
- // Copyright (c) 2023 - 2025 Restate Software, Inc., Restate GmbH.
+ // Copyright (c) 2025 Restate Software, Inc., Restate GmbH.
```

Comment on lines +149 to +230:

```rust
pub async fn try_each<F, T, Fut>(
    &self,
    role: Option<Role>,
    mut closure: F,
) -> Result<Response<T>, ConnectionInfoError>
```
Contributor:

Since we only ever construct either ClusterCtrlSvcClient or NodeCtlSvcClient, maybe it would be useful to just have two convenience wrappers that pass the closure a pre-instantiated client? That would eliminate a fair amount of duplicated code across the various commands.

muhamadazmy (Contributor, Author):

I actually tried that, but it's much harder than it sounds because the compiler complains about "captured variable cannot escape FnMut closure body", etc., so I let it go for now. I will definitely try to do this later.
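For illustration, the wrapper being discussed could look roughly like this (a sketch under the assumption that the caller's closure returns an owned future; the method name try_each_node_ctl is hypothetical):

```rust
// Hypothetical convenience wrapper: pre-instantiates a NodeCtlSvcClient so
// each command only supplies the RPC call itself. This shape compiles when
// the caller's closure takes the client by value and returns an owned
// future; the "captured variable cannot escape FnMut closure body" error
// appears when an async block borrows the client instead of owning it.
pub async fn try_each_node_ctl<F, T, Fut>(
    &self,
    role: Option<Role>,
    mut closure: F,
) -> Result<Response<T>, ConnectionInfoError>
where
    F: FnMut(NodeCtlSvcClient<Channel>) -> Fut,
    Fut: Future<Output = Result<Response<T>, Status>>,
{
    self.try_each(role, move |channel| {
        let client = NodeCtlSvcClient::new(channel)
            .accept_compressed(CompressionEncoding::Gzip)
            .send_compressed(CompressionEncoding::Gzip);
        closure(client)
    })
    .await
}
```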

```rust
    Some(role) => Either::Left(nodes_config.iter_role(role)),
    None => Either::Right(nodes_config.iter()),
}
.sorted_by(|left, right| {
```
Contributor:

Nit: The left/right naming threw me for a second because the stream is of type Either - I'd maybe just use a/b here instead :-)


```rust
#[derive(Debug, thiserror::Error)]
pub enum ConnectionInfoError {
    #[error("Could not retrieve nodes configuration. Possible un provisioned cluster!")]
```
Contributor:

"Un provisioned" sounds a bit awkward; maybe something like this is more consistent with the tone we take elsewhere?

Suggested change:

```diff
- #[error("Could not retrieve nodes configuration. Possible un provisioned cluster!")]
+ #[error("Could not retrieve nodes configuration. Have you provisioned this cluster?")]
```

```rust
}

if !answer {
    // all nodes has errored
```
Contributor:

Nit: *have! :-)

Suggested change:

```diff
- // all nodes has errored
+ // all nodes have returned error
```

```rust
};

let mut response = match client.get_metadata(req).await {
    Ok(response) => response.into_inner(),
```
Contributor:

I'm wondering if we should somehow inform the operator about what node was used to render the response? Maybe not by default but only if debug logging / verbose mode is active?
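A lightweight option (illustrative only; the field name and message are hypothetical, not PR code):

```rust
// Surface the responding node only at debug level, so normal output stays
// unchanged but verbose runs reveal which node's view was rendered.
tracing::debug!(node_id = %node_id, "rendering metadata obtained from this node");
```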

muhamadazmy force-pushed the pr2592 branch 3 times, most recently from 16571cc to d3e4791 on February 1, 2025 at 08:58
muhamadazmy (Contributor, Author):

Thank you @pcholakov for your very helpful review. I think I've addressed all your comments. Thanks again :)

muhamadazmy (Contributor, Author):

Thank you so much @tillrohrmann for your insightful review, as always. I have already addressed all your comments, especially the ones regarding the Logs version. I did some refactoring on how we fetch the logs.

For the provision command, I made sure to restore the logic regarding the addresses, and I also made sure the user can only provide one address. Before, I only used the first one, but it's clearer to error if multiple are given so there is no confusion over which one was provisioned.

I'd love it if you could do another pass over the code. Thanks again!
