Protocol stuck debug #882

Closed
ppca opened this issue Oct 10, 2024 · 3 comments · Fixed by #883
Assignees: ppca
Labels: Emerging Tech, Near BOS

Comments

ppca (Contributor) commented Oct 10, 2024

We saw on mainnet, testnet, and dev that our protocol could get stuck, meaning that we don't see any logic running inside the loop in MpcSignProtocol.run().

This essentially means the nodes will not respond to triple, presignature, or signature generation at all.

ppca added the Near BOS and Emerging Tech labels Oct 10, 2024
ppca self-assigned this Oct 10, 2024
ppca (Contributor, Author) commented Oct 10, 2024

I've spent the last 3 days digging into mainnet and dev being stuck.
The common thread between them: 1) one node was effectively unavailable; 2) the iteration count of the protocol is stuck in both.

mainnet iter count: the timestamp at which it got stuck is the same as when the 502s start happening on /msg for the aurora node. (Screenshot: 2024-10-09 at 7:04:47 PM)

I noted down most of the evidence and links in the dev debug doc and the mainnet debug doc.

Findings are as follows:

  1. When one node is offline, i.e. /state cannot be reached, the other nodes will be stuck, or at least progress very slowly (1 iteration every few minutes, compared to hundreds per minute normally), because there is no timeout on the call to get /state:
    let Ok(resp) = self.http.get(url.clone()).send().await else {
    So this call can take arbitrarily long to return an error. After adding a 2s timeout, the problem is solved (see the sketch after this list).
  2. When one node's /msg endpoint cannot be reached, like the /msg 502 errors on the aurora mainnet node, the same issue happens: other nodes can take arbitrarily long for the /msg call to return an error. The fix is to add a timeout there too.
  3. When fetching the contract state/config is not successful, all triple/presig/sig logic is skipped, so we are essentially stuck. This cannot be helped, because we need the most up-to-date contract state from NEAR RPC calls.
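
For reference, a minimal sketch of the timeout fix, assuming a reqwest client like the one in the snippet above (the function name, the `tracing` logging, and the exact 2s value are illustrative; the real change is in the PR below):

```rust
use std::time::Duration;

/// Sketch: fetch a peer's /state (the same idea applies to /msg) with a
/// bounded wait, so one unreachable node cannot stall the protocol loop.
async fn fetch_peer_state(
    http: &reqwest::Client,
    url: reqwest::Url,
) -> Option<reqwest::Response> {
    let resp = http
        .get(url)
        // Per-request timeout: give up after 2 seconds instead of waiting
        // arbitrarily long on a connection that may never complete.
        .timeout(Duration::from_secs(2))
        .send()
        .await;
    match resp {
        Ok(resp) => Some(resp),
        Err(err) => {
            // Treat a slow or unreachable peer as unavailable and move on.
            tracing::warn!(?err, "failed to fetch peer /state");
            None
        }
    }
}
```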

PR here: https://github.com/near/mpc/pull/883/files

I have not figured out:

  1. why aurora got into the 502s in the first place. We did not have any logs for them at that time; we have since fixed that by asking them to adjust their GCP settings. Thanks to @kmaus-near

ppca moved this from Backlog to In Progress in Emerging Technologies Oct 10, 2024
volovyks (Collaborator) commented:

> This cannot be helped, because we need the most up-to-date contract state from NEAR RPC calls.

Is our node reporting itself as "unavailable" when other nodes are trying to ping it?

ppca (Contributor, Author) commented Oct 10, 2024

> This cannot be helped, because we need the most up-to-date contract state from NEAR RPC calls. Is our node reporting itself as "unavailable" when other nodes are trying to ping it?

No, I mean each node needs to call NEAR RPC to get the latest contract state, and there's no way around doing that. So if NEAR RPC is somehow not accepting calls to get the contract state, then our protocol will be stuck. And since we need that updated contract state, we cannot do anything about this.
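
To illustrate the shape of that dependency (not the actual code: the function, closure, and names here are hypothetical, and `anyhow`/`tokio` are assumed dependencies), an iteration essentially has to bail out whenever the contract state cannot be fetched from NEAR RPC:

```rust
use std::{future::Future, time::Duration};

// Hypothetical sketch: each iteration of the protocol loop needs the latest
// contract state, so a failed NEAR RPC fetch means the iteration is skipped
// entirely. The RPC fetcher is abstracted here as a closure.
async fn run_loop<F, Fut, State>(mut fetch_contract_state: F)
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<State, anyhow::Error>>,
{
    loop {
        let state = match fetch_contract_state().await {
            Ok(state) => state,
            Err(err) => {
                // Without an up-to-date contract state there is nothing to do:
                // no triple, presignature, or signature generation this round.
                eprintln!("failed to fetch contract state from NEAR RPC: {err:?}");
                tokio::time::sleep(Duration::from_secs(1)).await;
                continue;
            }
        };
        // ... run triple / presignature / signature generation against `state` ...
        drop(state);
    }
}
```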

ppca closed this as completed in #883 Oct 10, 2024
github-project-automation bot moved this from In Progress to Done in Emerging Technologies Oct 10, 2024