Ensure that exceptions during Discovery are correctly handled #348

Closed
emmacasolin opened this issue Feb 25, 2022 · 1 comment
Labels
development Standard development r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy

Comments

@emmacasolin

Specification

                      ┌─────┐        ┌─────┐
                      │ N 1 │        │ N 2 │
                      └──┬──┘        └──┬──┘
  Discovery Queue        │              │
┌────────────────┐       │              │
│ T4  T3  T2  T1 ├─────► T1      PolykeyAgent.stop()
└────────────────┘       │              │
                         │              │
                         │              X
                   Discovery.stop()
                         │
                         │
                         │
                         T1 ───────────►
                         │      GRPCClient.createClient()
                         │          Retries for 20s
                         │
               ErrorNodeConnectionTimeout
                         │
                         │
                         │
             NodeConnectionManager.stop()
                         │
                         │
                         │
                  PolykeyAgent.stop()

When stopping the Discovery domain, we need to wait for the current task T1 to finish (i.e. one iteration of the discovery queue, where we discover a node/identity and its linked nodes/identities). This is because we don't yet have the ability to abort asynchronous side-effectful tasks that are already in progress; that work is scheduled in #297. The task itself involves establishing a node connection to the remote agent N 2. However, an edge case that we have not fully considered is one where N 2 has shut down and is no longer running. In such a situation, the connection timeout, which is passed from NodeConnectionManager to NodeConnection to GRPCClientAgent to GRPCClient, determines how long we wait for connection readiness (and thus how long until we can catch an error and exit the discovery process). This timeout is set to 20s for NodeConnectionManager, and that value is propagated to all connection timeouts.
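
To make the blocking behaviour concrete, here is a minimal sketch of why `stop()` ends up waiting for the full connection timeout. All names in it (`DiscoverySketch`, `withTimeout`, `ErrorConnectionTimeout`) are hypothetical stand-ins for illustration, not the actual Polykey classes, and the 20s value is shortened in the usage note so it can be run quickly.

```typescript
// Hypothetical stand-in error; not the real Polykey error class.
class ErrorConnectionTimeout extends Error {}

// Rejects with ErrorConnectionTimeout if `p` has not settled within `ms`.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new ErrorConnectionTimeout(`timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (v) => {
        clearTimeout(timer);
        resolve(v);
      },
      (e) => {
        clearTimeout(timer);
        reject(e);
      },
    );
  });
}

// Hypothetical, heavily simplified discovery loop with one in-flight task.
class DiscoverySketch {
  protected current: Promise<void> = Promise.resolve();
  protected running = true;

  // Starts one queue iteration. The rejection is swallowed here only so
  // that stop() can await the task safely; real code should log it.
  public runTask(task: () => Promise<void>): void {
    if (!this.running) return;
    this.current = task().catch(() => {});
  }

  // Without cancellation, stop() can only wait for the in-flight task,
  // so it takes as long as the connection timeout in the worst case.
  public async stop(): Promise<void> {
    this.running = false;
    await this.current;
  }
}
```

If the task is `() => withTimeout(new Promise<void>(() => {}), 20000)` (a connection that never becomes ready), then `await discovery.stop()` resolves only after the timeout fires.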

In instances of this behaviour, you'll see retried attempts to connect through the proxy. Then the ErrorGRPCClientTimeout should be thrown, which is then rethrown as ErrorNodeConnectionTimeout. You should get this exception on withConnF, which is used by requestChainData in NodeManager, which is called by Discovery.
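
The expected exception path can be sketched roughly as follows. The error names `ErrorGRPCClientTimeout` and `ErrorNodeConnectionTimeout` and the method name `withConnF` come from this issue, but the class and function bodies here are simplified stand-ins written for illustration; `createClient`, `createNodeConnection`, `discoverVertex` and the `causedBy` field are assumptions, not the real Polykey signatures.

```typescript
// Stand-in classes named after the errors mentioned in this issue.
class ErrorGRPCClientTimeout extends Error {}
class ErrorNodeConnectionTimeout extends Error {
  // `causedBy` is a hypothetical field used to keep the original error.
  constructor(message: string, public readonly causedBy?: unknown) {
    super(message);
  }
}

// Stand-in for client creation against a node that has shut down.
async function createClient(): Promise<never> {
  throw new ErrorGRPCClientTimeout('client not ready in time');
}

// Stand-in for the node connection layer: rethrows the client timeout
// under the node-level error type.
async function createNodeConnection(): Promise<never> {
  try {
    return await createClient();
  } catch (e) {
    if (e instanceof ErrorGRPCClientTimeout) {
      throw new ErrorNodeConnectionTimeout('node connection timed out', e);
    }
    throw e;
  }
}

// Stand-in for withConnF: acquires a connection, then runs `f`.
async function withConnF<T>(f: () => Promise<T>): Promise<T> {
  await createNodeConnection();
  return f();
}

// Stand-in for one discovery step: logs the timeout and recovers,
// while letting any unexpected error propagate.
async function discoverVertex(log: (msg: string) => void): Promise<boolean> {
  try {
    await withConnF(async () => 'chain data');
    return true;
  } catch (e) {
    if (e instanceof ErrorNodeConnectionTimeout) {
      log(`discovery skipped vertex: ${e.message}`);
      return false;
    }
    throw e;
  }
}
```

In this sketch the discovery step returns `false` and emits one log line when the remote node is down, which is the behaviour we want to confirm in the real code.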

We need to confirm that this is indeed the sequence of events in practice, and we need to ensure that the errors are correctly caught and logged.

Additional context

Tasks

  1. In our Discovery, the default timeout shouldn't be 20s; that's too long. The withConnF method should be able to override the default timeout set in NodeConnectionManager, for example by accepting a timeout value as a parameter.
  2. We need "Asynchronous Promise Cancellation with Cancellable Promises, AbortController and Generic Timer" (#297) so we can actually stop T1 when we stop the discovery instead of waiting for it to finish. Until then, if T1 finishes after stopping has begun, ensure that T1 is removed from the DB so that the work is not redone.
  3. We need "Integrate Error Chaining in a js-errors or @matrixai/errors package" (#304) so we can have a clearer error trace and more easily see how the exceptions form. It is possible that this will expose more edge cases.
  4. The discovery must log every exception that occurs, even if it recovers from it, similar to how the network proxies report exceptions.
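
As a rough illustration of tasks 1 and 2, a per-call timeout override and an abort signal could look something like the sketch below. This is only a sketch under assumptions: `ConnectionManagerSketch`, the `options` shape, and `connectToDeadNode` are hypothetical names, and the real `withConnF` signature may differ. It relies only on the standard `AbortController`/`AbortSignal` API.

```typescript
// Hypothetical manager; the default mirrors the 20s mentioned above.
class ConnectionManagerSketch {
  constructor(protected readonly defaultTimeout: number = 20000) {}

  // `options.timeout` overrides the manager default for this call only;
  // `options.signal` lets the caller cancel the attempt early.
  public async withConnF<T>(
    connect: (timeout: number, signal?: AbortSignal) => Promise<T>,
    options: { timeout?: number; signal?: AbortSignal } = {},
  ): Promise<T> {
    const timeout = options.timeout ?? this.defaultTimeout;
    return connect(timeout, options.signal);
  }
}

// Simulates connecting to a node that has shut down: it never succeeds,
// and rejects on either the timeout or an abort, whichever comes first.
function connectToDeadNode(
  timeout: number,
  signal?: AbortSignal,
): Promise<never> {
  return new Promise<never>((_, reject) => {
    if (signal?.aborted) {
      reject(new Error('aborted'));
      return;
    }
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeout}ms`)),
      timeout,
    );
    signal?.addEventListener(
      'abort',
      () => {
        clearTimeout(timer);
        reject(new Error('aborted'));
      },
      { once: true },
    );
  });
}
```

With this shape, Discovery could call `withConnF(connectToDeadNode, { timeout: 2000, signal })` and call `abort()` on the matching controller from `Discovery.stop()`, so that stopping no longer waits for the full default timeout.
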
@emmacasolin

This issue was too vague and has subsequently been split into two separate issues:

  1. Reduce the timeout for establishing a Node Connection within the Discovery domain (by adding timer override to NodeConnectionManager) #353 - for reducing the startup timeout for node connections created by the discovery domain
  2. Refactor error handling of failed Node Connections created from the Discovery domain #354 - for logging the exceptions that occur during discovery

Comments have also been added to the descriptions of #297 and #304 with respect to how those issues relate to this one.

Closing this issue now.

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy label Jul 24, 2022