Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write tests to cover all possible states of the Discovery, NodeConnection, and discovered Node during the discovery process #349

Open
emmacasolin opened this issue Feb 25, 2022 · 4 comments
Labels
design Requires design development Standard development r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy

Comments

@emmacasolin
Copy link
Contributor

emmacasolin commented Feb 25, 2022

Specification

There are various edge cases that can occur during discovery that have not been covered by our tests. These edge cases are the combinations of the following variables:

A - State of the node we're trying to contact during discovery:

  1. Node does not exist
  2. Node exists and is online
  3. Node exists but is offline
  4. Node switches from online to offline during discovery

B - State of the Node Connections we have stored in the Node Connection Manager:

  1. We already have an existing connection to the node we want to contact
  2. We do not already have an existing connection to the node we want to contact

C - State of the Node Connection we're using to contact the node:

  1. Ready
  2. Destroyed
  3. Connection is destroyed while we're using it

D - Stage of the discovery process we're in:

  1. Contacting the node that we're currently discovering (i.e. next in the discovery queue)
  2. Contacting a linked node to the node we're currently discovering (i.e. may or may not already be in the discovery queue)
  3. Contacting a linked node to an identity we're currently discovering (i.e. may or may not already be in the discovery queue)

E - State of the discovery module:

  1. Started
  2. Stopping (finishing up the current task before we can stop)

Some of these situations are normal and should not change the behaviour seen in discovery, whereas others should lead to a known error being thrown and caught. What happens after the error is caught depends on D and E.

Subsequently

  • If E1 then we should start discovering the next vertex in the queue
  • If E2 then we should exit the discovery process at this point and finish stopping the discovery module

Known expected errors for the situations listed above:

  • A1 - we should not have an existing connection already so we must be in B2. When we try to find the node to create the connection, NodeConnectionManager.findNode() should throw ErrorNodeGraphNodeIdNotFound, which should be propagated to the discovery process where it is caught
  • A3+C1 - when we try to create the node connection, NodeConnection.createNodeConnection() should catch ErrorGRPCClientTimeout and rethrow it as ErrorNodeConnectionTimeout, which should be propagated to the discovery process where it is caught
  • C2 (and potentially C3) - when we try to create the node connection, NodeConnection.createNodeConnection() should throw ErrorNodeConnectionDestroyed, which should be propagated to the discovery process where it is caught
  • A4 should follow this sequence
                      ┌─────┐        ┌─────┐
                      │ N 1 │        │ N 2 │
                      └──┬──┘        └──┬──┘
  Discovery Queue        │              │
┌────────────────┐       │              │
│ T4  T3  T2  T1 ├─────► T1      PolykeyAgent.stop()
└────────────────┘       │              │
                         │              │
                         │              X
                   Discovery.stop()
                         │
                         │
                         │
                         T1 ───────────►
                         │      GRPCClient.createClient()
                         │          Retries for 20s
                         │
               ErrorNodeConnectionTimeout
                         │
                         │
                         │
             NodeConnectionManager.stop()
                         │
                         │
                         │
                  PolykeyAgent.stop()

We should aim to test as many of these combinations as possible. Some may require mocking due to the complexity or randomness of the series of events that lead to their occurrence, and some may not be able to be tested at all. Wherever possible the cases should be fully simulated as much as possible.

We already have tests covering the following combinations in tests/discovery/Discovery.test.ts:

  • 'discovery by node' - A2+B2+C1+D1+E1, A2+B2+C1+D2+E1, A2+B1+C1+D1+E1, A2+B1+C1+D2+E1, A2+B1+C1+D3+E1
  • 'discovery by identity' - A2+B2+C1+D3+E1
  • 'discovery persistence across restarts' - A2+C1+E2 (due to timing it's difficult to tell what the states are for B and D, and they may change per execution)

Additional context

Tasks

  1. Test the expected response to different errors using mocking (for D and E)
  2. Test as many of the combinations of scenarios from above as is possible using as little mocking as possible (write comments for combinations that cannot be tested with the expected behaviour)
@emmacasolin emmacasolin added the development Standard development label Feb 25, 2022
@CMCDragonkai
Copy link
Member

CMCDragonkai commented Feb 25, 2022

This issue needs a list of edge cases specified, for example:

  1. Target node does not exist
  2. Target node shutdown in the middle of discovery task
  3. Discovery stop, task is being awaited for, and target node is shutdown - expect the NodeConnectionTimeout because it fails to establish a connection

For every edge case you need to post a diagram what you're attempting to do, and then also a prototype snippet, and also a test log where things didn't go right.

See the diagram here in #332 regarding node connection state transitions, this should be relevant to forming your model on what's happening in the discovery.

Then the issue title should be changed as well, to list the edgecases.

Also list any edgecase you've also covered in the tests. Make sure to write which test name it is.

@CMCDragonkai CMCDragonkai added the design Requires design label Feb 25, 2022
@emmacasolin emmacasolin changed the title Write tests for edge cases during Discovery Write tests to cover all possible states of the Discovery, NodeConnection, and discovered Node during the discovery process Feb 28, 2022
@emmacasolin
Copy link
Contributor Author

A test to cover A3 (the node we're trying to contact is offline) was written, however due to unexpected behaviour it was removed from the codebase. This is the test (inside tests/discovery/Discovery.test.ts):

  test('discovering an offline node', async () => {
    const discovery = await Discovery.createDiscovery({
      db,
      keyManager,
      gestaltGraph,
      identitiesManager,
      nodeManager,
      sigchain,
      logger,
    });
    const nodeOfflineId = nodeA.keyManager.getNodeId();
    await nodeA.stop();
    await discovery.queueDiscoveryByNode(nodeOfflineId);
    // Need to set node B
    await nodeGraph.setNode(nodeB.keyManager.getNodeId(), {
      host: nodeB.revProxy.getIngressHost(),
      port: nodeB.revProxy.getIngressPort(),
    });
    await discovery.queueDiscoveryByIdentity(testToken.providerId, identityId);
    const gestalt = await poll<Gestalt>(
      async () => {
        const gestalts = await poll<Array<Gestalt>>(
          async () => {
            return await gestaltGraph.getGestalts();
          },
          (_, result) => {
            if (result.length === 1) return true;
            return false;
          },
          100,
        );
        return gestalts[0];
      },
      (_, result) => {
        if (result === undefined) return false;
        if (Object.keys(result.matrix).length === 2) return true;
        return false;
      },
      100,
    );
    const gestaltMatrix = gestalt.matrix;
    const gestaltNodes = gestalt.nodes;
    const gestaltIdentities = gestalt.identities;
    expect(Object.keys(gestaltMatrix)).toHaveLength(1);
    expect(Object.keys(gestaltNodes)).toHaveLength(1);
    expect(Object.keys(gestaltIdentities)).toHaveLength(1);
    const gestaltString = JSON.stringify(gestalt);
    expect(gestaltString).not.toContain(nodesUtils.encodeNodeId(nodeOfflineId));
    expect(gestaltString).toContain(
      nodesUtils.encodeNodeId(nodeB.keyManager.getNodeId()),
    );
    expect(gestaltString).toContain(identityId);
    // Reverse side-effects
    await gestaltGraph.unsetNode(nodeB.keyManager.getNodeId());
    await gestaltGraph.unsetIdentity(testToken.providerId, identityId);
    await discovery.stop();
    await discovery.destroy();
    await nodeA.start({ password });
    await nodeGraph.unsetNode(nodeB.keyManager.getNodeId());
  });

It behaves as expected to begin with (i.e. ErrorConnectionEndTimeout is thrown and caught by discovery, which logs it out) however we then immediately enter the afterAll() block despite still being in the middle of the test...

The only other strange thing that happens in the test output is that this error shows up in between when the ErrorConnectionEndTimeout is thrown and when it is subsequently caught inside discovery

INFO:ConnectionForward 127.0.0.1:45699:Starting Connection Forward
WARN:ConnectionReverse 127.0.0.1:40538:Server Error: ErrorConnectionEndTimeout
INFO:ConnectionReverse 127.0.0.1:40538:Stopped Connection Reverse
ERROR:fwxProxy:Failed CONNECT to 127.0.0.1:45699 - ErrorConnectionStart: UTP_ECONNRESET
INFO:fwxProxy:Handling CONNECT to 127.0.0.1:45699
INFO:ConnectionForward 127.0.0.1:45699:Starting Connection Forward
  console.error
    Failed to connect to 127.0.0.1:45699/?nodeId=vl8gi849l5vp1popi0hcjii4j9pplm2qikplhevn5qlsrnva68ld0 through proxy 127.0.0.1:33535 with status 502

      at Object.<anonymous>.exports.log (node_modules/@grpc/grpc-js/src/logging.ts:57:13)
      at ClientRequest.<anonymous> (node_modules/@grpc/grpc-js/src/http_proxy.ts:269:9)

INFO:127.0.0.1:45699:Destroying NodeConnection
INFO:127.0.0.1:45699:Destroyed NodeConnection
ERROR:Discovery Test:Failed to discover vl8gi849l5vp1popi0hcjii4j9pplm2qikplhevn5qlsrnva68ld0 - ErrorNodeConnectionTimeout
INFO:Discovery Test:this log is inside the `afterAll`

The test continues running, meanwhile the afterAll() block gets stuck on trying to stop the fwdProxy until the test eventually fails because something we still need has been stopped:

 FAIL  tests/discovery/Discovery.test.ts (48.707 s)
  Discovery
    ✕ discovering an offline node (2232 ms)
    ○ skipped discovery readiness
    ○ skipped discovery by node
    ○ skipped discovery by identity
    ○ skipped updates previously discovered gestalts
    ○ skipped discovery persistence across restarts

  ● Discovery › discovering an offline node

    ErrorIdentitiesManagerNotRunning:

      146 |   }
      147 |
    > 148 |   @ready(new identitiesErrors.ErrorIdentitiesManagerNotRunning())
          |          ^
      149 |   public async getTokens(providerId: ProviderId): Promise<ProviderTokens> {
      150 |     return await this._transaction(async () => {
      151 |       const providerTokens = await this.db.get<ProviderTokens>(

      at Object.<anonymous> (src/identities/IdentitiesManager.ts:148:10)
      at Object.<anonymous> (src/identities/index.ts:1:1)

At which point we finally see ERROR:fwxProxy:Failed CONNECT to 127.0.0.1:45699 - ErrorConnectionStartTimeout and the afterAll() block continues.

Note that this entering the afterAll happens regardless of whether 'discovering an offline node' is the final test to run or not:

ERROR:Discovery Test:Failed to discover vk3pquhqo0qv5k0jt82bpsr3g9rmc3981h0msshr00685a0bk95s0 - ErrorNodeConnectionTimeout // logging the error from 'discovering an offline node', the discovery module should continue though
INFO:Discovery Test:Creating Discovery // suddenly we start running the next test
INFO:Discovery Test:Starting Discovery
INFO:Discovery Test:Started Discovery
INFO:Discovery Test:Created Discovery
INFO:Discovery Test:this log is inside the `afterAll` // but very quickly we go straight into the `afterAll`
INFO:nodeB:Stopping PolykeyAgent

I'm wondering if this is potentially related to #347 in some way?

@CMCDragonkai
Copy link
Member

Btw that console.error is because you have to setup the grpc error handler independently in each test module that would execute the GRPC client or server. I forgot exactly where the example is, maybe in nodes tests.

@emmacasolin
Copy link
Contributor Author

Btw that console.error is because you have to setup the grpc error handler independently in each test module that would execute the GRPC client or server. I forgot exactly where the example is, maybe in nodes tests.

Updated output using the correct logger setup:

INFO:ConnectionForward 127.0.0.1:34930:Starting Connection Forward
WARN:ConnectionReverse 127.0.0.1:53117:Server Error: ErrorConnectionEndTimeout
INFO:ConnectionReverse 127.0.0.1:53117:Stopped Connection Reverse
ERROR:fwxProxy:Failed CONNECT to 127.0.0.1:34930 - ErrorConnectionStart: UTP_ECONNRESET
ERROR:grpc:Failed to connect to 127.0.0.1:34930/?nodeId=va88qcpd50a1lhfa9nq5t4b1feub5bjmce8o1122foosv9ljmahb0 through proxy 127.0.0.1:36597 with status 502
INFO:fwxProxy:Handling CONNECT to 127.0.0.1:34930
INFO:ConnectionForward 127.0.0.1:34930:Starting Connection Forward
INFO:127.0.0.1:34930:Destroying NodeConnection
INFO:127.0.0.1:34930:Destroyed NodeConnection
ERROR:Discovery Test:Failed to discover va88qcpd50a1lhfa9nq5t4b1feub5bjmce8o1122foosv9ljmahb0 - ErrorNodeConnectionTimeout
INFO:Discovery Test:Creating Discovery
INFO:Discovery Test:Starting Discovery
INFO:Discovery Test:Started Discovery
INFO:Discovery Test:Created Discovery
INFO:Discovery Test:this log is inside the `afterAll`
INFO:nodeB:Stopping PolykeyAgent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Requires design development Standard development r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy
Development

No branches or pull requests

3 participants