Write tests to cover all possible states of the Discovery, NodeConnection, and discovered Node during the discovery process #349

emmacasolin · 2022-02-25T00:24:38Z

Specification

There are various edge cases that can occur during discovery that have not been covered by our tests. These edge cases are the combinations of the following variables:

A - State of the node we're trying to contact during discovery:

Node does not exist
Node exists and is online
Node exists but is offline
Node switches from online to offline during discovery

B - State of the Node Connections we have stored in the Node Connection Manager:

We already have an existing connection to the node we want to contact
We do not already have an existing connection to the node we want to contact

C - State of the Node Connection we're using to contact the node:

Ready
Destroyed
Connection is destroyed while we're using it

D - Stage of the discovery process we're in:

Contacting the node that we're currently discovering (i.e. next in the discovery queue)
Contacting a linked node to the node we're currently discovering (i.e. may or may not already be in the discovery queue)
Contacting a linked node to an identity we're currently discovering (i.e. may or may not already be in the discovery queue)

E - State of the discovery module:

Started
Stopping (finishing up the current task before we can stop)

Some of these situations are normal and should not change the behaviour seen in discovery, whereas others should lead to a known error being thrown and caught. What happens after the error is caught depends on D and E.

If D1 then the node should be removed from the discovery queue and added to the TTL to be revisited at a low priority - Generic Non-Blocking Task Management ("Queue") for discovery and nodes domains #329 and Add the ability to search across Disconnected Identities during Social Discovery #330
If D2 or D3 then the node should be added to the end of the discovery queue (if not already in it)

Subsequently

If E1 then we should start discovering the next vertex in the queue
If E2 then we should exit the discovery process at this point and finish stopping the discovery module

Known expected errors for the situations listed above:

A1 - we should not have an existing connection already so we must be in B2. When we try to find the node to create the connection, NodeConnectionManager.findNode() should throw ErrorNodeGraphNodeIdNotFound, which should be propagated to the discovery process where it is caught
A3+C1 - when we try to create the node connection, NodeConnection.createNodeConnection() should catch ErrorGRPCClientTimeout and rethrow it as ErrorNodeConnectionTimeout, which should be propagated to the discovery process where it is caught
C2 (and potentially C3) - when we try to create the node connection, NodeConnection.createNodeConnection() should throw ErrorNodeConnectionDestroyed, which should be propagated to the discovery process where it is caught
A4 should follow this sequence

                      ┌─────┐        ┌─────┐
                      │ N 1 │        │ N 2 │
                      └──┬──┘        └──┬──┘
  Discovery Queue        │              │
┌────────────────┐       │              │
│ T4  T3  T2  T1 ├─────► T1      PolykeyAgent.stop()
└────────────────┘       │              │
                         │              │
                         │              X
                   Discovery.stop()
                         │
                         │
                         │
                         T1 ───────────►
                         │      GRPCClient.createClient()
                         │          Retries for 20s
                         │
               ErrorNodeConnectionTimeout
                         │
                         │
                         │
             NodeConnectionManager.stop()
                         │
                         │
                         │
                  PolykeyAgent.stop()

We should aim to test as many of these combinations as possible. Some may require mocking due to the complexity or randomness of the series of events that lead to their occurrence, and some may not be able to be tested at all. Wherever possible the cases should be fully simulated as much as possible.

We already have tests covering the following combinations in tests/discovery/Discovery.test.ts:

'discovery by node' - A2+B2+C1+D1+E1, A2+B2+C1+D2+E1, A2+B1+C1+D1+E1, A2+B1+C1+D2+E1, A2+B1+C1+D3+E1
'discovery by identity' - A2+B2+C1+D3+E1
'discovery persistence across restarts' - A2+C1+E2 (due to timing it's difficult to tell what the states are for B and D, and they may change per execution)

Additional context

Issue for refactoring error handling inside the discovery domain (may discover more errors while addressing this issue that will need to be included in the tests here): Refactor error handling of failed Node Connections created from the Discovery domain #354

Tasks

Test the expected response to different errors using mocking (for D and E)
Test as many of the combinations of scenarios from above as is possible using as little mocking as possible (write comments for combinations that cannot be tested with the expected behaviour)

The text was updated successfully, but these errors were encountered:

CMCDragonkai · 2022-02-25T00:48:09Z

This issue needs a list of edge cases specified, for example:

Target node does not exist
Target node shutdown in the middle of discovery task
Discovery stop, task is being awaited for, and target node is shutdown - expect the NodeConnectionTimeout because it fails to establish a connection

For every edge case you need to post a diagram what you're attempting to do, and then also a prototype snippet, and also a test log where things didn't go right.

See the diagram here in #332 regarding node connection state transitions, this should be relevant to forming your model on what's happening in the discovery.

Then the issue title should be changed as well, to list the edgecases.

Also list any edgecase you've also covered in the tests. Make sure to write which test name it is.

emmacasolin · 2022-02-28T02:00:12Z

A test to cover A3 (the node we're trying to contact is offline) was written, however due to unexpected behaviour it was removed from the codebase. This is the test (inside tests/discovery/Discovery.test.ts):

  test('discovering an offline node', async () => {
    const discovery = await Discovery.createDiscovery({
      db,
      keyManager,
      gestaltGraph,
      identitiesManager,
      nodeManager,
      sigchain,
      logger,
    });
    const nodeOfflineId = nodeA.keyManager.getNodeId();
    await nodeA.stop();
    await discovery.queueDiscoveryByNode(nodeOfflineId);
    // Need to set node B
    await nodeGraph.setNode(nodeB.keyManager.getNodeId(), {
      host: nodeB.revProxy.getIngressHost(),
      port: nodeB.revProxy.getIngressPort(),
    });
    await discovery.queueDiscoveryByIdentity(testToken.providerId, identityId);
    const gestalt = await poll<Gestalt>(
      async () => {
        const gestalts = await poll<Array<Gestalt>>(
          async () => {
            return await gestaltGraph.getGestalts();
          },
          (_, result) => {
            if (result.length === 1) return true;
            return false;
          },
          100,
        );
        return gestalts[0];
      },
      (_, result) => {
        if (result === undefined) return false;
        if (Object.keys(result.matrix).length === 2) return true;
        return false;
      },
      100,
    );
    const gestaltMatrix = gestalt.matrix;
    const gestaltNodes = gestalt.nodes;
    const gestaltIdentities = gestalt.identities;
    expect(Object.keys(gestaltMatrix)).toHaveLength(1);
    expect(Object.keys(gestaltNodes)).toHaveLength(1);
    expect(Object.keys(gestaltIdentities)).toHaveLength(1);
    const gestaltString = JSON.stringify(gestalt);
    expect(gestaltString).not.toContain(nodesUtils.encodeNodeId(nodeOfflineId));
    expect(gestaltString).toContain(
      nodesUtils.encodeNodeId(nodeB.keyManager.getNodeId()),
    );
    expect(gestaltString).toContain(identityId);
    // Reverse side-effects
    await gestaltGraph.unsetNode(nodeB.keyManager.getNodeId());
    await gestaltGraph.unsetIdentity(testToken.providerId, identityId);
    await discovery.stop();
    await discovery.destroy();
    await nodeA.start({ password });
    await nodeGraph.unsetNode(nodeB.keyManager.getNodeId());
  });

It behaves as expected to begin with (i.e. ErrorConnectionEndTimeout is thrown and caught by discovery, which logs it out) however we then immediately enter the afterAll() block despite still being in the middle of the test...

The only other strange thing that happens in the test output is that this error shows up in between when the ErrorConnectionEndTimeout is thrown and when it is subsequently caught inside discovery

INFO:ConnectionForward 127.0.0.1:45699:Starting Connection Forward
WARN:ConnectionReverse 127.0.0.1:40538:Server Error: ErrorConnectionEndTimeout
INFO:ConnectionReverse 127.0.0.1:40538:Stopped Connection Reverse
ERROR:fwxProxy:Failed CONNECT to 127.0.0.1:45699 - ErrorConnectionStart: UTP_ECONNRESET
INFO:fwxProxy:Handling CONNECT to 127.0.0.1:45699
INFO:ConnectionForward 127.0.0.1:45699:Starting Connection Forward
  console.error
    Failed to connect to 127.0.0.1:45699/?nodeId=vl8gi849l5vp1popi0hcjii4j9pplm2qikplhevn5qlsrnva68ld0 through proxy 127.0.0.1:33535 with status 502

      at Object.<anonymous>.exports.log (node_modules/@grpc/grpc-js/src/logging.ts:57:13)
      at ClientRequest.<anonymous> (node_modules/@grpc/grpc-js/src/http_proxy.ts:269:9)

INFO:127.0.0.1:45699:Destroying NodeConnection
INFO:127.0.0.1:45699:Destroyed NodeConnection
ERROR:Discovery Test:Failed to discover vl8gi849l5vp1popi0hcjii4j9pplm2qikplhevn5qlsrnva68ld0 - ErrorNodeConnectionTimeout
INFO:Discovery Test:this log is inside the `afterAll`

The test continues running, meanwhile the afterAll() block gets stuck on trying to stop the fwdProxy until the test eventually fails because something we still need has been stopped:

 FAIL  tests/discovery/Discovery.test.ts (48.707 s)
  Discovery
    ✕ discovering an offline node (2232 ms)
    ○ skipped discovery readiness
    ○ skipped discovery by node
    ○ skipped discovery by identity
    ○ skipped updates previously discovered gestalts
    ○ skipped discovery persistence across restarts

  ● Discovery › discovering an offline node

    ErrorIdentitiesManagerNotRunning:

      146 |   }
      147 |
    > 148 |   @ready(new identitiesErrors.ErrorIdentitiesManagerNotRunning())
          |          ^
      149 |   public async getTokens(providerId: ProviderId): Promise<ProviderTokens> {
      150 |     return await this._transaction(async () => {
      151 |       const providerTokens = await this.db.get<ProviderTokens>(

      at Object.<anonymous> (src/identities/IdentitiesManager.ts:148:10)
      at Object.<anonymous> (src/identities/index.ts:1:1)

At which point we finally see ERROR:fwxProxy:Failed CONNECT to 127.0.0.1:45699 - ErrorConnectionStartTimeout and the afterAll() block continues.

Note that this entering the afterAll happens regardless of whether 'discovering an offline node' is the final test to run or not:

ERROR:Discovery Test:Failed to discover vk3pquhqo0qv5k0jt82bpsr3g9rmc3981h0msshr00685a0bk95s0 - ErrorNodeConnectionTimeout // logging the error from 'discovering an offline node', the discovery module should continue though
INFO:Discovery Test:Creating Discovery // suddenly we start running the next test
INFO:Discovery Test:Starting Discovery
INFO:Discovery Test:Started Discovery
INFO:Discovery Test:Created Discovery
INFO:Discovery Test:this log is inside the `afterAll` // but very quickly we go straight into the `afterAll`
INFO:nodeB:Stopping PolykeyAgent

I'm wondering if this is potentially related to #347 in some way?

CMCDragonkai · 2022-02-28T02:12:17Z

Btw that console.error is because you have to setup the grpc error handler independently in each test module that would execute the GRPC client or server. I forgot exactly where the example is, maybe in nodes tests.

emmacasolin · 2022-02-28T02:30:06Z

Btw that console.error is because you have to setup the grpc error handler independently in each test module that would execute the GRPC client or server. I forgot exactly where the example is, maybe in nodes tests.

Updated output using the correct logger setup:

INFO:ConnectionForward 127.0.0.1:34930:Starting Connection Forward
WARN:ConnectionReverse 127.0.0.1:53117:Server Error: ErrorConnectionEndTimeout
INFO:ConnectionReverse 127.0.0.1:53117:Stopped Connection Reverse
ERROR:fwxProxy:Failed CONNECT to 127.0.0.1:34930 - ErrorConnectionStart: UTP_ECONNRESET
ERROR:grpc:Failed to connect to 127.0.0.1:34930/?nodeId=va88qcpd50a1lhfa9nq5t4b1feub5bjmce8o1122foosv9ljmahb0 through proxy 127.0.0.1:36597 with status 502
INFO:fwxProxy:Handling CONNECT to 127.0.0.1:34930
INFO:ConnectionForward 127.0.0.1:34930:Starting Connection Forward
INFO:127.0.0.1:34930:Destroying NodeConnection
INFO:127.0.0.1:34930:Destroyed NodeConnection
ERROR:Discovery Test:Failed to discover va88qcpd50a1lhfa9nq5t4b1feub5bjmce8o1122foosv9ljmahb0 - ErrorNodeConnectionTimeout
INFO:Discovery Test:Creating Discovery
INFO:Discovery Test:Starting Discovery
INFO:Discovery Test:Started Discovery
INFO:Discovery Test:Created Discovery
INFO:Discovery Test:this log is inside the `afterAll`
INFO:nodeB:Stopping PolykeyAgent

emmacasolin added the development Standard development label Feb 25, 2022

CMCDragonkai added the design Requires design label Feb 25, 2022

emmacasolin mentioned this issue Feb 27, 2022

Refactor error handling of failed Node Connections created from the Discovery domain #354

Open

emmacasolin changed the title ~~Write tests for edge cases during Discovery~~ Write tests to cover all possible states of the Discovery, NodeConnection, and discovered Node during the discovery process Feb 28, 2022

CMCDragonkai mentioned this issue Apr 26, 2022

Generic Non-Blocking Task Management ("Queue") for discovery and nodes domains #329

Closed

CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy label Jul 24, 2022

CMCDragonkai mentioned this issue Nov 7, 2022

Discovery Refactoring - Derived from JOSE replacement #493

Closed

11 tasks

CMCDragonkai assigned tegefaulkes Jul 10, 2023

tegefaulkes removed their assignment Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write tests to cover all possible states of the Discovery, NodeConnection, and discovered Node during the discovery process #349

Write tests to cover all possible states of the Discovery, NodeConnection, and discovered Node during the discovery process #349

emmacasolin commented Feb 25, 2022 •

edited

Loading

CMCDragonkai commented Feb 25, 2022 •

edited

Loading

emmacasolin commented Feb 28, 2022

CMCDragonkai commented Feb 28, 2022

emmacasolin commented Feb 28, 2022

Write tests to cover all possible states of the Discovery, NodeConnection, and discovered Node during the discovery process #349

Write tests to cover all possible states of the Discovery, NodeConnection, and discovered Node during the discovery process #349

Comments

emmacasolin commented Feb 25, 2022 • edited Loading

Specification

Additional context

Tasks

CMCDragonkai commented Feb 25, 2022 • edited Loading

emmacasolin commented Feb 28, 2022

CMCDragonkai commented Feb 28, 2022

emmacasolin commented Feb 28, 2022

emmacasolin commented Feb 25, 2022 •

edited

Loading

CMCDragonkai commented Feb 25, 2022 •

edited

Loading