Seed Agent Cluster Auto-Configuration #403

Closed
CMCDragonkai opened this issue Jul 9, 2022 · 3 comments
Labels
development (Standard development), r&d:polykey:core activity 3 (Peer to Peer Federated Hierarchy), r&d:polykey:core activity 4 (End to End Networking behind Consumer NAT Devices)

Comments


CMCDragonkai commented Jul 9, 2022

Specification

The seed node cluster behind mainnet.polykey.io and testnet.polykey.io requires some auto-configuration so that the seed nodes can gain knowledge of each other and share their DHT workload, which includes signalling and relaying.

Currently seed nodes are launched without knowledge of any other seed nodes. This makes sense, since they are the first seed nodes. However, as we scale the number of seed nodes, it would make sense for seed nodes to automatically discover each other and establish connections. This would make it easier to launch clusters of seed nodes.

There are several challenges here and questions we must work out:

  • Does it mean it is possible to run multiple seed nodes with the same NodeID?
  • If we need to have multiple seed nodes, we must then pregenerate their root keys and preserve their recovery codes (see Testnet securely maintain a pool of recovery codes #285)
  • If multiple seed nodes have different NodeIDs, are their root keys connected to each other in a trust relationship (either hierarchically via PKI, or a loose mesh via the gestalt graph + root chain)?
    • How does this impact how this trust information is propagated eventually-consistently across the network?
    • How does this deal with attacks/impersonation/DHT poisoning/sybil...?
    • How does this deal with revocation?
    • What does this mean for our default seed node list that is configured in the PK software distribution?
  • If seed nodes are scaled up and down, how do they acquire their recovery keys securely and without conflict?
  • When seed nodes need to discover each other automatically, we have to use one of the auto-configuration networking technologies.
  • If the seed cluster is all behind 1 IP address/hostname (like our NLB), this means:
    • Multiple node ids - multiple host names - multiple IP addresses
    • Multiple node ids to 1 IP address
    • Multiple node ids to 1 host name
    • 1 hostname can resolve to multiple IP addresses (randomly too)
    • The same node id on multiple IP addresses and multiple host names
    • Testnet Deployment via CI/CD #396 (comment) - discussion on the multi-level complexity of AWS
  • Using a network load balancer means we need to preserve stickiness for "flows"; we must ensure that this doesn't break our network connections mid-flight and mid-conversation.
    • AWS sets a 120s timeout for a UDP flow; this is not configurable.
    • AWS load balances according to origin IP address, and maintains the stickiness for the lifetime of a flow
    • The stickiness must be preserved from the NLB to multiple listeners, from listener to multiple target groups, and from target group to multiple targets.
    • Testnet Deployment via CI/CD #396 (comment) - discussion about how stickiness works on NLB
  • Load balancing introduces network proxies. These network proxies must preserve the client IP address; otherwise, NAT-busting signalling will not work (see the sketch below).
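
To make the last point concrete, here is a minimal sketch (not Polykey's actual signalling code) of why the source address observed by a seed node matters; the registry, payload format, and port are assumptions for illustration only:

```ts
import dgram from 'node:dgram';

// Hypothetical registry of observed public addresses, keyed by node ID.
const observedAddresses = new Map<string, { host: string; port: number }>();

const socket = dgram.createSocket('udp4');

socket.on('message', (msg, rinfo) => {
  // `rinfo` is the source IP/port as seen by this seed node. If an NLB or
  // proxy rewrites the source address, what gets recorded here is the
  // proxy's address, and hole-punching signalling towards the real client
  // cannot work.
  const nodeId = msg.toString('utf8'); // assume the payload is just a node ID
  observedAddresses.set(nodeId, { host: rinfo.address, port: rinfo.port });
});

socket.bind(1314); // arbitrary port for the sketch
```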

Additional context

Tasks

  1. Research DNS load balancing as an alternative
  2. Work out how distributed PK with multiple nodes sharing the same IP address will work
  3. Answer every question above
CMCDragonkai (Member Author) commented:


We've removed the load balancer at this point, and are now using DNS load balancing instead.

This is because the load balancer implies a sort of distributed PK, while PK currently is designed to be decentralised but not distributed.

By distributed we mean that multiple nodes could share the same NodeId, and shard/replicate their data between the nodes so they act in unison. This is quite complex because we have lots of state that would need to be replicated or sharded. Plus our NodeGraph would need to be updated to use a more sophisticated and ambiguous key path, which would be NodeId/Host/Port -> {}.
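
As a rough illustration of that key-path change (type names here are placeholders, not the real NodeGraph types):

```ts
// Illustrative only; not the actual NodeGraph implementation.
type NodeId = string; // placeholder; Polykey's real NodeId is a typed identifier

// Today (roughly): one NodeId keys a single address entry.
type NodeGraphCurrent = Map<NodeId, { host: string; port: number }>;

// A distributed PK would need a composite NodeId/Host/Port key, since the
// same NodeId could be reachable at multiple addresses at once.
type NodeGraphDistributedKey = `${NodeId}/${string}/${number}`;
type NodeGraphDistributed = Map<NodeGraphDistributedKey, {}>;
```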

Since it's not a priority, the NLB has been removed and instead we focus on Cloudflare DNS load balancing.
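
From the client's side, DNS load balancing just means a single hostname can resolve to several A records. A minimal sketch of what resolution could look like; whether multiple records actually come back depends on the Cloudflare configuration at query time:

```ts
import { resolve4 } from 'node:dns/promises';

// A single hostname may return multiple A records, potentially in a
// rotated order, which spreads clients across the seed nodes.
const addresses = await resolve4('testnet.polykey.io');
for (const address of addresses) {
  console.log('candidate seed node address:', address);
}
```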

For this particular issue we can address the above questions:

  • Does it mean it is possible to run multiple seed nodes with the same NodeID?

    No.

  • If we need to have multiple seed nodes, we must then pregenerate their root keys and preserve their recovery codes (see Testnet securely maintain a pool of recovery codes #285)

    This is being done manually by pregenerating recovery codes and pulling them from AWS Secrets Manager. Later on, these secrets in the Secrets Manager will be pushed from a local PK node.

  • If multiple seed nodes have different NodeIDs, are their root keys connected to each other in a trust relationship (either hierarchically via PKI, or a loose mesh via the gestalt graph + root chain)?

    Each node will have its own NodeId. DNS load balancing will shard the distribution of signalling and relaying.
    The trusted seed node list will be placed into the source code (a sketch of what this could look like follows this list). There won't be any PKI nor gestalt links between the seed nodes at this point in time.
    This means seed nodes are not dynamically scaled at this point in time.

  • If seed nodes are scaled up and down, how do they acquire their recovery keys securely and without conflict?

    Done manually by creating a set of recovery codes ahead of time.
    This means seed nodes are not dynamically scaled at this point in time.

  • When seed nodes need to discover each other automatically, we have to use one of the auto-configuration networking technologies.

    Because the seed node list will be fixed, they will use each other's public IP address for now until we can have local discovery too.

  • If the seed cluster is all behind 1 IP address/hostname (like our NLB), this means:

    PK is not currently distributed. It's decentralised, so each seed node will have its own public IP address.

  • Using a network load balancer means we need to preserve stickiness for "flows"; we must ensure that this doesn't break our network connections mid-flight and mid-conversation.

    Not relevant anymore, since there is no longer an NLB in front.

  • Load balancing introduces network proxies. These network proxies must preserve the client IP address; otherwise, NAT-busting signalling will not work.

    Not relevant anymore, since there is no longer an NLB in front.
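
To make the "seed node list in source code" answer above concrete, here is a sketch of what such a hardcoded list could look like; the node ID, hostname, and port below are placeholders, not the actual trusted seed nodes:

```ts
// Sketch only: a fixed, source-controlled mapping from seed NodeId to address.
type SeedNodes = Record<string, { host: string; port: number }>;

const defaultSeedNodes: SeedNodes = {
  '<pregenerated-seed-node-id-1>': {
    host: 'seed-1.testnet.polykey.io', // placeholder hostname
    port: 1314, // placeholder port
  },
  // ...one entry per pregenerated seed NodeId, each with its own public IP/hostname
};
```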

CMCDragonkai (Member Author) commented:

The only auto-configuration being done now is:

  1. Testnet securely maintain a pool of recovery codes #285 - using AWS Secrets Manager to produce, ahead of time, a fixed set of Node IDs, recovery codes and passwords to be trusted.
  2. Infrastructure setup for testnet should automate multiple instances for multiple nodes #488 - creating multiple EC2 instances and orchestrating an instance per node (and this means there's 1 TD, 1 Service, 1 Task, 1 Instance, 1 EIP for every node)

That means there's not that much "automation" happening. Any more automation would require something to actually orchestrate the above mechanisms, and that will require further work with Pulumi, plus integration with AWS EventBridge and Lambdas to be able to "automate" AWS.
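
For reference, a minimal sketch of the pregenerated-secrets flow using the AWS SDK for JavaScript v3; the region, secret name, and JSON shape are assumptions, not the actual layout tracked in #285:

```ts
import {
  SecretsManagerClient,
  GetSecretValueCommand,
} from '@aws-sdk/client-secrets-manager';

// Fetch a pregenerated recovery code and password for one seed node.
const client = new SecretsManagerClient({ region: 'ap-southeast-2' });
const response = await client.send(
  new GetSecretValueCommand({ SecretId: 'polykey/seed-node-1/recovery-code' }),
);
const { recoveryCode, password } = JSON.parse(response.SecretString ?? '{}');
// The recovery code deterministically regenerates the node's root key pair,
// so a recreated seed node comes back with the same NodeId.
```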

CMCDragonkai (Member Author) commented:

In this case this can be closed, since all the questions have been answered, and we know what our immediate next steps are. @tegefaulkes

CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy and r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices labels on Jul 10, 2023