Seed Agent Cluster Auto-Configuration #403

Closed
CMCDragonkai opened this issue Jul 9, 2022 · 3 comments
Labels
development (Standard development), r&d:polykey:core activity 3 (Peer to Peer Federated Hierarchy), r&d:polykey:core activity 4 (End to End Networking behind Consumer NAT Devices)

Comments


CMCDragonkai commented Jul 9, 2022

Specification

The seed node cluster behind mainnet.polykey.io and testnet.polykey.io requires some auto-configuration so that the seed nodes can gain knowledge of each other and share their DHT workload, which includes signalling and relaying.

Currently seed nodes are launched without knowledge of any other seed nodes. This makes sense, since they are the first seed nodes. However, as we scale the number of seed nodes, it would make sense for seed nodes to automatically discover each other and establish connections. This would make it easier to launch clusters of seed nodes.

There are several challenges here and questions we must work out:

  • Does it mean it is possible to run multiple seed nodes with the same NodeID?
  • If we need to have multiple seed nodes, we must then pregenerate their root keys and preserve their recovery codes (see Testnet securely maintain a pool of recovery codes #285)
  • If multiple seed nodes have different NodeIDs, are their root keys connected to each other in a trust relationship (either hierarchically via PKI, or a loose mesh via the gestalt graph + root chain)?
    • How does this impact how this trust information is propagated eventually-consistently across the network?
    • How does this deal with attacks/impersonation/DHT poisoning/sybil...?
    • How does this deal with revocation?
    • What does this mean for our default seed node list that is configured in the PK software distribution?
  • If seed nodes are scaled up and down, how do they acquire their recovery keys securely and without conflict?
  • When seed nodes need to discover each other automatically, we have to use one of the auto-configuration networking technologies.
  • If the seed cluster is all behind 1 IP address/hostname (like our NLB), this means:
    • Multiple node ids - multiple host names - multiple IP addresses
    • Multiple node ids to 1 IP address
    • Multiple node ids to 1 host name
    • 1 hostname can resolve to multiple IP addresses (randomly too)
    • The same node id on multiple IP addresses and multiple host names
    • Testnet Deployment via CI/CD #396 (comment) - discussion on the multi-level complexity of AWS
  • Using a network load balancer means we need to preserve stickiness for "flows"; we must ensure that this doesn't break our network connections mid-flight and mid-conversation.
    • AWS sets a 120s timeout for a UDP flow; this is not configurable.
    • AWS load balances according to origin IP address, and maintains the stickiness for the lifetime of a flow
    • The stickiness must be preserved from the NLB to multiple listeners, from listener to multiple target groups, and from target group to multiple targets.
    • Testnet Deployment via CI/CD #396 (comment) - discussion about how stickiness works on NLB
  • Load balancing introduces network proxies. These network proxies must preserve the client IP address; otherwise, NAT-busting signalling will not work (see the sketch below).
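
To make the last point concrete, here is a minimal sketch (not Polykey's actual signalling code) of why the source address observed by a seed node matters; the registry, payload format, and port are assumptions for illustration only:

```ts
import dgram from 'node:dgram';

// Hypothetical registry of observed public addresses, keyed by node ID.
const observedAddresses = new Map<string, { host: string; port: number }>();

const socket = dgram.createSocket('udp4');

socket.on('message', (msg, rinfo) => {
  // `rinfo` is the source IP/port as seen by this seed node. If an NLB or
  // proxy rewrites the source address, what gets recorded here is the
  // proxy's address, and hole-punching signalling towards the real client
  // cannot work.
  const nodeId = msg.toString('utf8'); // assume the payload is just a node ID
  observedAddresses.set(nodeId, { host: rinfo.address, port: rinfo.port });
});

socket.bind(1314); // arbitrary port for the sketch
```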

Additional context

Tasks

  1. Research DNS load balancing as an alternative
  2. Work out how distributed PK with multiple nodes sharing the same IP address will work
  3. Answer every question above
CMCDragonkai (Member Author) commented:


We've removed the load balancer at this point, and are now using DNS load balancing instead.

This is because the load balancer implies a sort of distributed PK, while PK currently is designed to be decentralised but not distributed.

By distributed we mean that multiple nodes could share the same NodeId, and shard/replicate their data between the nodes so they act in unison. This is quite complex because we have lots of state that would need to be replicated or sharded. Plus our NodeGraph would need to be updated to use a more sophisticated and ambiguous key path, which would be NodeId/Host/Port -> {}.
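
As a rough illustration of that key-path change (type names here are placeholders, not the real NodeGraph types):

```ts
// Illustrative only; not the actual NodeGraph implementation.
type NodeId = string; // placeholder; Polykey's real NodeId is a typed identifier

// Today (roughly): one NodeId keys a single address entry.
type NodeGraphCurrent = Map<NodeId, { host: string; port: number }>;

// A distributed PK would need a composite NodeId/Host/Port key, since the
// same NodeId could be reachable at multiple addresses at once.
type NodeGraphDistributedKey = `${NodeId}/${string}/${number}`;
type NodeGraphDistributed = Map<NodeGraphDistributedKey, {}>;
```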

Since it's not a priority, the NLB has been removed and instead we focus on Cloudflare DNS load balancing.
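
From the client's side, DNS load balancing just means a single hostname can resolve to several A records. A minimal sketch of what resolution could look like; whether multiple records actually come back depends on the Cloudflare configuration at query time:

```ts
import { resolve4 } from 'node:dns/promises';

// A single hostname may return multiple A records, potentially in a
// rotated order, which spreads clients across the seed nodes.
const addresses = await resolve4('testnet.polykey.io');
for (const address of addresses) {
  console.log('candidate seed node address:', address);
}
```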

For this particular issue we can address the above questions:

  • Does it mean it is possible to run multiple seed nodes with the same NodeID?

    No.

  • If we need to have multiple seed nodes, we must then pregenerate their root keys and preserve their recovery codes (see Testnet securely maintain a pool of recovery codes #285)

    This is being done manually by pregenerating recovery codes and pulling them from AWS Secrets Manager. Later on, these secrets in the Secrets Manager will be pushed from a local PK node.

  • If multiple seed nodes have different NodeIDs, are their root keys connected to each other in a trust relationship (either hierarchically via PKI, or a loose mesh via the gestalt graph + root chain)?

    Each node will have its own NodeId. DNS load balancing will shard the distribution of signalling and relaying.
    The trusted seed node list will be placed into the source code (a sketch of what this could look like follows this list). There won't be any PKI nor gestalt links between the seed nodes at this point in time.
    This means seed nodes are not dynamically scaled at this point in time.

  • If seed nodes are scaled up and down, how do they acquire their recovery keys securely and without conflict?

    Done manually by creating a set of recovery codes ahead of time.
    This means seed nodes are not dynamically scaled at this point in time.

  • When seed nodes need to discover each other automatically, we have to use one of the auto-configuration networking technologies.

    Because the seed node list will be fixed, they will use each other's public IP address for now until we can have local discovery too.

  • If the seed cluster is all behind 1 IP address/hostname (like our NLB), this means:

    PK is not currently distributed. It's decentralised, so each seed node will have its own public IP address.

  • Using a network load balancer means we need to preserve stickiness for "flows"; we must ensure that this doesn't break our network connections mid-flight and mid-conversation.

    Not relevant anymore, since there is no longer an NLB in front.

  • Load balancing introduces network proxies. These network proxies must preserve the client IP address; otherwise, NAT-busting signalling will not work.

    Not relevant anymore, since there is no longer an NLB in front.
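
To make the "seed node list in source code" answer above concrete, here is a sketch of what such a hardcoded list could look like; the node ID, hostname, and port below are placeholders, not the actual trusted seed nodes:

```ts
// Sketch only: a fixed, source-controlled mapping from seed NodeId to address.
type SeedNodes = Record<string, { host: string; port: number }>;

const defaultSeedNodes: SeedNodes = {
  '<pregenerated-seed-node-id-1>': {
    host: 'seed-1.testnet.polykey.io', // placeholder hostname
    port: 1314, // placeholder port
  },
  // ...one entry per pregenerated seed NodeId, each with its own public IP/hostname
};
```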

CMCDragonkai (Member Author) commented:

The only auto-configuration being done now is:

  1. Testnet securely maintain a pool of recovery codes #285 - using AWS Secrets Manager to produce, ahead of time, a fixed set of Node IDs, recovery codes and passwords to be trusted.
  2. Infrastructure setup for testnet should automate multiple instances for multiple nodes #488 - creating multiple EC2 instances and orchestrating an instance per node (and this means there's 1 TD, 1 Service, 1 Task, 1 Instance, 1 EIP for every node)

That means there's not that much "automation" happening. Any more automation would require something to actually orchestrate the above mechanisms, and that will require further work with Pulumi, plus integration with AWS EventBridge and Lambdas to be able to "automate" AWS.
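
For reference, a minimal sketch of the pregenerated-secrets flow using the AWS SDK for JavaScript v3; the region, secret name, and JSON shape are assumptions, not the actual layout tracked in #285:

```ts
import {
  SecretsManagerClient,
  GetSecretValueCommand,
} from '@aws-sdk/client-secrets-manager';

// Fetch a pregenerated recovery code and password for one seed node.
const client = new SecretsManagerClient({ region: 'ap-southeast-2' });
const response = await client.send(
  new GetSecretValueCommand({ SecretId: 'polykey/seed-node-1/recovery-code' }),
);
const { recoveryCode, password } = JSON.parse(response.SecretString ?? '{}');
// The recovery code deterministically regenerates the node's root key pair,
// so a recreated seed node comes back with the same NodeId.
```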

CMCDragonkai (Member Author) commented:

In this case this can be closed, since all the questions have been answered, and we know what our immediate next steps are. @tegefaulkes

CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy and r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices labels on Jul 10, 2023