Agents should balance streams across all available SPIRE servers #728

Closed
2 of 3 tasks
ZymoticB opened this issue Feb 12, 2019 · 3 comments

Comments

@ZymoticB
Contributor

ZymoticB commented Feb 12, 2019

When running SPIRE server in an HA configuration, agents should be balanced across the available servers. Two things make this work: having a balancing strategy, and ensuring that strategy is reapplied periodically, especially after a server failure.

Imagine this scenario: you have 10k agents connected to 2 servers. You have deployed a perfect balancing strategy, so the agents are evenly balanced {5000, 5000}. Now one of the servers fails, and all the agent streams move to the server that is still available {10000, 0}. The failed server recovers, but the streams will not rebalance unless the agents are restarted, so you are stuck in this {10000, 0} configuration.

Consider another scenario: your agents are perfectly balanced {5000, 5000}. Unfortunately, you planned badly, and a single SPIRE server can only actually handle 7500 agents; if one of your servers fails, you're going to have a bad time! You add a new server, and your agents are now distributed {5000, 5000, 0}. You need to restart all the agents in your fleet to rebalance the streams.
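
To make the two scenarios concrete, here is a toy simulation of "sticky" streams. It is only a sketch: the helper names (`assign_evenly`, `fail_server`, `load`) are made up for illustration and do not correspond to anything in SPIRE. The assumption is that a stream only moves when its server fails.

```python
from collections import Counter


def assign_evenly(agent_count, servers):
    """Ideal starting point: spread agents round-robin over the servers."""
    return {agent: servers[agent % len(servers)] for agent in range(agent_count)}


def fail_server(assignment, failed, survivors):
    """Streams on the failed server reconnect to a survivor; others stay put."""
    moved = {}
    for agent, server in assignment.items():
        if server == failed:
            moved[agent] = survivors[agent % len(survivors)]
        else:
            moved[agent] = server
    return moved


def load(assignment, servers):
    """Number of streams per server, in the order given by `servers`."""
    counts = Counter(assignment.values())
    return [counts.get(s, 0) for s in servers]


# Scenario 1: server "b" fails and later recovers.
streams = assign_evenly(10_000, ["a", "b"])
streams = fail_server(streams, failed="b", survivors=["a"])
print(load(streams, ["a", "b"]))       # [10000, 0], even after "b" is back

# Scenario 2: a third server is added to a balanced pair.
streams = assign_evenly(10_000, ["a", "b"])
print(load(streams, ["a", "b", "c"]))  # [5000, 5000, 0]
```

In both cases nothing ever reassigns an existing stream, which is why the distribution stays skewed until the agents reconnect.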

A complete solution here likely involves at least the following check-boxes:

  • Periodically resolve DNS to ensure dynamic membership of SPIRE servers.
  • A "reasonably good" balancing strategy. From experience, a randomized response from a DNS resolver for an A record isn't enough. http://www.eecs.umich.edu/techreports/cse/96/CSE-TR-316-96.pdf covers some options; the HRW (highest random weight) option is likely appropriate, as it doesn't require collecting any information or sharing any state.
  • Agents must periodically re-establish their streams to ensure they are balanced across an ever-changing group of servers.
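
For reference, a minimal sketch of what HRW selection could look like on the agent side. This is an illustration, not SPIRE code: `hrw_score` and `pick_server` are hypothetical names, and using SHA-256 over an `agent|server` string as the weight function is an assumption; any well-mixed hash would do.

```python
import hashlib
from typing import Sequence


def hrw_score(agent_id: str, server: str) -> int:
    """Hash the (agent, server) pair into a 64-bit integer weight."""
    digest = hashlib.sha256(f"{agent_id}|{server}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_server(agent_id: str, servers: Sequence[str]) -> str:
    """Return the server with the highest weight for this agent."""
    if not servers:
        raise ValueError("no servers available")
    return max(servers, key=lambda s: hrw_score(agent_id, s))
```

Each agent computes its choice locally from its own ID and the current server list, so no state is shared. When a server disappears, only the agents that had picked it move; when a server is added, only the agents for which the new server scores highest move to it.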
@evan2645
Member

@ZymoticB there have been a number of changes around connection management, the choice of gRPC client balancing algorithm, etc. since this issue was opened. I believe the first two boxes here can be checked (as they're handled by the gRPC libs), but I am not sure whether the third is also managed for us. Any idea if this is still an issue?

@ZymoticB
Contributor Author

ZymoticB commented Sep 17, 2019

  1. was my mistake: the agent RPCs are all unary, so that's not an issue.
  2. is also already in a reasonable place: because the RPCs are unary, a client-side round-robin load balancer does a good enough job.
  3. is still somewhat of an issue, because gRPC only re-resolves every 30 minutes, which is a terrible default.

Should probably open a new issue specific to 1
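
As a rough illustration of the re-resolution concern, the sketch below re-resolves a server name and reports membership changes. It is not how gRPC or SPIRE implements this: the `Membership` class is hypothetical, the hostname and port in the usage note are placeholders, and the resolver is injectable only so the logic can be exercised without real DNS.

```python
import socket


def resolve(host, port, resolver=socket.getaddrinfo):
    """Return the sorted set of IP addresses currently behind `host`."""
    infos = resolver(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


class Membership:
    """Tracks the server set and reports changes between resolutions."""

    def __init__(self, host, port, resolver=socket.getaddrinfo):
        self.host = host
        self.port = port
        self.resolver = resolver
        self.servers = []

    def refresh(self):
        """Re-resolve; return (added, removed) relative to the last call."""
        current = resolve(self.host, self.port, self.resolver)
        added = [s for s in current if s not in self.servers]
        removed = [s for s in self.servers if s not in current]
        self.servers = current
        return added, removed
```

An agent could call something like `Membership("spire-server.example.org", 8081).refresh()` on a timer much shorter than 30 minutes and feed the resulting server list to its round-robin balancer.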

@evan2645
Member

I opened #1192 to address the DNS resolution interval. Going to close this out in favor of that one; please let me know if I've missed anything.
