Agents should balance streams across all available SPIRE servers #728

ZymoticB · 2019-02-12T00:28:30Z

When running SPIRE Server in an HA configuration, agents should be balanced across the available servers. Two things are needed to make this work: a balancing strategy, and a way to ensure that strategy is reapplied periodically, especially after a server failure.

Imagine this scenario: you have 10k agents connected to 2 servers. You have deployed a perfect balancing strategy, so the agents are evenly balanced {5000, 5000}. Now one of the servers fails, and all of the agent streams move to the server that is still available {10000, 0}. The failed server recovers, but the streams will not rebalance unless the agents are restarted, so you are stuck in this {10000, 0} configuration.

Consider another scenario: your agents are perfectly balanced {5000, 5000}. Unfortunately, you planned badly, and a single SPIRE server can only actually handle 7500 agents; if one of your servers fails you're going to have a bad time! You add a new server, and your agents are now distributed as {5000, 5000, 0}. You need to restart all the agents in your fleet to rebalance the streams.
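The arithmetic behind the second scenario can be sketched as below. This is only an illustration: `survivesOneFailure` is a hypothetical helper, and it assumes the surviving servers share the load evenly after a failure, which is exactly the behaviour this issue asks for and not what happens today.

```go
package main

import "fmt"

// survivesOneFailure reports whether the remaining servers could absorb all
// agents after any single server fails. It assumes (hypothetically) that the
// agents spread evenly over the survivors.
func survivesOneFailure(agents, servers, perServerCapacity int) bool {
	if servers < 2 {
		return false // with a single server there is nothing to fail over to
	}
	return agents <= (servers-1)*perServerCapacity
}

func main() {
	// Numbers taken from the scenario above: 10k agents, 7500 agents per server.
	fmt.Println(survivesOneFailure(10000, 2, 7500)) // false: one survivor holds at most 7500
	fmt.Println(survivesOneFailure(10000, 3, 7500)) // true: two survivors hold up to 15000
}
```

Note that the third server only helps if streams actually move onto it; in the {5000, 5000, 0} state a failure can still push one server past its limit.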

A complete solution here likely involves at least the following check-boxes:

1. Periodically resolve DNS to pick up changes in SPIRE server membership.
2. A "reasonably good" balancing strategy. From experience, a randomized response from a DNS resolver for an A record isn't enough. http://www.eecs.umich.edu/techreports/cse/96/CSE-TR-316-96.pdf covers some options; the HRW (highest random weight) option is likely appropriate, as it doesn't require collecting any information or sharing any state.
3. Agents must periodically re-establish their streams to ensure they are balanced across an ever-changing group of servers.
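To make the HRW idea concrete, here is a minimal sketch of highest-random-weight (rendezvous) selection. It is not SPIRE code: `hrwScore`, `pickServer`, the agent IDs, and the server names are all hypothetical, and FNV-1a is just a convenient standard-library hash, not a recommendation. Each agent hashes its own ID together with each server name and picks the server with the highest score, so no state needs to be shared.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hrwScore hashes the (agent, server) pair into a pseudo-random weight.
func hrwScore(agentID, server string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(agentID))
	h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(server))
	return h.Sum64()
}

// pickServer returns the server with the highest weight for this agent,
// or "" if the server list is empty.
func pickServer(agentID string, servers []string) string {
	best := ""
	var bestScore uint64
	for i, s := range servers {
		score := hrwScore(agentID, s)
		if i == 0 || score > bestScore {
			best, bestScore = s, score
		}
	}
	return best
}

func main() {
	// Hypothetical server names, for illustration only.
	servers := []string{"spire-1.example.org", "spire-2.example.org", "spire-3.example.org"}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickServer(fmt.Sprintf("agent-%d", i), servers)]++
	}
	fmt.Println(counts) // how evenly this spreads depends on the hash function
}
```

A useful property of this scheme is that when a server is added or removed, only the agents whose best-scoring server changed need to move; everyone else keeps their current choice.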

evan2645 · 2019-09-17T21:31:32Z

@ZymoticB there have been a number of changes since this issue was opened around connection management, the choice of gRPC client-side balancing algorithm, etc. I believe that the first two boxes here can be checked (as they're handled by the gRPC libs), but I am not sure if the third is also managed for us. Any idea if this is still an issue?

ZymoticB · 2019-09-17T21:42:53Z

3 was my mistake: the agent RPCs are all unary, so that's not an issue.
2 is also already in a reasonable place: because the RPCs are unary, a client-side round-robin load balancer does a good enough job.
1 is still somewhat of an issue, because gRPC only re-resolves every 30 minutes, which is a terrible default.

Should probably open a new issue specific to 1
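For illustration, periodic re-resolution can be sketched with the standard library alone. This is not how gRPC or SPIRE implements it: `watch`, `sameMembers`, `lookupFunc`, the hostname, and the intervals are hypothetical. Only `net.DefaultResolver.LookupHost` is a real standard-library call.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sort"
	"time"
)

// lookupFunc has the same shape as (*net.Resolver).LookupHost, so either the
// real resolver or a fake can be plugged in.
type lookupFunc func(ctx context.Context, host string) ([]string, error)

// sameMembers reports whether two address lists hold the same addresses,
// ignoring order.
func sameMembers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// watch re-resolves host every interval until ctx is done and calls onChange
// whenever the resolved set differs from the last one seen. On a lookup error
// it keeps the last known set.
func watch(ctx context.Context, host string, interval time.Duration, lookup lookupFunc, onChange func([]string)) {
	var last []string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if addrs, err := lookup(ctx, host); err == nil && !sameMembers(last, addrs) {
			last = addrs
			onChange(addrs)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	// Short, made-up durations so the demo exits quickly; a real interval
	// would be a tuning decision.
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()
	watch(ctx, "localhost", 100*time.Millisecond, net.DefaultResolver.LookupHost, func(addrs []string) {
		fmt.Println("server set changed:", addrs)
	})
}
```

The point of the sketch is only that the refresh interval is an explicit parameter rather than a fixed 30 minutes.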

evan2645 · 2019-10-16T19:45:16Z

I opened #1192 to address the DNS resolution interval. Going to close this out in favor of that one; please let me know if I've missed anything.

esweiss added performance labels Feb 20, 2019

evan2645 mentioned this issue Mar 15, 2019

Releasing connection when there are errors on responses #795

Merged

3 tasks

evan2645 closed this as completed Oct 16, 2019

KenGuan666 mentioned this issue Nov 28, 2023

Conserve server resources by making agent only connect to one server at a time #4696

Open
