Lightweight consul agent #9534

ltagliamonte-dd · 2021-01-08T19:46:21Z

Nowadays, especially in Kubernetes-based infrastructures, there are some features that Consul agents/servers implement that imho are just a scalability burden in big Consul installations (2-3k nodes and above).

Specifically, it would be nice to have a consul-agent that just offers the HTTP/gRPC API cache and the DNS interface and drops everything around them:

memberlist
gossip protocol
checks
Network coordinates

Basically, have an agent that acts just like a local DB that tools like service mesh/discovery can leverage to read from.
I'd love to start a discussion with the team here to understand whether something like a lightweight agent could even be implemented, or whether adding feature-gate flags to the existing agent is a possible alternative.

jsosulska · 2021-01-08T22:51:33Z

Hi @ltagliamonte-dd,

This is something we've been talking about on the team for quite a while now. I'd like to address some of your points, as there's a lot to unpack.

> Basically have an agent that acts just like a local DB that tools like service mesh/discovery can leverage to read from.

Before diving too far into this, I want to ask: it sounds like your motivation here is to make Consul agents better able to run inside each pod. Is that right?

> Nowadays especially in Kubernetes based infrastructures there are some features that the consul agents/servers implements that imho are just a scalability burden in big consul installation (2-3k nodes and above).

I think the issues here are less about "scale" than about pod ephemerality or churn. A lot of the things you mentioned removing are powered by Serf, which we've found to be highly scalable across many large deployments. We have had users running 5k+ nodes in a Consul DC in prod for years with the current architectural design, and not that many folks run more than 5-10k pods in a cluster. That said, Kubernetes brings its own challenges over VMs, and we're open to assessing other, lighter-weight solutions as we go forward!

So we don't recommend running the Consul agent as a sidecar for every pod due to churn. 5k node clusters are fine when they are mostly unchanging, but 5k pods can cause stability issues, like during deployments or rolling outages. Today we see this as an issue with very short-lived containers causing constant gossip churn.

That's why our Helm chart installs the Consul agent as a DaemonSet. We do know of a few pretty large users who are successfully running Consul agents inside pods at scale, so it's possible, but it's not what we suggest.

The DaemonSet pattern has a whole host of issues though and we certainly do want to move to a more lightweight architecture for Kubernetes in the near future. The exact design of that is something we're thinking about right now.

Your proposal here is one option we are considering, but it may not be the best next step considering some of the other closely related issues we also need to solve. For example, in Consul today, agents are the source of truth for service registrations. Some things you can only configure if you can talk directly to the agent, rather than through a central API like Kube operators are used to. That said, other options can potentially provide an even cleaner and lighter solution.

We're considering how to address all of these problems, and in what sequence to iterate, to provide the most value and solve as many problems as possible. If you have any more context on scaling issues you've seen, or on how you imagine this could work, we'd love to hear it!

Looking forward to hearing back from you on this!

ltagliamonte-dd · 2021-01-09T02:02:56Z

@jsosulska thank you for getting back to me.
I absolutely agree there is a lot to unpack.

Today I run Consul as a DaemonSet, and my environment is very dynamic: all my k8s clusters (several) are autoscaling, and we make a lot of deploys a day (microservice owners can deploy at will). As you noticed, this dynamism puts stress on the Consul infra, both clients and servers.

Today I use the Consul DNS interface to power service discovery among my Kubernetes clusters (we use the AWS CNI, so our network is flat and we run multiple clusters as if they were a single big one).
In the near future I'd like to leverage the local agent data for my service mesh as well, so all I need is the fresh data each local agent has; all the protocols mentioned in my previous comment are just entropy vectors.

I believe that dropping part of this entropy would help in scaling consul solutions even further, and they aren't really features I would use in a kubernetes environment.

> Agents are the source of truth for service registration

This doesn't apply in Kubernetes: sync-catalog takes care of registering and deregistering services.

I'm really happy that the group is active on this front, and I'd like to help the project better support larger scale and different use cases like mine.

ltagliamonte-dd · 2021-01-12T16:58:02Z

Hello @jsosulska, any updates to share from your internal discussions with the team?
I'm highly motivated on this project, because I think it could be beneficial for everyone who runs dynamic environments like mine.

ltagliamonte-dd · 2021-02-11T16:58:46Z

@jsosulska I'd like to hear from you if you have any news.
The whole point of this thread is to understand whether HC is interested in this feature in the Consul agent and whether we can start a collaboration; otherwise we will just start a new project ourselves.

blake · 2021-07-20T19:48:15Z

Hi @ltagliamonte-dd, we are definitely interested in supporting a lightweight / no agent deployment model. I'd be happy to chat with you further to better understand your requirements.

On a related note, we recently discovered the disable_coordinates option was missing from the agent configuration docs. This option allows you to disable sending of network coordinate info, as you're looking to do in your environment. The docs were updated in PR #10617, and have been published to the website.
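
As a rough illustration of the option mentioned above, here is a minimal sketch that renders an agent config fragment with `disable_coordinates` set. The helper name `agent_config_fragment` is hypothetical, and the fragment contains only this one key; a real agent configuration would carry many more settings.

```python
import json


def agent_config_fragment(disable_coordinates: bool = True) -> str:
    """Render a one-key Consul agent config fragment as JSON.

    Hypothetical helper for illustration only; the single key used here
    is the `disable_coordinates` agent option discussed in this thread.
    """
    fragment = {
        # When true, the agent stops sending network coordinate updates.
        "disable_coordinates": disable_coordinates,
    }
    return json.dumps(fragment, indent=2)


if __name__ == "__main__":
    # Print the fragment so it can be redirected into a .json file.
    print(agent_config_fragment())
```

The output could be saved as a `.json` file in the directory passed to the agent with `-config-dir`, alongside the rest of the configuration.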

ltagliamonte-dd · 2021-07-20T20:16:25Z

Hello @blake, thank you for the follow-up, and thank you for pointing me to the disable_coordinates option as well.
As I said, my use case is broader: I basically use the Consul agent as a local DNS cache behind CoreDNS. I don't have checks, and the service IPs aren't registered on real nodes (like the service sync does), so I don't use the near feature.

I run several k8s clusters in a flat network, and I use the consul domain "to merge" all the clusters into a single addressable domain.

For scalability I'd like to turn off everything in the agents that isn't the local service cache.
I know that you have very big Consul installations, but my environment is dynamic: nodes come and go, and services are deployed or autoscaled all the time.
Because of the entropy of the k8s system, there is always something going on in the Consul subsystem.
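
To make the "single addressable domain" idea concrete, here is a small sketch of how a service name maps onto the Consul DNS interface, assuming the standard lookup format `[tag.]<service>.service[.datacenter].<domain>` and the default `consul` domain. The function name and the example values (`web`, `dc1`, `v1`) are hypothetical and not taken from this setup.

```python
from typing import Optional


def consul_service_fqdn(
    service: str,
    datacenter: Optional[str] = None,
    tag: Optional[str] = None,
    domain: str = "consul",
) -> str:
    """Build a Consul DNS name for a standard service lookup.

    Follows the [tag.]<service>.service[.datacenter].<domain> layout.
    """
    labels = []
    if tag:
        labels.append(tag)           # optional tag filter comes first
    labels += [service, "service"]   # service name plus the fixed "service" label
    if datacenter:
        labels.append(datacenter)    # optional datacenter qualifier
    labels.append(domain)            # DNS domain served by the agent
    return ".".join(labels)


if __name__ == "__main__":
    # Example names for a hypothetical "web" service.
    print(consul_service_fqdn("web"))
    print(consul_service_fqdn("web", datacenter="dc1"))
```

Any pod in any cluster that resolves such a name through its local agent gets the same answer, which is what lets several clusters behave like one for discovery purposes.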

jsosulska added the type/question (Not an "enhancement" or "bug". Please post on discuss.hashicorp), needs-discussion (Topic needs discussion with the larger Consul maintainers before committing to for a release), and type/enhancement (Proposed improvement or new feature) labels · Jan 8, 2021
