How to do HA catalog-sync #58
K8s should handle this if you run it as a "Deployment", but only when the executable inside the pod crashes or stops. If it keeps running but simply stops syncing, that will still cause an issue.
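For illustration, here is a minimal Go sketch (not consul-k8s code; the 30-second sync interval, 2-minute staleness threshold, and `markSynced` helper are assumptions) of a `/healthz` endpoint that goes unhealthy when the sync loop stalls, so a Kubernetes livenessProbe could restart a pod that is running but no longer syncing:

```go
// Hypothetical sketch: fail the health check when no sync pass has
// completed recently, so a livenessProbe can catch a stalled-but-running pod.
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var lastSyncUnix atomic.Int64 // updated by the sync loop after each successful pass

func markSynced() { lastSyncUnix.Store(time.Now().Unix()) }

func healthz(w http.ResponseWriter, r *http.Request) {
	const staleAfter = 2 * time.Minute // assumed threshold
	last := time.Unix(lastSyncUnix.Load(), 0)
	if time.Since(last) > staleAfter {
		http.Error(w, fmt.Sprintf("last sync %s ago", time.Since(last)), http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	markSynced() // pretend an initial sync just happened
	go func() {
		for range time.Tick(30 * time.Second) {
			// ... run one catalog sync pass here ...
			markSynced()
		}
	}()
	http.HandleFunc("/healthz", healthz)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With something like this, a Deployment restart covers the "stays running but stops syncing" case as well, not just crashes.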
After a lot of investigation it seems that, fundamentally, our needs won't be met by consul-k8s (a combination of this and #57) -- which led me to create katalog-sync; the highlights are:
IMO these questions should still be answered by consul-k8s, but some of them (specifically the failure impact and readiness checks during pod startup) don't seem tractable with the current design (although that design does allow for syncing cluster-wide resources such as services). So it seems like maybe a combination is the best way to go?
Agreed, there seems to be key functionality missing from consul-k8s.
Is there any best practice or progress on how to run consul-k8s in HA mode? We have been trying to solve an issue where rescheduling to another node takes a long time (5:40 min with default K8s component values) when the node running the consul-k8s pod goes down. Even after tuning the K8s component values it still takes much longer (~30 s) than is acceptable.
Hi, which consul-k8s pod got rescheduled? Are you using service mesh or catalog sync? |
Hi @lkysow, we are using catalog sync without service mesh, deployed via consul-helm (v0.8.1), in one-way K8s > Consul sync mode.
Do you know why the rescheduling of the catalog sync pod took 5:40? |
It is the standard behaviour of K8s with the default controller-manager values for node-monitor-grace-period (40 s) and pod-eviction-timeout (5 min), which together account for the ~5:40 we observed; details are in an article on Medium. Is there any danger in running more than one replica of the catalog sync pod?
Ahh I see, sorry I missed that it was because the node died.
Currently that is not supported, unfortunately, but it looks like we need to fix this so you can run more than one replica and perhaps use some sort of leader election to swap between them.
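For context, a minimal sketch of what that could look like using client-go's Lease-based leader election (the lease name, namespace, timings, and `runCatalogSync` are assumptions, not consul-k8s code): only the elected leader runs the sync loop, and a standby replica takes over when the lease expires.

```go
// Hypothetical sketch: standby replicas contend for a Lease; the winner syncs.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "catalog-sync-leader", Namespace: "consul"}, // assumed names
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				runCatalogSync(ctx) // hypothetical hook into the existing sync loop
			},
			OnStoppedLeading: func() { os.Exit(0) }, // stop syncing if leadership is lost
		},
	})
}

func runCatalogSync(ctx context.Context) { <-ctx.Done() } // placeholder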
It would be great if it were supported. Meanwhile I ran some tests with multiple replicas on our dev cluster and it looks like it works as expected.
Note: #479 will make the service mesh injector HA, but it won't address catalog sync HA.
Hi there, Consul K8s PM here. I'm going to close this issue since at this time we likely won't be making changes to support running catalog sync on multiple replicas. I have to acknowledge that it's been a long time since this issue was filed. Since then, our priorities have shifted towards building a robust service mesh which enables service discovery on K8s. If you have a PR that you would like us to review that enables multiple replicas, please go ahead and file it and we can review. Thank you.
Hey @david-yu, is there any reason why consul-k8s does not support leader election as of now? It'd be great to learn some background. I can file a proposal for this if that sounds reasonable, then a follow-up PR to add leader election, since I have previous experience adding this to some open-source projects. Our use-case is to run
Hi @Dentrax, if you're interested in working on a PR, could you file a new issue with a proposal for how you'd like to see this problem solved? We would probably like to review it before you go down this path. As said previously, this is something that probably involves even deeper changes to catalog sync than just deploying multiple replicas.
While looking into using consul-k8s for my cluster, I was unable to find any docs/code/comments/issues regarding an HA setup for consul-k8s. In addition to the concerns in #57 re: liveness/readiness checks -- how do I have more than one of these running?
I basically have two HA concerns: (1) failure duration and (2) failure impact.
Failure duration
Assuming #57 (liveness/readiness checks) is resolved, we could in theory have more than one pod running behind a lock (presumably in Consul) so that when the first pod fails the second can kick in pretty quickly. This would help reduce the time during which there is no working consul-k8s in the cluster.
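A rough sketch of that lock-in-Consul idea using the `github.com/hashicorp/consul/api` client (the KV key and `runCatalogSync` placeholder are assumptions): every replica contends on the same lock, and whichever pod acquires it runs the sync loop while the others block as hot standbys.

```go
// Hypothetical sketch: replicas contend on a Consul lock; the holder syncs.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// All replicas contend on the same KV key; only one holds it at a time.
	lock, err := client.LockKey("service/consul-k8s/catalog-sync/leader") // assumed key
	if err != nil {
		log.Fatal(err)
	}

	lostCh, err := lock.Lock(nil) // blocks until this replica becomes the holder
	if err != nil {
		log.Fatal(err)
	}
	defer lock.Unlock()

	log.Println("acquired lock, starting catalog sync")
	runCatalogSync(lostCh) // stop syncing when lostCh is closed
}

func runCatalogSync(stop <-chan struct{}) { <-stop } // placeholder
```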
Failure impact
As it stands today, consul-k8s is a single worker that syncs the entire state of Consul/K8s. I'm interested in ways of reducing the amount of work it needs to do, specifically to reduce the impact of a failure. As an extreme example (which I realize doesn't exactly work given the current design, but it illustrates the point): if consul-k8s ran on each node in a K8s cluster and was responsible only for syncing things local to that node (as I said before, this doesn't quite work for some things like LBs etc.), then the failure of a single process would only affect the sync state of the pods on that node.
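A rough sketch of that per-node idea (not how consul-k8s works today; the `NODE_NAME` environment variable injected via the downward API is an assumption), where each agent lists only the pods scheduled on its own node:

```go
// Hypothetical sketch: a DaemonSet-style agent that only sees its own node's pods.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := os.Getenv("NODE_NAME") // assumed to be injected via the downward API

	// Restrict the list to pods scheduled on this node only.
	pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("would sync %s/%s into the Consul catalog\n", p.Namespace, p.Name)
	}
}
```

The trade-off is exactly the one noted above: node-local agents can't easily own cluster-wide resources such as LB-type services, so some central component would still be needed for those.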