
Healthcheck via /healthz endpoint is write-intensive operation for storage backend #1386

Closed
zigmund opened this issue Jan 21, 2019 · 9 comments



zigmund commented Jan 21, 2019

Hi there.

Our dex installation is 3 replicas in each of 3 k8s clusters (9 replicas overall), with an etcd cluster as shared storage.
Each replica is health-checked by its k8s deployment via the /healthz endpoint.
We also have a few out-of-k8s load balancers that check every k8s service on every k8s node using the same endpoint.

I noticed high memory and disk usage on the etcd nodes, and after some investigation found that the root cause was the dex health checks. For every health check, dex makes two write operations: CreateAuthRequest and DeleteAuthRequest. For example, one health check per second causes roughly 4 append operations on etcd and 12-14 disk write operations on every etcd node. Since we have 9 dex replicas plus a few external load balancers, the health-check stream simply kills the etcd cluster.
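
(For illustration, here is a rough Go sketch of the write-then-delete pattern described above; the Storage interface, AuthRequest type, and method names are stand-ins for dex's storage layer, not its actual code.)

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// AuthRequest and Storage are illustrative stand-ins for dex's storage
// layer; the real interface differs.
type AuthRequest struct {
	ID     string
	Expiry time.Time
}

type Storage interface {
	CreateAuthRequest(ctx context.Context, a AuthRequest) error
	DeleteAuthRequest(ctx context.Context, id string) error
}

// checkStorageHealth shows why the probe is write-heavy: every single
// /healthz hit costs the backend one create and one delete.
func checkStorageHealth(ctx context.Context, s Storage) error {
	a := AuthRequest{
		ID:     fmt.Sprintf("health-%d", time.Now().UnixNano()),
		Expiry: time.Now().Add(time.Minute),
	}
	if err := s.CreateAuthRequest(ctx, a); err != nil {
		return fmt.Errorf("create auth request: %v", err)
	}
	if err := s.DeleteAuthRequest(ctx, a.ID); err != nil {
		return fmt.Errorf("delete auth request: %v", err)
	}
	return nil
}
```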

I understand that our architecture is uncommon; we will review and change the health-check policy for dex in our setup. But I think the dex health check should be more storage-friendly.

@srenatus
Contributor

> I understand that our architecture is uncommon; we will review and change the health-check policy for dex in our setup. But I think the dex health check should be more storage-friendly.

What would you propose there? Not using storage in a health check will not let you assume that "passing health check" actually means "working system". 🤔


zigmund commented Jan 22, 2019

Yes, I understand that. That is why I don't use /static for the health check, for example.

There is an issue in etcd to implement connection status methods, but they pinned its milestone to 3.5.

For now, for etcd the check could be implemented with reads instead of writes, with less impact on the storage backend.
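
(For illustration, a minimal sketch of such a read-only etcd probe using clientv3; the key name and import path are assumptions and may differ between etcd versions.)

```go
package sketch

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3" // import path varies by etcd version
)

// readOnlyHealthCheck probes etcd with a cheap read instead of a
// create/delete pair. The key does not need to exist; we only care that the
// request round-trips, so the probe itself writes nothing to the backend.
func readOnlyHealthCheck(ctx context.Context, cli *clientv3.Client) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	_, err := cli.Get(ctx, "health")
	return err
}
```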

Also, I believe a separate health method per storage would be better than a generic one, since different storage drivers may have different connection-status monitoring capabilities.
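
(A tiny sketch of what such a per-driver health hook could look like; this interface is a proposal sketch, not dex's existing API.)

```go
package sketch

import "context"

// HealthChecker sketches a per-driver health hook: each storage
// implementation decides what "healthy" means for it, e.g. a cheap read for
// etcd or a SELECT 1 for SQL backends.
type HealthChecker interface {
	Healthy(ctx context.Context) error
}
```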

@srenatus
Contributor

💭 It just occurred to me -- how is the health check, which kind of simulates a login, different from an actual login? Don't you have issues with multiple concurrent login attempts, too, then? Or would you assume that they never happen in the same bursts as your health checks?

Also, can this be mitigated by running the health checks less often than every second? That strikes me as a little on the paranoid end of the scale. Maybe you'd be better served by clients that can cope with a failure than by some LB pretending there are never any failures 😄

> we will review and change the health-check policy for dex in our setup. But I think the dex health check should be more storage-friendly.

Yeah, I suppose that's what I was proposing here 😉


zigmund commented Jan 22, 2019

> It just occurred to me -- how is the health check, which kind of simulates a login, different from an actual login? Don't you have issues with multiple concurrent login attempts, too, then? Or would you assume that they never happen in the same bursts as your health checks?

It looks like continuous load testing. In our case we get the sum of the load from health checks plus real users.

I see two problems here:

  1. Two writes per health check. Abstract example: a k8s deployment with a few replicas, health-checked every 5s. Scale it to hundreds of replicas and you kill etcd.
  2. In addition to the previous point, the health check is synchronous. Without a rate limiter in front of dex, pressing F5 is enough to kill etcd if it is still alive (see the sketch after this list).
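
(The sketch referenced in point 2: one hypothetical way to keep an F5 storm away from the backend by rate-limiting the /healthz handler with golang.org/x/time/rate. This wrapper is illustrative glue code, not part of dex.)

```go
package sketch

import (
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

// rateLimitedHealthz lets at most one real storage check through per second
// and answers 429 for the rest, so a burst of probes never reaches the
// storage backend.
func rateLimitedHealthz(check http.Handler) http.Handler {
	limiter := rate.NewLimiter(rate.Every(time.Second), 1)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "health check rate limited", http.StatusTooManyRequests)
			return
		}
		check.ServeHTTP(w, r)
	})
}
```

Returning the last known result instead of a 429 would be even friendlier to the probes; that is essentially the caching idea discussed further down.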

@srenatus
Contributor

> 1. Two writes per health check. Abstract example: a k8s deployment with a few replicas, health-checked every 5s. Scale it to hundreds of replicas and you kill etcd.

You can't scale etcd in the process, to accommodate the extra needs arising from more dex replicas?

> 2. In addition to the previous point, the health check is synchronous. Without a rate limiter in front of dex, pressing F5 is enough to kill etcd if it is still alive.

That's right, and an important point for other storage backends, too. There's been a bit of a discussion here: #1292.

I'm happy to review any small additions that would unblock you there -- just put them forward. I can't promise anything, but having a less-write-intensive healthz endpoint; or having a query parameter to disable half the health check -- if it's helping you, let's do it. 😃


srenatus commented Feb 4, 2019

Just stumbled upon an old issue that might be almost the same thing, at least when it comes to root causes: #1091

Also #853 seems like a more generic version of this issue.

@ericchiang
Contributor

Maybe instead of querying the backend every time someone hits /healthz, we could query the backend every 30s, then use that cached result to respond to /healthz?
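
(A rough sketch of that caching idea: a background goroutine refreshes the check on a timer and /healthz only serves the cached result. The names and structure below are assumptions, not taken from dex or the PR that followed.)

```go
package sketch

import (
	"context"
	"errors"
	"net/http"
	"sync"
	"time"
)

// cachedHealth polls the storage backend on a fixed interval and remembers
// the result; /healthz only ever reads the cached value, so probe frequency
// no longer translates into backend load.
type cachedHealth struct {
	mu  sync.RWMutex
	err error
}

func newCachedHealth() *cachedHealth {
	// Report unhealthy until the first real check has completed.
	return &cachedHealth{err: errors.New("health not checked yet")}
}

// run blocks, refreshing the cached result every interval until ctx is done.
func (c *cachedHealth) run(ctx context.Context, interval time.Duration, check func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		err := check(ctx)
		c.mu.Lock()
		c.err = err
		c.mu.Unlock()
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func (c *cachedHealth) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c.mu.RLock()
	err := c.err
	c.mu.RUnlock()
	if err != nil {
		http.Error(w, "storage unhealthy: "+err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))
}
```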

@ericchiang
Contributor

sent #1397


zigmund commented Feb 5, 2019

Thanks
