Healthcheck via /healthz endpoint is write-intensive operation for storage backend #1386
Comments
What would you propose there? If the health check doesn't use storage, you can no longer assume that "passing health check" actually means "working system". 🤔
Yes, I understand that, and that is why I don't use /static for the health check, for example. There is an issue in etcd to implement connection status methods, but they pinned it to the 3.5 milestone. For now, for etcd the check can be implemented with reads instead of writes, with less impact on the storage backend. I also believe a separate health method per storage would be better than a generic one, since different storage drivers have different connection status monitoring capabilities.
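For illustration only, a minimal Go sketch of that read-only idea, assuming a hypothetical per-driver `HealthChecker` interface (this is not Dex's actual storage API, and the etcd client import path depends on the client version in use):

```go
package storagehealth

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3" // import path varies by etcd client version
)

// HealthChecker is a hypothetical optional interface a storage driver could
// implement, so /healthz can avoid the generic create/delete round trip.
type HealthChecker interface {
	CheckHealth(ctx context.Context) error
}

type etcdHealth struct {
	client *clientv3.Client
}

// CheckHealth issues a single read instead of a write: it still proves
// connectivity to the cluster, but does not append anything to the raft log
// the way CreateAuthRequest/DeleteAuthRequest do.
func (e *etcdHealth) CheckHealth(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	_, err := e.client.Get(ctx, "health")
	return err
}
```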
💭 It just occurred to me -- how is the health check, which kind of simulates a login, different from an actual login? Don't you have issues with multiple concurrent login attempts, too, then? Or would you assume that they never happen in the same bursts as your health checks? Also, can this be mitigated by running the health checks less often than every second? That strikes me as a little on the paranoid end of the scale. Maybe you'd be better served if the clients could cope with a failure, than having some LB make believe there are never any failures 😄
Yeah, I suppose that's what I was proposing here 😉
It looks like continuous load testing. In our case the total load will be the sum of health checks + users. I see two problems here:
You can't scale etcd in the process to accommodate the extra needs arising from more dex replicas?
That's right, and an important point for other storage backends, too. There's been a bit of a discussion here: #1292. I'm happy to review any small additions that would unblock you there -- just put them forward. I can't promise anything, but a less-write-intensive healthz endpoint, or a query parameter to disable half the health check -- if it's helping you, let's do it. 😃
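The query-parameter variant could look roughly like the sketch below; the `skip_storage` parameter name and the handler wiring are made up for illustration and are not Dex's actual /healthz behavior:

```go
package server

import (
	"context"
	"net/http"
)

// healthzHandler sketches the query-parameter idea: a hypothetical
// ?skip_storage=true answers from process liveness alone, while the default
// path still exercises the storage backend.
func healthzHandler(checkStorage func(context.Context) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("skip_storage") == "true" {
			w.Write([]byte("ok (storage not checked)"))
			return
		}
		if err := checkStorage(r.Context()); err != nil {
			http.Error(w, "storage check failed: "+err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write([]byte("ok"))
	}
}
```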
Maybe instead of querying the backend every time someone hits /healthz, we could query the backend every 30s, then use that cached result to respond to /healthz?
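A minimal sketch of that caching idea, assuming a 30-second refresh goroutine and an in-memory result; this is illustrative only, not the code that eventually landed in #1397:

```go
package server

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// cachedHealth runs the expensive storage check on a fixed interval and
// serves the last result from memory, so load balancer probes no longer
// translate one-to-one into storage writes.
type cachedHealth struct {
	mu   sync.RWMutex
	err  error
	last time.Time
}

// run refreshes the cached result every 30s until ctx is cancelled.
func (c *cachedHealth) run(ctx context.Context, check func(context.Context) error) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		err := check(ctx)
		c.mu.Lock()
		c.err, c.last = err, time.Now()
		c.mu.Unlock()

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// ServeHTTP answers /healthz from the cached result only.
func (c *cachedHealth) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c.mu.RLock()
	err, last := c.err, c.last
	c.mu.RUnlock()

	if last.IsZero() || err != nil {
		http.Error(w, "health check failed", http.StatusInternalServerError)
		return
	}
	fmt.Fprintf(w, "ok (last checked %s ago)", time.Since(last).Round(time.Second))
}
```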
sent #1397
Thanks
Hi there.
Our dex installation: 3 replicas in each of 3 k8s clusters (9 replicas overall), with an etcd cluster as shared storage.
Each replica is health-checked by its k8s deployment via the /healthz endpoint.
We also have a few out-of-k8s load balancers that check every k8s service on every k8s node using the same endpoint.
I noticed high memory and disk usage on the etcd nodes and, after some investigation, found that the root cause was dex health checks. For every health check dex makes 2 write operations: CreateAuthRequest and DeleteAuthRequest. For example, 1 health check per second causes ~4 append operations on etcd and 12-14 disk write operations on every etcd node. Since we have 9 dex replicas and a few external load balancers, the health check stream simply kills the etcd cluster.
I understand that we have an uncommon architecture, and we will review and change the health check policy for dex in our case. But I think the dex health check should be more storage-friendly.