Cache NodeInfo outside of healthcheck service #3767
Comments
So then SM would still need to query the agent for node info (which would already be cached and wouldn't require interaction with Scylla), or did you mean that the background task and caching happen on the SM side?
Simpler solution would be to just change healthchecks
I mean to have a background task in SM and the cache on the SM side. The healthcheck service (though other services should eventually target the same cache) is supposed to hit the already cached (newest) object.
Another issue to address:
The purpose of the health check service is simple: to report whether:
Scylla Manager may report false positives for the CQL ping and the Alternator ping when the agent's API is unresponsive or overloaded and does not respond within the expected time. This happens because both pings first retrieve basic node information from the agent's node_info endpoint, which concatenates configuration-related responses from the Scylla API and returns them to the caller.
NodeInfo is necessary to build the client properly: it contains the encryption settings and ports needed to establish the session. However, this setup means the health check service not only verifies the CQL session but also assumes that the agent's API is fully responsive. If the API does not respond at the expected level, the health check produces false positives regarding the ability to create the CQL session and query the data.
The logic must be changed, and the health check service MUST be decoupled from the agent completely. To achieve this, Scylla Manager is expected to start a background goroutine that periodically fetches Scylla's config and updates a cache kept on the Scylla Manager side. The health check service is expected to hold a reference to the cache and read the latest Scylla config from there without calling the agent's API directly. If the API is unresponsive, the cache may be outdated for some time, but a node's configuration changes very rarely.
The goal:
The health check should not call the agent's API at all.
Another service working in a separate goroutine is responsible for updating the cache.
The cache never expires; it is periodically updated by the ConfigCacheUpdater service.
The health check simply accesses the cache every time it needs Scylla config.
cc: @d-helios @gmizrahi @mykaul