Cache NodeInfo outside of healthcheck service #3767

Closed
karol-kokoszka opened this issue Mar 26, 2024 · 4 comments · Fixed by #3816

@karol-kokoszka
Collaborator

karol-kokoszka commented Mar 26, 2024

The purpose of the health check service is simple. It reports whether:

  • The agent's API (REST ping) is reachable.
  • The alternator session (alternator ping) is reachable.
  • The CQL session (CQL ping) is reachable.

Scylla Manager may report false positives for the CQL ping and the alternator ping when the agent's API is unresponsive or overloaded and does not respond within the expected time. This happens because both pings first retrieve basic information about the nodes from the agent's node_info endpoint, which concatenates configuration-related responses from the Scylla API and returns them to the caller.

NodeInfo is necessary to build the client properly: it contains the encryption settings and ports needed to establish the session. However, this setup means the health check service not only verifies the CQL session but also assumes that the agent's API is fully responsive. If the API does not respond within the expected time, the health check produces false positives regarding the ability to create the CQL session and query the data.

          /--- agent's API to get Scylla's config (and cache it for short time)
Manager   ---- create CQL session with a single node and query simple data

The status of healthcheck_cql_state is set based on these two calls.
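The coupling described above can be illustrated with a minimal Go sketch. It is not Scylla Manager's actual code: `NodeInfo`, `cqlStatus`, `demoCoupledStatus`, the function types, the host address, and the "UP"/"DOWN" strings are all hypothetical names chosen for illustration. It only shows why a healthy CQL port ends up reported as down when the agent call fails first.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// NodeInfo is a hypothetical, trimmed-down stand-in for what node_info returns.
type NodeInfo struct {
	CQLPort    string
	TLSEnabled bool
}

// Hypothetical function types standing in for the agent call and the CQL ping.
type (
	fetchNodeInfoFunc func(ctx context.Context, host string) (NodeInfo, error)
	pingCQLFunc       func(ctx context.Context, host string, ni NodeInfo) error
)

// cqlStatus models the coupled flow: the CQL ping can only run after the
// agent has answered, so an agent failure is reported as a CQL failure.
func cqlStatus(ctx context.Context, host string, fetch fetchNodeInfoFunc, ping pingCQLFunc) string {
	ni, err := fetch(ctx, host)
	if err != nil {
		return "DOWN" // the CQL session was never even attempted
	}
	if err := ping(ctx, host, ni); err != nil {
		return "DOWN"
	}
	return "UP"
}

// demoCoupledStatus runs cqlStatus against a node whose CQL port is always
// healthy, with the agent either responding or timing out.
func demoCoupledStatus(agentResponds bool) string {
	fetch := func(ctx context.Context, host string) (NodeInfo, error) {
		if !agentResponds {
			return NodeInfo{}, errors.New("agent: request timed out")
		}
		return NodeInfo{CQLPort: "9042"}, nil
	}
	ping := func(ctx context.Context, host string, ni NodeInfo) error {
		return nil // CQL itself is reachable in both scenarios
	}
	return cqlStatus(context.Background(), "192.168.100.11", fetch, ping)
}

func main() {
	fmt.Println("agent responsive:  ", demoCoupledStatus(true))
	fmt.Println("agent unresponsive:", demoCoupledStatus(false))
}
```

In the second scenario the CQL ping would have succeeded, yet the status is still "DOWN" because the agent call comes first.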

The logic must be changed, and the health check service MUST be decoupled from the agent completely. To achieve this, the agent is expected to start a background goroutine that periodically checks Scylla's config and updates the cached config. The health check service is expected to maintain a reference to the cache and retrieve the latest Scylla config from there without directly interfering with the agent's API. If the API is unresponsive, then the cache may be outdated for some time, but situations where the configuration of a particular node changes are very rare.

The goal:

  • The health check should not call the agent's API at all.
  • Another service working in a separate goroutine is responsible for updating the cache.
  • The cache never expires; it is periodically updated by the ConfigCacheUpdater service.
  • The health check simply accesses the cache every time it needs Scylla config.
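A rough Go sketch of how these goals could fit together is shown below. Only the name ConfigCacheUpdater comes from the issue; the `ConfigCache` type, the `NodeInfo` fields, the `Fetch` callback, `demoOutage`, and the host address are assumptions made for illustration, not the eventual implementation. The key properties are that entries are never evicted, a failed refresh leaves the previous entry in place, and readers never touch the agent.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// NodeInfo is a hypothetical, trimmed-down stand-in for the cached config.
type NodeInfo struct {
	CQLPort    string
	TLSEnabled bool
}

// ConfigCache keeps the latest known NodeInfo per host. Entries never expire.
type ConfigCache struct {
	mu    sync.RWMutex
	nodes map[string]NodeInfo
}

func NewConfigCache() *ConfigCache {
	return &ConfigCache{nodes: make(map[string]NodeInfo)}
}

// Get is what the health check would call; it never reaches the agent.
func (c *ConfigCache) Get(host string) (NodeInfo, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	ni, ok := c.nodes[host]
	return ni, ok
}

func (c *ConfigCache) set(host string, ni NodeInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[host] = ni
}

// ConfigCacheUpdater refreshes the cache; Fetch is a hypothetical agent call.
type ConfigCacheUpdater struct {
	Cache *ConfigCache
	Hosts []string
	Fetch func(ctx context.Context, host string) (NodeInfo, error)
}

// UpdateOnce refreshes every host. A failed fetch keeps the previous entry.
func (u *ConfigCacheUpdater) UpdateOnce(ctx context.Context) {
	for _, h := range u.Hosts {
		if ni, err := u.Fetch(ctx, h); err == nil {
			u.Cache.set(h, ni)
		}
	}
}

// Run refreshes immediately and then on every tick until ctx is cancelled.
func (u *ConfigCacheUpdater) Run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		u.UpdateOnce(ctx)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// demoOutage performs one successful refresh followed by one failed refresh
// and returns what a reader sees afterwards.
func demoOutage() (NodeInfo, bool) {
	calls := 0
	u := &ConfigCacheUpdater{
		Cache: NewConfigCache(),
		Hosts: []string{"192.168.100.11"},
		Fetch: func(ctx context.Context, host string) (NodeInfo, error) {
			calls++
			if calls > 1 {
				return NodeInfo{}, errors.New("agent: request timed out")
			}
			return NodeInfo{CQLPort: "9042", TLSEnabled: true}, nil
		},
	}
	u.UpdateOnce(context.Background())
	u.UpdateOnce(context.Background())
	return u.Cache.Get("192.168.100.11")
}

func main() {
	ni, ok := demoOutage()
	fmt.Println(ni.CQLPort, ni.TLSEnabled, ok)
}
```

In a long-running service the updater would be started once with something like `go updater.Run(ctx, interval)`, and the health check would hold only a reference to the cache.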

cc: @d-helios @gmizrahi @mykaul

@Michal-Leszczynski
Collaborator

> To achieve this, the agent is expected to start a background goroutine that periodically checks Scylla's config and updates the cached config.

So then SM would still need to query agent for node info (that would already be cached and wouldn't require interaction with Scylla), or did you mean that the background task and caching happens on SM side?

> The logic must be changed, and the health check service MUST be decoupled from the agent completely.

A simpler solution would be to change the healthcheck's nodeInfo to return "expired" info when newer info cannot be queried, and to try to update it the next time it's needed. On the other hand, the current node info caching works only for the healthcheck service, so decoupling it from healthcheck could allow other services to use this cache as well.
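For comparison, the simpler alternative could look roughly like the Go sketch below. It is an illustration only: `staleOKCache`, its fields, the injected clock, `demoStaleFallback`, and the one-minute TTL are all hypothetical and do not describe the existing healthcheck code. The cache refreshes lazily on read and falls back to the expired entry when the refresh fails.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// NodeInfo is a hypothetical, trimmed-down stand-in for the cached config.
type NodeInfo struct{ CQLPort string }

type entry struct {
	info      NodeInfo
	fetchedAt time.Time
}

// staleOKCache refreshes lazily on Get and serves an expired entry when the
// refresh fails. For simplicity it holds its lock while fetching.
type staleOKCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time                     // injected clock, eases testing
	fetch func(host string) (NodeInfo, error) // hypothetical agent call
	items map[string]entry
}

func (c *staleOKCache) Get(host string) (NodeInfo, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[host]
	if ok && c.now().Sub(e.fetchedAt) < c.ttl {
		return e.info, nil // still fresh
	}
	ni, err := c.fetch(host)
	if err != nil {
		if ok {
			// Expired, but the best information available. The entry is
			// left untouched, so the next Get retries the fetch.
			return e.info, nil
		}
		return NodeInfo{}, err // nothing was ever cached for this host
	}
	c.items[host] = entry{info: ni, fetchedAt: c.now()}
	return ni, nil
}

// demoStaleFallback caches one value, lets it expire, takes the agent down,
// and returns what the next Get yields.
func demoStaleFallback() (NodeInfo, error) {
	clock := time.Unix(0, 0)
	agentUp := true
	c := &staleOKCache{
		ttl:   time.Minute,
		now:   func() time.Time { return clock },
		items: make(map[string]entry),
		fetch: func(host string) (NodeInfo, error) {
			if !agentUp {
				return NodeInfo{}, errors.New("agent: request timed out")
			}
			return NodeInfo{CQLPort: "9042"}, nil
		},
	}
	if _, err := c.Get("192.168.100.11"); err != nil {
		return NodeInfo{}, err
	}
	clock = clock.Add(2 * time.Minute) // entry is now older than the TTL
	agentUp = false
	return c.Get("192.168.100.11")
}

func main() {
	ni, err := demoStaleFallback()
	fmt.Println(ni.CQLPort, err)
}
```

Unlike a background updater, this variant still calls the agent from the health check path whenever an entry has expired, so a slow agent can still delay the check itself.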

@karol-kokoszka
Collaborator Author

> So then SM would still need to query agent for node info (that would already be cached and wouldn't require interaction with Scylla), or did you mean that the background task and caching happens on SM side?

I mean having a background task in SM and the cache on the SM side.
The background task should periodically try to update the cache, but if it doesn't manage to update the cache, that's still fine.

The healthcheck service is supposed to read the already cached (newest) object. Other services should eventually target the same cache.

@karol-kokoszka
Collaborator Author

Another issue to address:
#3815
