Cache NodeInfo outside of healthcheck service #3767

karol-kokoszka · 2024-03-26T15:51:16Z

The purpose of the health check service is simple: to report whether each of the following holds:

The agent's API (REST ping) is reachable.
The alternator session (alternator ping) is reachable.
The CQL session (CQL ping) is reachable.

Scylla Manager may report false positives for the CQL ping and alternator ping if the agent's API is unresponsive or overloaded, and does not respond within the expected time. This occurs because both the CQL and alternator pings involve retrieving basic information about the nodes using the agent's node_info endpoint, which concatenates configuration-related responses from the Scylla API and returns them to the caller.

NodeInfo is necessary to properly build the client: it contains the encryption settings and ports needed to establish the session. However, this setup means the health check service not only verifies the CQL session but also assumes that the agent's API is fully responsive. If the API is not responsive at the expected level, the health check produces false positives regarding the ability to create the CQL session and query the data.

          /--- agent's API to get Scylla's config (and cache it for short time)
Manager   ---- create CQL session with a single node and query simple data

Set the status of healthcheck_cql_state based on these two calls.
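
The coupling in the diagram can be illustrated with a minimal Go sketch. Every name in it (`nodeInfo`, `checkCQL`, `slowAgent`, the status strings) is hypothetical and not taken from the Scylla Manager code base; the only assumption carried over from the description is that the CQL status is derived from both the agent call and the CQL ping.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// nodeInfo is a hypothetical, reduced stand-in for the agent's node_info response.
type nodeInfo struct {
	CQLPort    string
	TLSEnabled bool
}

// Both dependencies are injected so the sketch runs without a cluster.
type fetchNodeInfo func(ctx context.Context) (nodeInfo, error)
type pingCQL func(ctx context.Context, ni nodeInfo) error

// checkCQL mirrors the coupled flow: the reported status depends on the
// agent call AND the CQL ping, so a slow agent alone fails the check.
func checkCQL(ctx context.Context, fetch fetchNodeInfo, ping pingCQL, timeout time.Duration) string {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ni, err := fetch(ctx)
	if err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return "TIMEOUT"
		}
		return "DOWN"
	}
	if err := ping(ctx, ni); err != nil {
		return "DOWN"
	}
	return "UP"
}

// slowAgent simulates an overloaded agent API that never answers in time.
func slowAgent(ctx context.Context) (nodeInfo, error) {
	<-ctx.Done()
	return nodeInfo{}, ctx.Err()
}

// healthyCQL simulates a node whose CQL port answers without problems.
func healthyCQL(ctx context.Context, ni nodeInfo) error { return nil }

// coupledStatus runs the scenario from the issue: CQL is healthy, the agent is not.
func coupledStatus() string {
	return checkCQL(context.Background(), slowAgent, healthyCQL, 20*time.Millisecond)
}

func main() {
	fmt.Println(coupledStatus())
}
```

In this sketch the CQL ping is never attempted, yet the CQL status is reported as a timeout, which is the kind of misreport the issue describes.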

The logic must be changed, and the health check service MUST be decoupled from the agent completely. To achieve this, the agent is expected to start a background goroutine that periodically checks Scylla's config and updates the cached config. The health check service is expected to maintain a reference to the cache and retrieve the latest Scylla config from there without calling the agent's API directly. If the API is unresponsive, then the cache may be outdated for some time, but situations where the configuration of a particular node changes are very rare.

The goal:

The health check should not call the agent's API at all.
Another service working in a separate goroutine is responsible for updating the cache.
The cache never expires; it is periodically updated by the ConfigCacheUpdater service.
The health check simply accesses the cache every time it needs Scylla config.
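
The four goals above could be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the implementation that was merged: `ConfigCache`, `NodeConfig`, `fetchFunc`, and the host address are hypothetical names, and the agent call is replaced by an injected function so the example runs on its own.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// NodeConfig is a hypothetical subset of what NodeInfo carries.
type NodeConfig struct {
	CQLPort    string
	TLSEnabled bool
}

// fetchFunc stands in for the call to the agent's node_info endpoint.
type fetchFunc func(ctx context.Context, host string) (NodeConfig, error)

// ConfigCache keeps the last successfully fetched config per host.
// Entries never expire; they are only overwritten by newer data.
type ConfigCache struct {
	mu      sync.RWMutex
	configs map[string]NodeConfig
}

func NewConfigCache() *ConfigCache {
	return &ConfigCache{configs: make(map[string]NodeConfig)}
}

// Read is what the health check would call; it never touches the agent.
func (c *ConfigCache) Read(host string) (NodeConfig, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cfg, ok := c.configs[host]
	return cfg, ok
}

// update refreshes every host; a failed fetch keeps the previous entry.
func (c *ConfigCache) update(ctx context.Context, hosts []string, fetch fetchFunc) {
	for _, h := range hosts {
		cfg, err := fetch(ctx, h)
		if err != nil {
			continue // a stale entry is preferred over no entry
		}
		c.mu.Lock()
		c.configs[h] = cfg
		c.mu.Unlock()
	}
}

// Run updates once immediately, then on every tick until ctx is cancelled.
func (c *ConfigCache) Run(ctx context.Context, interval time.Duration, hosts []string, fetch fetchFunc) {
	c.update(ctx, hosts, fetch)
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			c.update(ctx, hosts, fetch)
		}
	}
}

// demo lets the first fetch succeed and every later one fail, then reads the cache.
func demo() (NodeConfig, bool) {
	calls := 0 // only touched by the updater goroutine
	fetch := func(ctx context.Context, host string) (NodeConfig, error) {
		calls++
		if calls > 1 {
			return NodeConfig{}, errors.New("agent unresponsive")
		}
		return NodeConfig{CQLPort: "9042"}, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	cache := NewConfigCache()
	done := make(chan struct{})
	go func() {
		defer close(done)
		cache.Run(ctx, 10*time.Millisecond, []string{"192.168.100.11"}, fetch)
	}()
	<-done
	return cache.Read("192.168.100.11")
}

func main() {
	cfg, ok := demo()
	fmt.Println(cfg.CQLPort, ok)
}
```

The reader path takes only a read lock and returns whatever was stored last, so an unresponsive agent can delay the updater goroutine but not the health check.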

cc: @d-helios @gmizrahi @mykaul


Michal-Leszczynski · 2024-03-27T10:58:00Z

> To achieve this, the agent is expected to start a background goroutine that periodically checks Scylla's config and updates the cached config.

So then SM would still need to query agent for node info (that would already be cached and wouldn't require interaction with Scylla), or did you mean that the background task and caching happens on SM side?

> The logic must be changed, and the health check service MUST be decoupled from the agent completely.

A simpler solution would be to just change the healthcheck's nodeInfo to return "expired" info when newer info cannot be queried, and to try updating it the next time it's needed. On the other hand, the current node info caching works only for the healthcheck service, so decoupling it from healthcheck could mean that other services can use this cache as well.
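
The simpler alternative, returning expired info when a refresh fails, might look like the Go sketch below. `staleCache`, `NodeInfo` with a single field, the TTL value, and the injected `fetch` and `now` functions are all hypothetical and exist only to make the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// NodeInfo is reduced to one field for the sake of the example.
type NodeInfo struct{ CQLPort string }

type entry struct {
	info    NodeInfo
	fetched time.Time
}

// staleCache serves fresh entries while they are within ttl and falls back
// to the expired entry when the refresh fails.
type staleCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time
	fetch   func(host string) (NodeInfo, error) // stands in for the agent call
	entries map[string]entry
}

// nodeInfo holds the lock during fetch for brevity; real code may not want that.
func (c *staleCache) nodeInfo(host string) (NodeInfo, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.entries[host]
	if ok && c.now().Sub(e.fetched) < c.ttl {
		return e.info, nil
	}
	ni, err := c.fetch(host)
	if err != nil {
		if ok {
			return e.info, nil // expired but usable; refresh is retried on the next call
		}
		return NodeInfo{}, err
	}
	c.entries[host] = entry{info: ni, fetched: c.now()}
	return ni, nil
}

// expiredFallbackPort populates the cache, lets the entry expire, makes the
// agent fail, and reports what the second lookup returns.
func expiredFallbackPort() string {
	clock := time.Unix(0, 0)
	agentUp := true
	c := &staleCache{
		ttl: time.Minute,
		now: func() time.Time { return clock },
		fetch: func(host string) (NodeInfo, error) {
			if !agentUp {
				return NodeInfo{}, errors.New("agent unresponsive")
			}
			return NodeInfo{CQLPort: "9042"}, nil
		},
		entries: map[string]entry{},
	}
	_, _ = c.nodeInfo("node1")         // first call populates the cache
	clock = clock.Add(2 * time.Minute) // the entry is now past its ttl
	agentUp = false

	ni, err := c.nodeInfo("node1")
	if err != nil {
		return "error"
	}
	return ni.CQLPort
}

func main() {
	fmt.Println(expiredFallbackPort())
}
```

Note that in this variant the refresh still happens on the health check's own call path, so a slow agent would still delay the check until the fetch gives up; that is one way it differs from a cache updated by a separate background task.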

karol-kokoszka · 2024-03-27T11:01:52Z

> So then SM would still need to query agent for node info (that would already be cached and wouldn't require interaction with Scylla), or did you mean that the background task and caching happens on SM side?

I mean to have a background task in SM and the cache on the SM side.
The background task should periodically call the agent to update the cache, but if it doesn't manage to update the cache, that's still fine.

The healthcheck service (and eventually other services, which should target the same cache) is supposed to read the already cached (newest) object.

karol-kokoszka · 2024-04-15T13:50:43Z

Order:

karol-kokoszka · 2024-04-19T15:50:00Z

Another issue to address:
#3815

karol-kokoszka added the healthcheck label Mar 26, 2024

karol-kokoszka added this to the 3.2.8 milestone Apr 3, 2024

karol-kokoszka self-assigned this Apr 5, 2024

karol-kokoszka added the epic label Apr 15, 2024

This was referenced Apr 15, 2024

feat(config-cache): initial stub for cluster config cache service #3803

Merged

Store CQL and Alternator TLS configs directly in NodeInfo instead of using map #3815

Closed

karol-kokoszka mentioned this issue Apr 19, 2024

Feature branch for Config Cache Service #3816

Merged

karol-kokoszka mentioned this issue Apr 29, 2024

Add label parameter to cluster and tasks #3828

Closed

karol-kokoszka closed this as completed in #3816 May 8, 2024

karol-kokoszka mentioned this issue May 9, 2024

Release 3.2.8 #3842

Merged
