
fix(core) make sure that cassandra handles cluster event consistency #5032

Closed
wants to merge 2 commits into from

Conversation

scaat
Contributor

@scaat scaat commented Sep 13, 2019

Summary

The default consistency for cluster events is LOCAL_ONE, which does not match the stronger consistency that users configure for Cassandra and may result in inaccurate data.

This PR reads the consistency setting from Kong's configuration so that cluster event queries match the user's expectations.
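
For illustration, here is a minimal sketch of the idea behind the patch, assuming lua-cassandra's `consistency` query option and Kong's `cassandra_consistency` configuration property; it is not the actual diff:

```lua
-- Sketch only: illustrative names, not the code changed by this PR.
local cassandra = require "cassandra"

-- Map the kong.conf value (e.g. "LOCAL_QUORUM") to a lua-cassandra constant.
local function consistency_from_config(kong_config)
  local name = (kong_config.cassandra_consistency or "LOCAL_ONE"):lower()
  return cassandra.consistencies[name]
end

-- Instead of a hard-coded LOCAL_ONE, the cluster events insert/poll queries
-- would pass the user's setting along, e.g.:
-- connector:query(cql, args, { consistency = consistency_from_config(kong_config) })
```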

@bungle bungle requested a review from thibaultcha September 13, 2019 09:08
@thibaultcha
Member

Hello,

Thank you for proposing the patch. The decision to not use the user-specified Cassandra consistency in the cluster events module is a deliberate one.

Putting aside the fact that the Cassandra consistency settings need a lot more granularity than today's configuration properties offer (cc @bungle, with whom we are working on this), the cluster events module deserves its own consistency settings. In order to avoid over-configuration and, most of all, misconfiguration from users unfamiliar with Cassandra, the consistency settings were hard-coded to the most appropriate default: a datacenter-aware, high-availability consistency setting. Given the chattiness of the cluster events insertion and polling, we believe the value is appropriate, and we decided to overcome eventual consistency scenarios by introducing the db_update_propagation configuration property. What this property does is query the cluster_events table with a "grace period" (or backwards window), so that if an event pre-dating the current window was received late by a replica node, it can still be picked up and processed by polling nodes (more details in the original PR introducing the current caching and invalidation mechanisms: #2561).

As such, I don't think that we should merge such a patch. For now, you can configure db_update_propagation to a value higher than 0, and a "safety" grace period should, imho, be introduced in order to always give eventually consistent past events a chance to be polled and processed. In fact, I assumed this was already the case, but I must have been mistaken; it was probably a plan I had in mind and never got a chance to address.
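
For example, in kong.conf (the value below is illustrative; pick one that reflects your cluster's actual replication lag):

```
# Time, in seconds, that polling nodes look back to catch late-replicated
# cluster events. The default of 0 means no grace period.
db_update_propagation = 1
```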

@scaat
Contributor Author

scaat commented Sep 18, 2019

Thank you for your reply, @thibaultcha. I have two questions:

  1. The default setting is LOCAL_ONE. Even with db_update_propagation set, is it still possible for some events to be lost when a node polls for cluster events?

  2. In my scenario there are multiple data centers, each with 3 nodes, a repl_factor of 3 per data center, and LOCAL_QUORUM consistency. Can the default LOCAL_ONE for cluster events still guarantee strong consistency and no loss of cluster events?

@thibaultcha
Member

  1. The default setting is LOCAL_ONE. Even with db_update_propagation set, is it still possible for some events to be lost when a node polls for cluster events?

By using LOCAL_ONE, the cluster events polling will only contact a single peer, so if the peer has not received the event (i.e. the replica row) between the instants t and t + db_update_propagation + 0.001 (see https://github.com/Kong/kong/blob/master/kong/cluster_events.lua#L220), it may be lost. However, using a higher consistency setting may be too expensive for a cluster at scale handling a lot of CRUD operations, so this change is somewhat risky. A lower consistency with a grace period is a more efficient alternative imho, and it also helps cover cross-datacenter latency issues (without having to rely on a non-local consistency setting). Besides, note that when an invalidation cluster event is received, the associated updated entity will be queried as well. Until we can offer more granularity between proxy server and Admin API queries (the former requiring availability and the latter requiring consistency), the db_update_propagation property is a helpful solution.
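
A rough sketch of the "backwards window" idea described above (names are illustrative, not Kong's actual cluster_events.lua internals; see the linked source for the real implementation):

```lua
-- Sketch only: how a grace period lets a LOCAL_ONE poll tolerate late replicas.
local function poll_window(poll_cursor, db_update_propagation)
  -- Look back an extra propagation delay (plus a small epsilon) so that a
  -- row that reached this replica late can still be picked up.
  local min_at = poll_cursor - db_update_propagation - 0.001
  local max_at = ngx.now()  -- assumes the OpenResty ngx API is available
  return min_at, max_at
end

-- Re-seeing an event that was already processed is harmless here, since
-- invalidating the same cache key twice is idempotent.
```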

  2. In my scenario there are multiple data centers, each with 3 nodes, a repl_factor of 3 per data center, and LOCAL_QUORUM consistency. Can the default LOCAL_ONE for cluster events still guarantee strong consistency and no loss of cluster events?

With db_update_propagation, yes, but not with guarantees as strong as using a higher consistency setting. As I explained above, we may still prefer using LOCAL_ONE for the cluster events queries. Increasing the grace period (and always having one by default) is very much needed imho. And again, granular consistency settings would allow for a healthy consistency setting for proxy server queries, Admin API queries, and cluster events.

If you are running a large C*-backed production cluster, we'd be interested in hearing your feedback if you chose to run your cluster events queries with a higher consistency setting.

@scaat
Contributor Author

scaat commented Sep 20, 2019

Thank you for your reply, @thibaultcha.

If event loss cannot be avoided, what should be done when an event is lost?

@thibaultcha
Member

@scaat The /cache Admin API endpoint was implemented for this purpose. I don't believe that it has yet been properly documented, but you can browse the source here: https://github.com/Kong/kong/blob/master/kong/api/routes/cache.lua
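
For reference, a quick usage sketch against the Admin API; the paths below follow the routes in that file, but double-check them against your Kong version:

```sh
# Inspect a cached entity by its cache key (404 if it is not in this node's cache)
curl -i http://localhost:8001/cache/{cache_key}

# Evict a single cached entity
curl -i -X DELETE http://localhost:8001/cache/{cache_key}

# Evict this node's entire entity cache (use with care on busy nodes)
curl -i -X DELETE http://localhost:8001/cache
```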

@scaat
Contributor Author

scaat commented Sep 24, 2019

Thank you for your reply, @thibaultcha.

In addition to manually using the /cache API, can I directly set db_cache_ttl to automatically clean up the cache?

In short, is the effect of using the /cache API the same as setting db_cache_ttl?

@thibaultcha
Member

In addition to manually using the /cache API, can I directly set db_cache_ttl to automatically clean up the cache?

Yes, this is also the use case behind db_cache_ttl (a safety measure). Beware of possible spikes in DB traffic when a significant portion of your cache reaches its TTL at the same time; it can happen.
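
For example, in kong.conf (illustrative value; the default of 0 means cached entities never expire):

```
# Time-to-live, in seconds, of entities in the node-level cache.
db_cache_ttl = 3600
```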

@kikito
Member

kikito commented Oct 7, 2019

Hi, since the conversation seems to have ended and we decided not to merge this PR, I'm going to close it down. If anyone feels the need to reopen, please do so.

@kikito kikito closed this Oct 7, 2019