Add GSI cache maintenance and tests #5184
Merged
Fixes a problem where a global-single-instance (GSI) grain becomes permanently unavailable in a cluster because its directory entry points to a remote cluster that is no longer responsive, even after that cluster has been removed from the multi-cluster.
I added tests to expose the problem behavior, and code that implements a fix.
Tests:
The tests use two clusters, A and B: they create a grain in A, then access it from B (so cluster B now caches the reference to the grain in A). I then block communication from A to B (to simulate cluster A going down) and change the multi-cluster configuration to remove A (to simulate an admin responding to the outage). Finally, I try to access the grain from B. Without the fix, this call times out trying to contact the unresponsive silo in cluster A. With the fix (and given enough time for the cache cleanup to run before accessing the grain), the call succeeds because the dangling cached reference has been removed.
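To make the scenario concrete, here is a minimal sketch of the test flow. All helper names (`NewCluster`, `BlockCommunication`, `RemoveClusterFromMultiClusterConfig`, `IMyGlobalSingleInstanceGrain`) are hypothetical stand-ins for the actual test infrastructure, not the code in this PR.

```csharp
// Minimal sketch of the test flow; the helpers used here are hypothetical
// stand-ins for the actual multi-cluster test infrastructure.
[Fact]
public async Task StaleGsiReferenceIsRemovedAfterClusterIsRetired()
{
    var clusterA = await NewCluster("A");
    var clusterB = await NewCluster("B");

    // Activate the grain in A, then touch it from B so that B caches a
    // directory reference pointing at the activation in A.
    await clusterA.GetGrain<IMyGlobalSingleInstanceGrain>(0).Ping();
    await clusterB.GetGrain<IMyGlobalSingleInstanceGrain>(0).Ping();

    // Simulate cluster A going down, and an admin removing it from the
    // multi-cluster configuration in response.
    BlockCommunication(clusterA, clusterB);
    await RemoveClusterFromMultiClusterConfig("A");

    // Wait long enough for the periodic cache cleanup to run, then access
    // the grain from B. Without the fix this call times out against the
    // unresponsive silo in A; with the fix the stale cached reference has
    // been removed and the call succeeds.
    await Task.Delay(TimeSpan.FromSeconds(60));
    await clusterB.GetGrain<IMyGlobalSingleInstanceGrain>(0).Ping();
}
```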
The difference between the two tests is scale (number of grains, silos, clients).
Fix:
I added code that validates the GSI remote references stored in the grain directory. It is triggered both periodically (every 30 seconds by default) and immediately after a multi-cluster configuration change, as sketched below.
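A rough sketch of the two triggers, assuming a hypothetical `GsiCacheMaintainer` class; the names do not match the actual Orleans internals and the validation body is elided.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: one component owning both triggers for the
// GSI cache validation described above.
public sealed class GsiCacheMaintainer : IDisposable
{
    private static readonly TimeSpan Period = TimeSpan.FromSeconds(30); // default period
    private readonly Timer _timer;

    public GsiCacheMaintainer()
    {
        // Trigger 1: periodic validation, every 30 seconds by default.
        _timer = new Timer(OnTimer, null, Period, Period);
    }

    // Trigger 2: invoked immediately after a multi-cluster configuration change.
    public void OnMultiClusterConfigurationChanged()
    {
        _ = ValidateCachedGsiReferencesAsync();
    }

    private void OnTimer(object state)
    {
        // Fire-and-forget; a real implementation would log failures.
        _ = ValidateCachedGsiReferencesAsync();
    }

    private Task ValidateCachedGsiReferencesAsync()
    {
        // Walks the cached GSI remote references and applies the validation
        // rule sketched further below (ping the remote silo, drop stale entries).
        return Task.CompletedTask;
    }

    public void Dispose() => _timer.Dispose();
}
```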
The validation logic is simple: to decide whether a reference to a remote cluster should be kept, we ping the silo, and the silo responds with its cluster ID. This is not a throughput hazard: the validation runs infrequently, there is only one ping per remote silo, and the ping is a tiny message.
If the response indicates that the cluster is no longer part of the configuration, or if the ping times out, we remove the GSI remote reference from the grain directory.
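A sketch of that decision rule for a single cached remote reference. `PingSiloForClusterIdAsync`, `RemoveCachedGsiReference`, and the 5-second timeout are hypothetical; only the keep-or-remove logic mirrors the description above.

```csharp
// Hypothetical sketch of validating one cached GSI remote reference.
private async Task ValidateRemoteReferenceAsync(
    SiloAddress remoteSilo,
    GrainId grain,
    HashSet<string> clustersInConfiguration)
{
    // One small ping per remote silo; the silo answers with its cluster id.
    var pingTask = PingSiloForClusterIdAsync(remoteSilo);
    var winner = await Task.WhenAny(pingTask, Task.Delay(TimeSpan.FromSeconds(5)));

    if (winner != pingTask)
    {
        // The ping timed out: treat the cached reference as stale and drop it.
        RemoveCachedGsiReference(grain, remoteSilo);
        return;
    }

    string clusterId = await pingTask;
    if (!clustersInConfiguration.Contains(clusterId))
    {
        // The silo answered, but its cluster is no longer part of the
        // multi-cluster configuration, so the reference is dropped as well.
        RemoveCachedGsiReference(grain, remoteSilo);
    }
}
```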
Removing the entry is always safe from a correctness perspective (the reference is logically just a cache). From a performance perspective I believe it is also fine: if things are in flux or a ping times out for an unrelated reason, the cache entry is simply removed sooner than necessary, but not dramatically so, since the whole validation runs at a modest period (30 seconds by default).