
Add GSI cache maintenance and tests #5184


sebastianburckhardt
Contributor

Fixes the problem where a global-single-instance grain becomes permanently unavailable on a cluster because its directory entry points to a remote cluster that is no longer responsive, even if that cluster has been removed from the multi-cluster.

I added tests to expose the problem behavior, and code that implements a fix.

Tests:

The tests use two clusters A and B, create a grain in A, then access it from B (which means cluster B now caches the reference to the grain in A). Then I block communication from A to B (to simulate cluster A going down), and change the multi-cluster configuration to remove A from the multi-cluster (to simulate an admin responding to the outage). Then I try to access the grain from B. Without the fix, this times out trying to contact a non-responding silo in cluster A. With the fix (and if I wait long enough for the cache cleanup to complete before trying to access the grain) this succeeds because the dangling cached reference was removed.
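
To make the failure mode concrete, the scenario can be modeled in a few lines. This is a toy Python simulation, not Orleans code: `Cluster`, `lookup`, `purge_stale`, and `run_scenario` are hypothetical names, a blocked link is represented by raising `TimeoutError`, and the cleanup is reduced to a single function call.

```python
class Cluster:
    """Toy stand-in for a cluster (hypothetical; not an Orleans type)."""

    def __init__(self, name):
        self.name = name
        self.remote_cache = {}   # grain id -> Cluster holding the activation
        self.reachable = True    # False simulates blocked communication
        self.members = set()     # cluster names in the multi-cluster configuration

    def call(self, grain_id):
        if not self.reachable:
            raise TimeoutError(f"no response from cluster {self.name}")
        return f"{grain_id}@{self.name}"


def lookup(local, grain_id):
    """Resolve a grain from `local`, following a cached remote reference if any."""
    owner = local.remote_cache.get(grain_id)
    if owner is not None:
        return owner.call(grain_id)   # times out if the reference is dangling
    return local.call(grain_id)       # no cached reference: activate locally


def purge_stale(local):
    """Drop cached references to clusters that are unreachable or unconfigured."""
    local.remote_cache = {
        grain: owner
        for grain, owner in local.remote_cache.items()
        if owner.reachable and owner.name in local.members
    }


def run_scenario(apply_fix):
    a, b = Cluster("A"), Cluster("B")
    b.members = {"A", "B"}
    b.remote_cache["g1"] = a   # B accessed the grain in A and cached the reference
    a.reachable = False        # block communication with A
    b.members = {"B"}          # admin removes A from the multi-cluster
    if apply_fix:
        purge_stale(b)         # stands in for the cache cleanup
    try:
        return lookup(b, "g1")
    except TimeoutError:
        return "timeout"
```

In this model, `run_scenario(False)` ends in a timeout, while `run_scenario(True)` resolves the grain in cluster B.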

The difference between the two tests is scale (number of grains, silos, clients).

Fix:

I added code that validates GSI remote references stored in the grain directory. It runs periodically (every 30s by default) and also immediately after a multi-cluster configuration change.
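
The two triggers can be sketched as a small scheduler. This is an illustrative Python model only: `ValidationScheduler` and its methods are hypothetical names, time is passed in explicitly so the behavior is deterministic, and restarting the period after a configuration change is an assumption rather than something stated in this PR.

```python
class ValidationScheduler:
    """Hypothetical sketch: runs `validate` on a timer and on configuration changes."""

    def __init__(self, validate, period_seconds=30.0):
        self._validate = validate
        self._period = period_seconds
        self._next_due = period_seconds

    def on_tick(self, now):
        """Periodic trigger. `now` is elapsed time in seconds. Returns True if it ran."""
        if now < self._next_due:
            return False
        self._validate()
        self._next_due = now + self._period
        return True

    def on_configuration_changed(self, now):
        """Immediate trigger after a multi-cluster configuration change."""
        self._validate()
        # Assumption: the periodic timer restarts after an immediate run.
        self._next_due = now + self._period
```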

The validation logic is simple: to check whether a reference to a remote cluster should be kept, it pings the silo, and the silo responds with its cluster ID. This is not a throughput hazard: the validation runs infrequently, there is only one ping per remote silo, and the ping is a tiny message.

If the response indicates the cluster is not part of the configuration, OR if the response times out, we remove the GSI remote reference from the grain directory.
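
The keep-or-remove rule can be sketched as follows. This is a minimal Python model under stated assumptions, not the code in this PR: the directory is a plain dict from grain ID to silo address, `ping` is a hypothetical callable that returns the silo's cluster ID or raises `TimeoutError`, and `validate_remote_references` is an invented name.

```python
def validate_remote_references(directory, ping, configured_clusters):
    """Remove remote references whose silo timed out or is in an unconfigured cluster.

    directory: dict mapping grain id -> remote silo address (mutated in place).
    ping: hypothetical callable; returns the silo's cluster id or raises TimeoutError.
    configured_clusters: set of cluster ids in the current multi-cluster configuration.
    Returns the list of grain ids whose references were removed.
    """
    keep_silo = {}   # silo -> verdict, so each remote silo is pinged only once
    removed = []
    for grain_id, silo in list(directory.items()):
        if silo not in keep_silo:
            try:
                keep_silo[silo] = ping(silo) in configured_clusters
            except TimeoutError:
                keep_silo[silo] = False   # no response: treat the reference as stale
        if not keep_silo[silo]:
            del directory[grain_id]
            removed.append(grain_id)
    return removed
```

Caching the verdict per silo mirrors the "one ping per remote silo" property described above, regardless of how many grains point at that silo.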

This is always safe from a correctness perspective (the reference is logically just a cache). I believe it is also acceptable from a performance perspective: if things are in flux or a ping times out for unrelated reasons, a cached reference may be removed sooner than necessary, but not drastically sooner, since the validation only triggers at a modest period (30s default).

@sergeybykov sergeybykov added this to the 2.3.0 milestone Dec 3, 2018
@sergeybykov sergeybykov merged commit a322ba3 into dotnet:master Dec 12, 2018
sergeybykov pushed a commit to sergeybykov/orleans that referenced this pull request Dec 12, 2018
ReubenBond pushed a commit that referenced this pull request Dec 12, 2018
Add GSI cache maintenance and tests (#5184)
Revert: Fix call chain reentrancy (#5145, #5225) (#5249)
sergeybykov added a commit to sergeybykov/orleans that referenced this pull request Feb 6, 2019
sergeybykov added a commit to sergeybykov/orleans that referenced this pull request Feb 6, 2019
sergeybykov added a commit to sergeybykov/orleans that referenced this pull request Feb 22, 2019
sergeybykov added a commit to sergeybykov/orleans that referenced this pull request Feb 22, 2019
sergeybykov added a commit to sergeybykov/orleans that referenced this pull request Feb 27, 2019
@github-actions github-actions bot locked and limited conversation to collaborators Dec 5, 2023