Add GSI cache maintenance and tests #5184
Merged
Fixes a problem where a global-single-instance (GSI) grain becomes permanently unavailable in a cluster because its directory entry points to a remote cluster that is no longer responsive, even after that cluster has been removed from the multi-cluster.
I added tests to expose the problem behavior, and code that implements a fix.
Tests:
The tests use two clusters, A and B: they create a grain in A, then access it from B (so cluster B now caches the reference to the grain in A). I then block communication from A to B (to simulate cluster A going down) and change the multi-cluster configuration to remove A (to simulate an admin responding to the outage). Finally, I try to access the grain from B. Without the fix, this call times out trying to contact the unresponsive silo in cluster A. With the fix (and given enough time for the cache cleanup to run before accessing the grain), the call succeeds because the dangling cached reference has been removed.
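To make the scenario concrete, here is a minimal sketch of the test flow. All helper names (`NewCluster`, `BlockCommunication`, `RemoveClusterFromMultiClusterConfig`, `IMyGlobalSingleInstanceGrain`) are hypothetical stand-ins for the actual test infrastructure, not the code in this PR.

```csharp
// Minimal sketch of the test flow; the helpers used here are hypothetical
// stand-ins for the actual multi-cluster test infrastructure.
[Fact]
public async Task StaleGsiReferenceIsRemovedAfterClusterIsRetired()
{
    var clusterA = await NewCluster("A");
    var clusterB = await NewCluster("B");

    // Activate the grain in A, then touch it from B so that B caches a
    // directory reference pointing at the activation in A.
    await clusterA.GetGrain<IMyGlobalSingleInstanceGrain>(0).Ping();
    await clusterB.GetGrain<IMyGlobalSingleInstanceGrain>(0).Ping();

    // Simulate cluster A going down, and an admin removing it from the
    // multi-cluster configuration in response.
    BlockCommunication(clusterA, clusterB);
    await RemoveClusterFromMultiClusterConfig("A");

    // Wait long enough for the periodic cache cleanup to run, then access
    // the grain from B. Without the fix this call times out against the
    // unresponsive silo in A; with the fix the stale cached reference has
    // been removed and the call succeeds.
    await Task.Delay(TimeSpan.FromSeconds(60));
    await clusterB.GetGrain<IMyGlobalSingleInstanceGrain>(0).Ping();
}
```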
The difference between the two tests is scale (number of grains, silos, clients).
Fix:
I added code that validates the GSI remote references stored in the grain directory. It is triggered both periodically (every 30 seconds by default) and immediately after a multi-cluster configuration change, as sketched below.
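A rough sketch of the two triggers, assuming a hypothetical `GsiCacheMaintainer` class; the names do not match the actual Orleans internals and the validation body is elided.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: one component owning both triggers for the
// GSI cache validation described above.
public sealed class GsiCacheMaintainer : IDisposable
{
    private static readonly TimeSpan Period = TimeSpan.FromSeconds(30); // default period
    private readonly Timer _timer;

    public GsiCacheMaintainer()
    {
        // Trigger 1: periodic validation, every 30 seconds by default.
        _timer = new Timer(OnTimer, null, Period, Period);
    }

    // Trigger 2: invoked immediately after a multi-cluster configuration change.
    public void OnMultiClusterConfigurationChanged()
    {
        _ = ValidateCachedGsiReferencesAsync();
    }

    private void OnTimer(object state)
    {
        // Fire-and-forget; a real implementation would log failures.
        _ = ValidateCachedGsiReferencesAsync();
    }

    private Task ValidateCachedGsiReferencesAsync()
    {
        // Walks the cached GSI remote references and applies the validation
        // rule sketched further below (ping the remote silo, drop stale entries).
        return Task.CompletedTask;
    }

    public void Dispose() => _timer.Dispose();
}
```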
The validation logic is simple: to decide whether a reference to a remote cluster should be kept, we ping the silo, and the silo responds with its cluster ID. This is not a throughput hazard: the validation runs infrequently, there is only one ping per remote silo, and the ping is a tiny message.
If the response indicates that the cluster is no longer part of the configuration, or if the ping times out, we remove the GSI remote reference from the grain directory.
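A sketch of that decision rule for a single cached remote reference. `PingSiloForClusterIdAsync`, `RemoveCachedGsiReference`, and the 5-second timeout are hypothetical; only the keep-or-remove logic mirrors the description above.

```csharp
// Hypothetical sketch of validating one cached GSI remote reference.
private async Task ValidateRemoteReferenceAsync(
    SiloAddress remoteSilo,
    GrainId grain,
    HashSet<string> clustersInConfiguration)
{
    // One small ping per remote silo; the silo answers with its cluster id.
    var pingTask = PingSiloForClusterIdAsync(remoteSilo);
    var winner = await Task.WhenAny(pingTask, Task.Delay(TimeSpan.FromSeconds(5)));

    if (winner != pingTask)
    {
        // The ping timed out: treat the cached reference as stale and drop it.
        RemoveCachedGsiReference(grain, remoteSilo);
        return;
    }

    string clusterId = await pingTask;
    if (!clustersInConfiguration.Contains(clusterId))
    {
        // The silo answered, but its cluster is no longer part of the
        // multi-cluster configuration, so the reference is dropped as well.
        RemoveCachedGsiReference(grain, remoteSilo);
    }
}
```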
Removing the entry is always safe from a correctness perspective (the reference is logically just a cache). From a performance perspective I believe it is also fine: if things are in flux or a ping times out for an unrelated reason, the cache entry is simply removed sooner than necessary, but not dramatically so, since the whole validation runs at a modest period (30 seconds by default).