CachedGrainLocator Unregister Ordering #8547
Conversation
… grain directory so we stop serving bad entries as soon as possible
I fear that it will introduce a race condition with
It is possible we clear a correct entry from the cache, but that tradeoff seems worthwhile compared to the alternative of serving a wrong entry from the cache for a longer period of time. I also believe this same race exists today: two separate requests can end up on the wrong silo after a grain moves. If one is slower than the other, the first could return, clear the entry, and retry, causing a lookup to populate the cache with the new correct value; then the second call returns and clears the good entry, thinking it is wrong. Or is there a separate race you are thinking about?
Sorry if I wasn't clear, but my fear is that you have two calls in parallel: one lookup and one unregistration.
Now you have a poisoned entry in the cache. I wonder if we should just try to clear the cache before AND after the directory unregistration call? Maybe the best thing would be to "block" the lookup call when there is a concurrent unregistration call targeting the same activation.
From my recollection of this path (correct me if the below is wrong), the race you mention above behaves very similarly to today. Today the cache is essentially poisoned while we wait for the async unregister to finish, and only then do we remove the poisoned entry.

The benefit of clearing before instead of after is that, when the grain was deactivated gracefully, the racing lookup will actually return the correct result (since deactivation would have cleared the entry from the directory), and the async Unregister call should be a no-op. Only when the grain was deactivated ungracefully (silo crash scenario) would the directory return the wrong answer, causing us to temporarily cache the wrong entry again. If this happens it will behave the same as today, where another few calls can hit the wrong silo until we clear the cache again after discovering we went to the wrong silo.

I think modifying it to something like this should strike a good balance between ensuring we don't re-insert a bad entry into the cache and minimizing extra full lookups:
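A minimal sketch of that ordering, using hypothetical stand-in types (`IGrainDirectory`, `CachedLocatorSketch`) rather than Orleans' actual CachedGrainLocator internals: the cache entry is dropped before the awaited directory call and again after it completes, so a racing lookup that re-caches a stale directory answer is also cleaned up.

```csharp
// Hypothetical sketch only: these types stand in for Orleans' cached grain
// locator and grain directory; they are not the real Orleans API.
using System.Collections.Concurrent;
using System.Threading.Tasks;

public interface IGrainDirectory
{
    Task Unregister(string grainId, string activationAddress);
    Task<string?> Lookup(string grainId);
}

public sealed class CachedLocatorSketch
{
    private readonly IGrainDirectory _directory;
    private readonly ConcurrentDictionary<string, string> _cache = new();

    public CachedLocatorSketch(IGrainDirectory directory) => _directory = directory;

    public async Task Unregister(string grainId, string knownAddress)
    {
        // 1. Drop the cached entry immediately so concurrent callers stop being
        //    routed to the (possibly dead) activation while the directory call
        //    is still in flight.
        _cache.TryRemove(grainId, out _);

        // 2. Remove the registration from the grain directory. Under high
        //    directory latency this is the slow part.
        await _directory.Unregister(grainId, knownAddress);

        // 3. Clear the cache again: a Lookup that raced with step 2 may have
        //    re-populated the cache from a stale directory entry (ungraceful
        //    deactivation), so remove whatever landed there in the meantime.
        _cache.TryRemove(grainId, out _);
    }

    public async Task<string?> Lookup(string grainId)
    {
        if (_cache.TryGetValue(grainId, out var cached))
            return cached;

        var address = await _directory.Lookup(grainId);
        if (address is not null)
            _cache[grainId] = address;
        return address;
    }
}
```

In the graceful-deactivation case, step 2 is effectively a no-op and the racing lookup already gets the fresh answer; only an ungraceful crash leaves a window where a wrong entry can be cached again, which is the same exposure as today.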
Let me know if the above change sounds good and I'll update the code and tests to reflect it.
I think it does, yes.
…t might re-add to cache is cleared when Unregister is done
@benjaminpetit PTAL
When the grain directory experiences high latency, it's possible that we hold a bad cached value in memory longer than desired. This can result in silos that retry due to NonExistentActivation exceptions continuing to retry against the same silo that does not own the grain. Instead, if we remove the entry from the local cache first, then no matter how long the Unregister takes, other calls will be able to re-read the grain directory and find the new grain location.
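As a rough illustration of that claim, building on the hypothetical CachedLocatorSketch above (again, not the real Orleans types): even while a slow Unregister is still awaiting the directory, a concurrent lookup falls through to the directory and picks up the grain's new location instead of the stale cached one.

```csharp
// Usage sketch for the hypothetical types above; the 5-second delay simulates
// a grain directory under high latency.
using System;
using System.Threading.Tasks;

public sealed class SlowDirectory : IGrainDirectory
{
    public string? CurrentAddress;

    public async Task Unregister(string grainId, string activationAddress) =>
        await Task.Delay(TimeSpan.FromSeconds(5));

    public Task<string?> Lookup(string grainId) => Task.FromResult(CurrentAddress);
}

public static class Demo
{
    public static async Task Main()
    {
        var directory = new SlowDirectory { CurrentAddress = "silo-A" };
        var locator = new CachedLocatorSketch(directory);

        Console.WriteLine(await locator.Lookup("grain-1")); // "silo-A" is now cached

        // The grain moves: the directory points at silo-B, the cached value is stale.
        directory.CurrentAddress = "silo-B";

        // Start the slow unregistration of the old address without waiting for it.
        var unregister = locator.Unregister("grain-1", "silo-A");

        // The cache was cleared before the directory call was awaited, so this
        // re-reads the directory and finds silo-B instead of retrying silo-A.
        Console.WriteLine(await locator.Lookup("grain-1")); // "silo-B"

        await unregister;
    }
}
```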