Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CachedGrainLocator Unregister Ordering #8547

Conversation

JohnMorman
Copy link
Contributor

@JohnMorman JohnMorman commented Jul 18, 2023

When a grain directory experiences high latency it's possible that we hold a bad cached value in memory longer than desired. This can result in silos that retry due to NonExistantActivation exceptions continuing to retry to the same silo that does not own a grain. Instead, if we remove from the local cache first no matter how long the Unregister takes other calls will be able to re-read the grain directory to find the new grain location.

Microsoft Reviewers: Open in CodeFlow

… grain directory so we stop serving bad entries as soon as possible
@benjaminpetit
Copy link
Member

I fear that it will introduce a race condition with Lookup since this method also populate the cache, no?

@JohnMorman
Copy link
Contributor Author

JohnMorman commented Jul 31, 2023

It is possible we clear a correct entry from the cache however that tradeoff seems worthwhile to the alternative of serving a wrong entry from the cache for a longer period of time. I also believe this same race exists today. Two separate requests can end up on the wrong silo after a grain moves, if one is slower than the other, the first could return, clear the entry and retry causing a lookup to populate the cache with the new correct value, then the second call returns and clears the good entry thinking it is wrong

Or is there a separate race you are thinking about?

@benjaminpetit
Copy link
Member

Sorry if I wasn't clear, but my fear is that you have in parallel two calls: one lookup, one unregistration:

  • unregistration call starts. It clear the cache, then is about to send the unregistration request but...
  • lookup call starts. Cache is empty, so it will get the info from the directory.
  • directory unregistration is successful
  • lookup call now add the result in cache.

Now you have a poisoned entry in the cache.

I wonder if we should just try to clear the cache before AND after the directory registration call?

Mayve the best thing would be to "block" the lookup call when there is a concurrent unregistration call targeting the same activation.

@JohnMorman
Copy link
Contributor Author

From my recollection of this path and correct me if below is wrong. The race you mention above has very similar behavior as today. Today essentially the cache is poisoned while we wait for the async unregister to finish and then remove the poisoned entry. The benefit of clearing before instead of after is that in cases where the grain was deactivated gracefully, the lookup that is racing will actually return the correct result (since to deactivate it would have cleared the entry from the directory), and the async Unregister call should be a noop. Only when the grain was deactivated ungracefully (silo crash scenario) would the directory return the wrong answer causing us to temporarily cache the wrong entry again. If this happens it'll behave the same as today where another few calls can hit the wrong silo till we clear the cache again after discovering, we went to the wrong silo.

I think modifying to something like this should provide a good balance on ensuring we don't re-insert a bad entry into the cache and also minimize extra full lookups

public async Task Unregister(ActivationAddress address, UnregistrationCause cause)
{
    // Remove from local cache first incase the GrainDirectory Unregister may take time. 
    // This will avoid retries hitting the same remote silo based on cached data over and over 
    this.cache.Remove(address.Grain);
    await GetGrainDirectory(address.Grain).Unregister(address.ToGrainAddress());

    if (this.cache.LookUp(address.Grain, out var entry) && entry[0].Item2 == address.Activation)
    {
        this.cache.Remove(address.Grain);
    }
}

@JohnMorman
Copy link
Contributor Author

Let me know if above change sounds good and I'll update code / tests to reflect.

@benjaminpetit
Copy link
Member

I think it does, yes.

@ReubenBond
Copy link
Member

@benjaminpetit PTAL

@benjaminpetit benjaminpetit merged commit 6a4cc90 into dotnet:main Aug 16, 2023
@JohnMorman JohnMorman deleted the dev/jmorman/grainDirectoryUnregisterOrdering branch August 22, 2023 12:50
@github-actions github-actions bot locked and limited conversation to collaborators Dec 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants