CachedGrainLocator Unregister Ordering #8547
Conversation
… grain directory so we stop serving bad entries as soon as possible
I fear that it will introduce a race condition with
It is possible we clear a correct entry from the cache, but that tradeoff seems worthwhile compared to the alternative of serving a wrong entry from the cache for a longer period of time. I also believe this same race exists today: two separate requests can end up on the wrong silo after a grain moves. If one is slower than the other, the first could return, clear the entry, and retry, causing a lookup to populate the cache with the new correct value; then the second call returns and clears the good entry, thinking it is wrong. Or is there a separate race you are thinking about?
Sorry if I wasn't clear, but my fear is that you have two calls in parallel: one lookup and one unregistration.
Now you have a poisoned entry in the cache. I wonder if we should just try to clear the cache before AND after the directory unregistration call? Maybe the best thing would be to "block" the lookup call when there is a concurrent unregistration call targeting the same activation.
From my recollection of this path (correct me if the below is wrong), the race you mention above behaves very similarly to today. Today the cache is essentially poisoned while we wait for the async unregister to finish, and only then do we remove the poisoned entry.

The benefit of clearing before instead of after is that, when the grain was deactivated gracefully, the racing lookup will actually return the correct result (since deactivation would have cleared the entry from the directory), and the async Unregister call should be a no-op. Only when the grain was deactivated ungracefully (silo crash scenario) would the directory return the wrong answer, causing us to temporarily cache the wrong entry again. If this happens it will behave the same as today, where another few calls can hit the wrong silo until we clear the cache again after discovering we went to the wrong silo.

I think modifying it to something like this should strike a good balance between ensuring we don't re-insert a bad entry into the cache and minimizing extra full lookups:
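A minimal sketch of that ordering, using hypothetical stand-in types (`IGrainDirectory`, `CachedLocatorSketch`) rather than Orleans' actual CachedGrainLocator internals: the cache entry is dropped before the awaited directory call and again after it completes, so a racing lookup that re-caches a stale directory answer is also cleaned up.

```csharp
// Hypothetical sketch only: these types stand in for Orleans' cached grain
// locator and grain directory; they are not the real Orleans API.
using System.Collections.Concurrent;
using System.Threading.Tasks;

public interface IGrainDirectory
{
    Task Unregister(string grainId, string activationAddress);
    Task<string?> Lookup(string grainId);
}

public sealed class CachedLocatorSketch
{
    private readonly IGrainDirectory _directory;
    private readonly ConcurrentDictionary<string, string> _cache = new();

    public CachedLocatorSketch(IGrainDirectory directory) => _directory = directory;

    public async Task Unregister(string grainId, string knownAddress)
    {
        // 1. Drop the cached entry immediately so concurrent callers stop being
        //    routed to the (possibly dead) activation while the directory call
        //    is still in flight.
        _cache.TryRemove(grainId, out _);

        // 2. Remove the registration from the grain directory. Under high
        //    directory latency this is the slow part.
        await _directory.Unregister(grainId, knownAddress);

        // 3. Clear the cache again: a Lookup that raced with step 2 may have
        //    re-populated the cache from a stale directory entry (ungraceful
        //    deactivation), so remove whatever landed there in the meantime.
        _cache.TryRemove(grainId, out _);
    }

    public async Task<string?> Lookup(string grainId)
    {
        if (_cache.TryGetValue(grainId, out var cached))
            return cached;

        var address = await _directory.Lookup(grainId);
        if (address is not null)
            _cache[grainId] = address;
        return address;
    }
}
```

In the graceful-deactivation case, step 2 is effectively a no-op and the racing lookup already gets the fresh answer; only an ungraceful crash leaves a window where a wrong entry can be cached again, which is the same exposure as today.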
Let me know if the above change sounds good and I'll update the code and tests to reflect it.
I think it does, yes.
…t might re-add to cache is cleared when Unregister is done
@benjaminpetit PTAL
When the grain directory experiences high latency, it's possible that we hold a bad cached value in memory longer than desired. This can result in silos that retry due to NonExistentActivation exceptions continuing to retry against the same silo that does not own the grain. Instead, if we remove the entry from the local cache first, then no matter how long the Unregister takes, other calls will be able to re-read the grain directory and find the new grain location.
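As a rough illustration of that claim, building on the hypothetical CachedLocatorSketch above (again, not the real Orleans types): even while a slow Unregister is still awaiting the directory, a concurrent lookup falls through to the directory and picks up the grain's new location instead of the stale cached one.

```csharp
// Usage sketch for the hypothetical types above; the 5-second delay simulates
// a grain directory under high latency.
using System;
using System.Threading.Tasks;

public sealed class SlowDirectory : IGrainDirectory
{
    public string? CurrentAddress;

    public async Task Unregister(string grainId, string activationAddress) =>
        await Task.Delay(TimeSpan.FromSeconds(5));

    public Task<string?> Lookup(string grainId) => Task.FromResult(CurrentAddress);
}

public static class Demo
{
    public static async Task Main()
    {
        var directory = new SlowDirectory { CurrentAddress = "silo-A" };
        var locator = new CachedLocatorSketch(directory);

        Console.WriteLine(await locator.Lookup("grain-1")); // "silo-A" is now cached

        // The grain moves: the directory points at silo-B, the cached value is stale.
        directory.CurrentAddress = "silo-B";

        // Start the slow unregistration of the old address without waiting for it.
        var unregister = locator.Unregister("grain-1", "silo-A");

        // The cache was cleared before the directory call was awaited, so this
        // re-reads the directory and finds silo-B instead of retrying silo-A.
        Console.WriteLine(await locator.Lookup("grain-1")); // "silo-B"

        await unregister;
    }
}
```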