
Large number of TimeoutExceptions caused by locking in OrchardCore.Environment.Shell.Scope.ShellScope+<ActivateShellInternalAsync> #14381

Closed
ShaneCourtrille opened this issue Sep 21, 2023 · 10 comments · Fixed by #14756

@ShaneCourtrille
Contributor

I'm mostly curious as to why an IDistributedLock is used when activating a ShellContext, since it means only one ShellContext can be activated at a time, no matter how many instances/tenants you have.

Shouldn't this always be a local lock, so that each instance can start a shell context independently of the other instances?

@jtkech
Member

jtkech commented Sep 21, 2023

Shell activation (which runs only once for a given shell) should be run atomically because, for example, it may execute database migrations. The timeout and expiration values are configurable through ShellContextOptions; the default values are 30 seconds.
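
To illustrate, here is a minimal sketch of that atomic activation pattern (not the actual OrchardCore source; it reuses the TryAcquireShellActivateLockAsync and IsActivated members shown later in this thread, and assumes the locker is async-disposable):

// A minimal sketch of the atomic activation pattern described above, not
// the actual OrchardCore source.
async Task ActivateOnceAsync(ShellContext shellContext)
{
    // Bounded by the ShellContextOptions timeout/expiration (30s by default).
    (var locker, var locked) = await shellContext.TryAcquireShellActivateLockAsync();
    if (!locked)
    {
        throw new TimeoutException(
            $"Failed to acquire a lock before activating the tenant: {shellContext.Settings.Name}");
    }

    await using (locker)
    {
        // Re-check under the lock: another instance may have won the race.
        if (!shellContext.IsActivated)
        {
            // The one-time work, e.g. database migrations, runs here exactly
            // once across all instances, then the shell is marked activated.
        }
    }
}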

@ShaneCourtrille
Contributor Author

@jtkech The problem we're running into is that we see this exception occurring for 40 minutes across all tenants (including Default) after a new instance starts up. I'm continuing to investigate whether these exceptions are the cause or a symptom of another problem.

@jtkech
Member

jtkech commented Sep 22, 2023

Okay, let me know.

Did you enable OrchardCore.Redis.Lock, which registers an IDistributedLock?

If yes, a lock normally auto-expires, with an expiration of 30s by default.

If no, we use an ILocalLock with an infinite expiration; in that case it would mean that something gets stuck in the shell activation.

@sebastienros sebastienros added this to the 1.x milestone Sep 28, 2023
@ShaneCourtrille
Contributor Author

@jtkech Yes, we're using Redis.

I've been able to recreate this locally with a single tenant just by hitting it with a burst of parallel requests (a k6 test using 200 VUs / 400 iterations). It works perfectly fine without Redis, but as soon as I enabled Redis I only had a 67% pass rate. The weird thing is that the entire test completes within 32 seconds. I wouldn't expect the lock timeout to matter so much with a single tenant, so I'm looking into this now.
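
For reference, a rough C# stand-in for that k6 test (the original k6 script isn't shown here; the tenant URL is an assumption, and this assumes a .NET 6+ top-level program with implicit usings):

// A rough C# equivalent of the k6 test described above: fire 400 total
// requests with up to 200 in flight. The tenant URL is an assumption.
using var http = new HttpClient();
using var gate = new SemaphoreSlim(200); // ~200 concurrent "virtual users"

var tasks = Enumerable.Range(0, 400).Select(async _ =>
{
    await gate.WaitAsync();
    try
    {
        using var response = await http.GetAsync("https://localhost:5001/");
        return response.IsSuccessStatusCode;
    }
    finally
    {
        gate.Release();
    }
});

var results = await Task.WhenAll(tasks);
var passRate = results.Count(ok => ok) * 100.0 / results.Length;
Console.WriteLine($"Pass rate: {passRate:F0}%");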

@ShaneCourtrille
Contributor Author

ShaneCourtrille commented Sep 29, 2023

@jtkech Interestingly enough, I was able to get a 100% pass rate using this simple change in the ShellScope.UsingAsync method. I'm going to do some more testing with this, but I wonder if you can see any problems with it?

// Swallow the lock timeout when the shell was already activated
// (e.g. by another request or instance); otherwise rethrow.
try
{
	if (activateShell)
	{
		await ActivateShellInternalAsync();
	}
}
catch (TimeoutException)
{
	if (!ShellContext.IsActivated)
	{
		throw;
	}
}

@jtkech
Member

jtkech commented Sep 29, 2023

Yes, we throw a TimeoutException if the lock can't be acquired.

// Try to acquire a lock before using a new scope, so that a next process ...
(var locker, var locked) = await ShellContext.TryAcquireShellActivateLockAsync();
if (!locked)
{
    throw new TimeoutException(
        $"Failed to acquire a lock before activating the tenant: {ShellContext.Settings.Name}");
}

So it looks like the shell activation on startup is taking too long. The default expiration and timeout are 30s; can you try increasing these values, for example to 60s?

shellServices.Configure<ShellContextOptions>(o =>
{
    o.ShellActivateLockTimeout = 60_000;
    o.ShellActivateLockExpiration = 60_000;
});

Otherwise, it's a good idea to only throw if the shell is not activated, for example:

// Try to acquire a lock before using a new scope, so that a next process ...
(var locker, var locked) = await ShellContext.TryAcquireShellActivateLockAsync();
if (!locked)
{
    if (ShellContext.IsActivated)
    {
        return;
    }

    throw new TimeoutException($"...");
}

Or maybe just increase the default values.

@jtkech
Member

jtkech commented Oct 1, 2023

Just tried it using the Blog recipe, a local Redis instance, and Redis.Lock enabled.

The test sent 1000 requests with a concurrency level of 100; I did not have any issues.

But it's still a good idea not to throw if the shell is activated, and to only log an error or warning instead.

@ShaneCourtrille
Contributor Author

@jtkech Local Redis might be why you didn't see an issue. I used an Azure Redis instance for my testing when I was able to recreate the problem locally.

The change is going into production tomorrow, so we'll be able to tell the impact next week, since we have to restart our instances twice weekly due to memory leakage. If it works, I can PR it for you since I'm already set up to do so and could use the practice. Let me know if you want that, or if you'll just PR it yourself because it's so simple and might only take you 10 min compared to my hour+ :D

@jtkech
Member

jtkech commented Oct 24, 2023

Okay cool, about the PR, no problem, whichever you prefer, just let me know.

Hmm, maybe also add a log message (maybe at the warning level) when the lock fails to be acquired.

And a comment ;)
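
Putting the two suggestions together, the guarded check might look like this (a sketch only, not the merged PR; the _logger field is an assumed ILogger):

// A sketch combining the suggestions above, not the merged PR itself:
// don't fail the request when the shell is already activated, just warn.
(var locker, var locked) = await ShellContext.TryAcquireShellActivateLockAsync();
if (!locked)
{
    if (ShellContext.IsActivated)
    {
        // Assumed ILogger field; another process already activated this shell.
        _logger.LogWarning(
            "Failed to acquire a lock, but the tenant '{TenantName}' is already activated.",
            ShellContext.Settings.Name);

        return;
    }

    throw new TimeoutException(
        $"Failed to acquire a lock before activating the tenant: {ShellContext.Settings.Name}");
}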

@jtkech
Member

jtkech commented Nov 26, 2023

@ShaneCourtrille

Okay, I retried it with the maximum number of concurrent requests I was able to generate and I could not repro, which is a good sign in a way ;)

We have retry logic for acquiring a lock where we increase the delay before each retry, the max delay being 10s.

What I think happens is that if there are at least 3 requests still waiting with this max retry delay, one may experience a timeout while waiting for its next retry.
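
Here is a simplified sketch of that retry loop (an assumed shape, not the actual OrchardCore implementation; TryAcquireOnceAsync is a hypothetical single acquisition attempt), showing how a waiter sitting in the 10s max delay can overshoot the 30s timeout:

// Simplified sketch of lock acquisition with growing retry delays, not the
// actual OrchardCore implementation.
var delay = TimeSpan.FromMilliseconds(100);
var maxDelay = TimeSpan.FromSeconds(10); // the max retry delay mentioned above
var timeout = TimeSpan.FromSeconds(30);  // the default ShellActivateLockTimeout
var started = DateTime.UtcNow;

// Hypothetical single acquisition attempt (e.g. one Redis SET NX call).
Task<bool> TryAcquireOnceAsync() => Task.FromResult(false);

while (!await TryAcquireOnceAsync())
{
    if (DateTime.UtcNow - started >= timeout)
    {
        throw new TimeoutException("Failed to acquire the lock within the timeout.");
    }

    // A request can start this wait just under the timeout and, with a 10s
    // delay, only observe the expiry once the delay completes.
    await Task.Delay(delay);

    // Double the delay, capped at the 10s maximum.
    delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
}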

I did a PR to apply our suggested change.

@MikeAlhayek MikeAlhayek modified the milestones: 1.x, 1.8 Jan 2, 2024