
Large number of TimeoutExceptions caused by locking in OrchardCore.Environment.Shell.Scope.ShellScope+<ActivateShellInternalAsync> #14381

Closed
ShaneCourtrille opened this issue Sep 21, 2023 · 10 comments · Fixed by #14756

@ShaneCourtrille
Contributor

I'm mostly curious as to why an IDistributedLock is used when activating a ShellContext, since it means only one ShellContext can be activated at a time, no matter how many instances/tenants you have.

Shouldn't this always be a local lock, so that each instance can start a shell context independently of the other instances?

@jtkech
Member

jtkech commented Sep 21, 2023

Shell activation (which runs only once for a given shell) should be run atomically because, for example, it may execute database migrations. The timeout and expiration values are configurable through ShellContextOptions; the default values are 30 seconds.
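
To illustrate, here is a minimal sketch of that atomic activation pattern (not the actual OrchardCore source; it reuses the TryAcquireShellActivateLockAsync and IsActivated members shown later in this thread, and assumes the locker is async-disposable):

// A minimal sketch of the atomic activation pattern described above, not
// the actual OrchardCore source.
async Task ActivateOnceAsync(ShellContext shellContext)
{
    // Bounded by the ShellContextOptions timeout/expiration (30s by default).
    (var locker, var locked) = await shellContext.TryAcquireShellActivateLockAsync();
    if (!locked)
    {
        throw new TimeoutException(
            $"Failed to acquire a lock before activating the tenant: {shellContext.Settings.Name}");
    }

    await using (locker)
    {
        // Re-check under the lock: another instance may have won the race.
        if (!shellContext.IsActivated)
        {
            // The one-time work, e.g. database migrations, runs here exactly
            // once across all instances, then the shell is marked activated.
        }
    }
}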

@ShaneCourtrille
Contributor Author

@jtkech The problem we're running into is that we see this exception occurring for 40 minutes across all tenants (including Default) after a new instance starts up. I'm continuing to investigate whether these exceptions are the cause or a symptom of another problem.

@jtkech
Member

jtkech commented Sep 22, 2023

Okay, let me know.

Did you enable OrchardCore.Redis.Lock, which registers an IDistributedLock?

If yes, a lock normally auto-expires, with an expiration of 30s by default.

If no, we use an ILocalLock with an infinite expiration; in that case it would mean that something gets stuck in the shell activation.

@sebastienros sebastienros added this to the 1.x milestone Sep 28, 2023
@ShaneCourtrille
Contributor Author

@jtkech Yes, we're using Redis.

I've been able to recreate this locally with a single tenant just by hitting it with a burst of parallel requests (a k6 test using 200 VUs / 400 iterations). It works perfectly fine without Redis, but as soon as I enabled Redis I only had a 67% pass rate. The weird thing is that the entire test completes within 32 seconds. I wouldn't expect the lock timeout to matter so much with a single tenant, so I'm looking into this now.
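
For reference, a rough C# stand-in for that k6 test (the original k6 script isn't shown here; the tenant URL is an assumption, and this assumes a .NET 6+ top-level program with implicit usings):

// A rough C# equivalent of the k6 test described above: fire 400 total
// requests with up to 200 in flight. The tenant URL is an assumption.
using var http = new HttpClient();
using var gate = new SemaphoreSlim(200); // ~200 concurrent "virtual users"

var tasks = Enumerable.Range(0, 400).Select(async _ =>
{
    await gate.WaitAsync();
    try
    {
        using var response = await http.GetAsync("https://localhost:5001/");
        return response.IsSuccessStatusCode;
    }
    finally
    {
        gate.Release();
    }
});

var results = await Task.WhenAll(tasks);
var passRate = results.Count(ok => ok) * 100.0 / results.Length;
Console.WriteLine($"Pass rate: {passRate:F0}%");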

@ShaneCourtrille
Contributor Author

ShaneCourtrille commented Sep 29, 2023

@jtkech Interestingly enough, I was able to get a 100% pass rate using this simple change in the ShellScope.UsingAsync method. I'm going to do some more testing with this, but I wonder if you can see any problems with it?

// Swallow the lock timeout when the shell was already activated
// (e.g. by another request or instance); otherwise rethrow.
try
{
	if (activateShell)
	{
		await ActivateShellInternalAsync();
	}
}
catch (TimeoutException)
{
	if (!ShellContext.IsActivated)
	{
		throw;
	}
}

@jtkech
Member

jtkech commented Sep 29, 2023

Yes, we throw a TimeoutException if the lock can't be acquired.

// Try to acquire a lock before using a new scope, so that a next process ...
(var locker, var locked) = await ShellContext.TryAcquireShellActivateLockAsync();
if (!locked)
{
    throw new TimeoutException(
        $"Failed to acquire a lock before activating the tenant: {ShellContext.Settings.Name}");
}

So it looks like the shell activation on startup is taking too long. The default expiration and timeout are 30s; can you try increasing these values, for example to 60s?

shellServices.Configure<ShellContextOptions>(o =>
{
    o.ShellActivateLockTimeout = 60_000;
    o.ShellActivateLockExpiration = 60_000;
});

Otherwise, it's a good idea to only throw if the shell is not activated, for example:

// Try to acquire a lock before using a new scope, so that a next process ...
(var locker, var locked) = await ShellContext.TryAcquireShellActivateLockAsync();
if (!locked)
{
    if (ShellContext.IsActivated)
    {
        return;
    }

    throw new TimeoutException($"...");
}

Or maybe just increase the default values.

@jtkech
Member

jtkech commented Oct 1, 2023

Just tried it using the Blog recipe, a local Redis instance, and Redis.Lock enabled.

The test sent 1000 requests with a concurrency level of 100; I did not have any issues.

But it's still a good idea not to throw if the shell is activated, and to only log an error or warning instead.

@ShaneCourtrille
Contributor Author

@jtkech Local Redis might be why you didn't see an issue. I used an Azure Redis instance for my testing when I was able to recreate the problem locally.

The change is going into production tomorrow, so we'll be able to tell the impact next week, since we have to restart our instances twice weekly due to memory leakage. If it works, I can PR it for you since I'm already set up to do so and could use the practice. Let me know if you want that, or if you'll just PR it yourself because it's so simple and might only take you 10 min compared to my hour+ :D

@jtkech
Member

jtkech commented Oct 24, 2023

Okay cool, about the PR, no problem, whichever you prefer, just let me know.

Hmm, maybe also add a log message (maybe at the warning level) when the lock fails to be acquired.

And a comment ;)
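
Putting the two suggestions together, the guarded check might look like this (a sketch only, not the merged PR; the _logger field is an assumed ILogger):

// A sketch combining the suggestions above, not the merged PR itself:
// don't fail the request when the shell is already activated, just warn.
(var locker, var locked) = await ShellContext.TryAcquireShellActivateLockAsync();
if (!locked)
{
    if (ShellContext.IsActivated)
    {
        // Assumed ILogger field; another process already activated this shell.
        _logger.LogWarning(
            "Failed to acquire a lock, but the tenant '{TenantName}' is already activated.",
            ShellContext.Settings.Name);

        return;
    }

    throw new TimeoutException(
        $"Failed to acquire a lock before activating the tenant: {ShellContext.Settings.Name}");
}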

@jtkech
Member

jtkech commented Nov 26, 2023

@ShaneCourtrille

Okay, I retried it with the maximum number of concurrent requests I was able to generate and I could not repro, which is a good sign in a way ;)

We have retry logic for acquiring a lock where we increase the delay before each retry, the max delay being 10s.

What I think happens is that if there are at least 3 requests still waiting with this max retry delay, one may experience a timeout while waiting for its next retry.
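
Here is a simplified sketch of that retry loop (an assumed shape, not the actual OrchardCore implementation; TryAcquireOnceAsync is a hypothetical single acquisition attempt), showing how a waiter sitting in the 10s max delay can overshoot the 30s timeout:

// Simplified sketch of lock acquisition with growing retry delays, not the
// actual OrchardCore implementation.
var delay = TimeSpan.FromMilliseconds(100);
var maxDelay = TimeSpan.FromSeconds(10); // the max retry delay mentioned above
var timeout = TimeSpan.FromSeconds(30);  // the default ShellActivateLockTimeout
var started = DateTime.UtcNow;

// Hypothetical single acquisition attempt (e.g. one Redis SET NX call).
Task<bool> TryAcquireOnceAsync() => Task.FromResult(false);

while (!await TryAcquireOnceAsync())
{
    if (DateTime.UtcNow - started >= timeout)
    {
        throw new TimeoutException("Failed to acquire the lock within the timeout.");
    }

    // A request can start this wait just under the timeout and, with a 10s
    // delay, only observe the expiry once the delay completes.
    await Task.Delay(delay);

    // Double the delay, capped at the 10s maximum.
    delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
}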

I did a PR to apply our suggested change.

@MikeAlhayek MikeAlhayek modified the milestones: 1.x, 1.8 Jan 2, 2024