Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TokenCleanupService creates a sub optimal query plan with an expensive keylookup #1304

Closed
pregress opened this issue Jun 20, 2024 · 7 comments

Comments

@pregress
Copy link

pregress commented Jun 20, 2024

Which version of Duende IdentityServer are you using?
7.0.5

Which version of .NET are you using?
.NET 8

Describe the bug

https://github.com/DuendeSoftware/IdentityServer/blob/e9860c6488f90e8fbc11a4452b9dd111dbfae933/src/EntityFramework.Storage/TokenCleanup/TokenCleanupService.cs#L92

Fetches the whole PersistedGrant object for a cleanup, This causes very expensive key lookups.
See execution plan: https://www.brentozar.com/pastetheplan/?id=SyUSxtbI0
This could be fixed by using the existing index correctly IX_PersistedGrants_Expiration if only Id & Expiration are returned.
But these requires a change to the public api of IOperationalStoreNotification since this works with an IEnumerable<PersistedGrant>

The fix would be:

ar query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.Expiration < DateTime.UtcNow)
                .OrderBy(x => x.Expiration)
                .Select(x => new { x.Id, x.Expiration });
// ..
if (_operationalStoreNotification != null)
{
    await _operationalStoreNotification.PersistedGrantsRemovedAsync(expiredGrants); // <= different type maybe or seperate implementation if _operationalStoreNotification is null. 
}

To Reproduce

  1. Have a lot of PersistedGrants
  2. Run the token cleanup
  3. Check query store.

Expected behavior

Cleanup should not produce an execution plan with a key lookup.

Log output/exception with stacktrace
N/A

Additional context
Adding include columns to the existing index would also solve this, but then it's just a copy of the whole table.

Query plan also contains the openjson (DuendeSoftware/products#1564) which also screws with the estimated number of rows..., while the estimated number of rows should always be equal to the batch size. In our case it's not a big deal, but could imagine that this would go wrong.

@AndersAbel
Copy link
Member

Thank you for sharing the details of the performance of this query. We rewrote the token cleanup for v7 with the intention to fix race conditions and also make it more performant.

But we do have some limitations:

But these requires a change to the public api of IOperationalStoreNotification since this works with an IEnumerable

We are very careful when introducing breaking changes. We considered the fix you propose to not have to load the full entities, but decided that it didn't motivate a breaking change. The fix we considered was to check if there are any subscribers to the notifications and if not, simply not load the entities at all.

It is very valuable for us to get the actual performance numbers and the query plan details. The conclusions I make from this is that:

  1. The query utilizes the index as expected which is a very efficient operation.
  2. A clustered index key lookup is usually a fast operation, so while it takes 99% of the cost in this query, it might indicate that the index scan is even faster.

Do you have any actual durations for this, i.e. what is the query duration in milliseconds?

Any optimizations to this query would also have to take the next query into account. This query will bring all the relevant database pages into RAM and CPU cache. The next query deletes the data and that operation requires access to the database pages too. That next query would benefit from the database pages already being in RAM/cache memory.

Regarding the OpenJSON issue that is really not an IdentityServer issue. It is an incompatibility between EF Core and SQL Server. If it would be a major performance issue we would still have to handle it. But our estimate this far has been that the way our queries are written the OpenJSON row estimate issue should not be an issue.

If you have a query plan for the deletion query that would be valuable to see to understand how OpenJSON is handled.

@pregress
Copy link
Author

I agree, the query uses the index, and this is efficient, but the key look up make the query plan suboptimal.

Here are the performance metrics for:

Delete expired with subquery

(@__First_Expiration_0 datetime2(7),@__Last_Expiration_1 datetime2(7),@__foundIds_2 nvarchar(4000))DELETE FROM [p]
FROM [PersistedGrants] AS [p]
WHERE [p].[Id] IN (
    SELECT [p0].[Id]
    FROM [PersistedGrants] AS [p0]
    WHERE [p0].[Expiration] < GETUTCDATE() AND [p0].[Expiration] >= @__First_Expiration_0 AND [p0].[Expiration] <= @__Last_Expiration_1 AND [p0].[Id] IN (
        SELECT [f].[value]
        FROM OPENJSON(@__foundIds_2) WITH ([value] bigint '$') AS [f]
    )
)
INTERVAL CPU[%] DATA IO[%] LOG IO[%] DURATION[hh:mm:ss] EXECUTIONS COUNT
5/26 - 5/27 0.06 0.61 0.17 00:09:40.120 444
5/27 - 5/28 0.06 0.55 0.16 00:13:57.300 403
5/28 - 5/29 0.07 0.7 0.16 00:14:22.730 420
5/29 - 5/30 0.08 0.73 0.18 00:14:30.200 455
5/30 - 5/31 0.09 0.98 0.19 00:19:15.630 474
5/31 - 6/1 0.09 1.12 0.19 00:43:22.420 460
6/1 - 6/2 0.08 0.83 0.15 00:21:14.860 374
6/2 - 6/3 0.06 0.82 0.08 00:51:13.220 203
6/3 - 6/4 0.11 1.14 0.27 00:29:57.910 692

Here is the execution plan for the actual delete query: https://www.brentozar.com/pastetheplan/?id=B1SS51dL0
Also containing a table spool although it's only a 6% cost, I don't think there is anything wrong with the actual delete. Writing is expensive on slow azure storage.
https://erikdarling.com/optimizing-sql-server-query-plans-with-table-spools-in-them/

Select expired records (with expensive keylookup)

INTERVAL CPU[%] DATA IO[%] LOG IO[%] DURATION[hh:mm:ss] EXECUTIONS COUNT
5/26 - 5/27 0.01 0.26 0 00:02:48.630 444
5/27 - 5/28 0.01 0.22 0 00:02:59.690 404
5/28 - 5/29 0.01 0.14 0 00:01:59.950 420
5/29 - 5/30 0.02 0.33 0 00:03:45.790 456
5/30 - 5/31 0.01 0.19 0 00:02:20.290 475
5/31 - 6/1 0.01 0.23 0 00:04:29.770 461
6/1 - 6/2 0.01 0.23 0 00:02:54.680 376
6/2 - 6/3 0 0.05 0 00:01:36.050 205
6/3 - 6/4 0.02 0.48 0 00:06:05.070 692
6/4 - 6/5 0.02 0.34 0 00:03:40.900 400

Queryplan: https://www.brentozar.com/pastetheplan/?id=SyUSxtbI0

We are running on Azure SQL, so every unnecessary page retrieval (eg: a key lookup for data that is not used by X% of the implementations) is always bad for performance. We scaled our azure sql temporary to mitigate this issue.

@pregress pregress changed the title TokenCleanupService doesn't fully utilize the existing database indexes. TokenCleanupService creates a sub optimal query plan with an expensive keylookup Jun 25, 2024
@AndersAbel
Copy link
Member

Thanks for sharing the numbers. I did some calculations to understand how much of the total DB time is caused by the select. For example the first line, the delete is 580 seconds, the select is 168 seconds. The total operation time is 748. The select makes up 22% of the total DB time.

If I get this right, the select only makes up 11% of the total DB time. I don't think it's worth spending more time optimizing this on the EF abstraction layer.

For high volume/performance deployments we do recommend tuning the database with the features available on the DB engine used and possibly replacing/reworking the queries to be optimal with the specific database.

If you are using SQL Server I think that you could possibly get performance gains by changing the indexes:

  1. Change the clustered index to be the expiry time.
  2. Create a new unique index on the key, which includes all columns.

This would make the delete operations efficient as they would be a range scan on the beginning of the clustered index. The read-by-key operations would be able be completed from the index only, without having to do any lookup on the clustered index. The drawback would be increased storage requirements and probably slower write/insert. As with any DB tuning you would have to test and measure the performance on your specific environment.

@AndersAbel
Copy link
Member

I noticed now that the execution plans looks like being the estimated execution plans. Could you please share the actual execution plans as that would be needed to find out if the OpenJSON problem gives a performance issue here.

@pregress
Copy link
Author

pregress commented Jul 3, 2024

We will implement our own TokenCleanupService that utilizes the existing indexes.

Regarding the actual execution plan, we don't have the actual plan in query store.

@pregress pregress closed this as completed Jul 3, 2024
@pregress
Copy link
Author

pregress commented Jul 5, 2024

For those needing the implementation.

public class FastTokenCleanUpService : TokenCleanupService
{
    private readonly OperationalStoreOptions _options;
    private readonly IPersistedGrantDbContext _persistedGrantDbContext;
    private readonly ILogger<TokenCleanupService> _logger;

    public FastTokenCleanUpService(OperationalStoreOptions options,
        IPersistedGrantDbContext persistedGrantDbContext,
        ILogger<TokenCleanupService> logger,
        IOperationalStoreNotification operationalStoreNotification = null) : base(options, persistedGrantDbContext, logger, operationalStoreNotification)
    {
        _options = options;
        _persistedGrantDbContext = persistedGrantDbContext;
        _logger = logger;

        if (operationalStoreNotification != null)
        {
            throw new NotSupportedException(
                "FastTokenCleanUpService should not be used in combination with operationalStoreNotification, fall back on the default identity behavior");
        }
    }

    protected override async Task RemoveExpiredPersistedGrantsAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.Expiration < DateTime.UtcNow);

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {grantCount} expired grants", found);
        }
    }

    protected override async Task RemoveConsumedPersistedGrantsAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        var delay = TimeSpan.FromSeconds(_options.ConsumedTokenCleanupDelay);
        var consumedTimeThreshold = DateTime.UtcNow.Subtract(delay);

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.ConsumedTime < consumedTimeThreshold);

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {grantCount} consumed grants", found);
        }
    }

    protected override async Task RemoveDeviceCodesAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.DeviceFlowCodes
                .Where(x => x.Expiration < DateTime.UtcNow);

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {deviceCodeCount} device flow codes", found);
        }
    }
}


// program.cs OR startup.cs
// after AddOperationalStore

    var serviceDescriptor = builder.Services.Single(descriptor => descriptor.ServiceType == typeof(ITokenCleanupService));
    builder.Services.Remove(serviceDescriptor);
    builder.Services.AddTransient<ITokenCleanupService, FastTokenCleanUpService>();

@AndersAbel
Copy link
Member

It looks like you have found a solution that you are happy with and that is enough to close the issue. If anyone else runs into similar issues, please open a new issue. With more profiling data we would be happy to investigate how our implementation can be improved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants