Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nonoptimal database query #1169

Open
thefrozenfruit opened this issue May 19, 2023 · 0 comments
Open

Nonoptimal database query #1169

thefrozenfruit opened this issue May 19, 2023 · 0 comments

Comments

@thefrozenfruit
Copy link

Describe the bug

EntityFrameworkPersistenceProvider class methods for retrieving WorkflowInstance do not generate optimal database queries.

As an example, this code generates a query, which leads to duplicate data from Workflow table.

Code:

public async Task<WorkflowInstance> GetWorkflowInstance(string Id, CancellationToken cancellationToken = default)
{
    using (var db = ConstructDbContext())
    {
        var uid = new Guid(Id);
        var raw = await db.Set<PersistedWorkflow>()
            .Where(x => x.InstanceId == uid)
            .Include(wf => wf.ExecutionPointers)
            .ThenInclude(ep => ep.ExtensionAttributes)
            .Include(wf => wf.ExecutionPointers)
            .FirstAsync(cancellationToken);

        if (raw == null)
            return null;

        return raw.ToWorkflowInstance();
    }
}

Query:

SELECT
  [t].[PersistenceId],
  [t].[CompleteTime],
  [t].[CreateTime],
  [t].[Data], -- this is the cause of large dataflow (duplication)
  [t].[Description],
  [t].[InstanceId],
  [t].[NextExecution],
  [t].[Reference],
  [t].[Status],
  [t].[Version],
  [t].[WorkflowDefinitionId],
  [t0].[PersistenceId],
  [t0].[Active],
  [t0].[Children],
  [t0].[ContextItem],
  [t0].[EndTime],
  [t0].[EventData],
  [t0].[EventKey],
  [t0].[EventName],
  [t0].[EventPublished],
  [t0].[Id],
  [t0].[Outcome],
  [t0].[PersistenceData],
  [t0].[PredecessorId],
  [t0].[RetryCount],
  [t0].[Scope],
  [t0].[SleepUntil],
  [t0].[StartTime],
  [t0].[Status],
  [t0].[StepId],
  [t0].[StepName],
  [t0].[WorkflowId],
  [t0].[PersistenceId0],
  [t0].[AttributeKey],
  [t0].[AttributeValue],
  [t0].[ExecutionPointerId]
FROM (
    SELECT TOP(1)
        [w].[PersistenceId],
        [w].[CompleteTime],
        [w].[CreateTime],
        [w].[Data],
        [w].[Description],
        [w].[InstanceId],
        [w].[NextExecution],
        [w].[Reference],
        [w].[Status],
        [w].[Version],
        [w].[WorkflowDefinitionId]
    FROM [wfc].[Workflow] AS [w]
    WHERE [w].[InstanceId] = {uid}
) AS [t]
LEFT JOIN (
    SELECT
        [e].[PersistenceId],
        [e].[Active],
        [e].[Children],
        [e].[ContextItem],
        [e].[EndTime],
        [e].[EventData],
        [e].[EventKey],
        [e].[EventName],
        [e].[EventPublished],
        [e].[Id],
        [e].[Outcome],
        [e].[PersistenceData],
        [e].[PredecessorId],
        [e].[RetryCount],
        [e].[Scope],
        [e].[SleepUntil],
        [e].[StartTime],
        [e].[Status],
        [e].[StepId],
        [e].[StepName],
        [e].[WorkflowId],
        [e0].[PersistenceId] AS [PersistenceId0],
        [e0].[AttributeKey],
        [e0].[AttributeValue],
        [e0].[ExecutionPointerId]
    FROM [wfc].[ExecutionPointer] AS [e]
    LEFT JOIN [wfc].[ExtensionAttribute] AS [e0] ON [e].[PersistenceId] = [e0].[ExecutionPointerId]
) AS [t0] ON [t].[PersistenceId] = [t0].[WorkflowId]
ORDER BY [t].[PersistenceId], [t0].[PersistenceId], [t0].[PersistenceId0];

Our case contains a large amount of information in the Data column. It leads to performance degradation due to the large amount of duplicate data when the query is executed.

To avoid this side effect, you can split the query.

Code:

var raw = await db.Set<PersistedWorkflow>()
    .Where(x => x.InstanceId == uid)
    .Include(wf => wf.ExecutionPointers)
    .ThenInclude(ep => ep.ExtensionAttributes)
    .Include(wf => wf.ExecutionPointers)
    .AsSplitQuery() // new line
    .FirstOrDefaultAsync(cancellationToken);

Queries:

SELECT TOP(1)
    [w].[PersistenceId],
    [w].[CompleteTime],
    [w].[CreateTime],
    [w].[Data],
    [w].[Description],
    [w].[InstanceId],
    [w].[NextExecution],
    [w].[Reference],
    [w].[Status],
    [w].[Version],
    [w].[WorkflowDefinitionId]
FROM [wfc].[Workflow] AS [w]
WHERE [w].[InstanceId] = {uid}
ORDER BY [w].[PersistenceId]
SELECT
    [e0].[PersistenceId],
    [e0].[AttributeKey],
    [e0].[AttributeValue],
    [e0].[ExecutionPointerId],
    [t].[PersistenceId],
    [e].[PersistenceId]
FROM (
    SELECT TOP(1) [w].[PersistenceId]
    FROM [wfc].[Workflow] AS [w]
    WHERE [w].[InstanceId] = {uid}
) AS [t]
INNER JOIN [wfc].[ExecutionPointer] AS [e] ON [t].[PersistenceId] = [e].[WorkflowId]
INNER JOIN [wfc].[ExtensionAttribute] AS [e0] ON [e].[PersistenceId] = [e0].[ExecutionPointerId]
ORDER BY [t].[PersistenceId], [e].[PersistenceId]
SELECT
    [e].[PersistenceId],
    [e].[Active],
    [e].[Children],
    [e].[ContextItem],
    [e].[EndTime],
    [e].[EventData],
    [e].[EventKey],
    [e].[EventName],
    [e].[EventPublished],
    [e].[Id],
    [e].[Outcome],
    [e].[PersistenceData],
    [e].[PredecessorId],
    [e].[RetryCount],
    [e].[Scope],
    [e].[SleepUntil],
    [e].[StartTime],
    [e].[Status],
    [e].[StepId],
    [e].[StepName],
    [e].[WorkflowId],
    [t].[PersistenceId]
FROM (
    SELECT TOP(1) [w].[PersistenceId]
    FROM [wfc].[Workflow] AS [w]
    WHERE [w].[InstanceId] = {uid}
) AS [t]
INNER JOIN [wfc].[ExecutionPointer] AS [e] ON [t].[PersistenceId] = [e].[WorkflowId]
ORDER BY [t].[PersistenceId], [e].[PersistenceId]

Test

From SSMS

[Data] size = 507kB;
Execution pointer count = 1000;

Impl [Data] total size Execution time
left join (current) 507 000 kB ~ 6m36s520ms
split queries 507kB ~ 1s300ms

From external source

Count - number of execution pointers;
DataSize - the size of the information in the Data column;

SingleQuery - single query (current impl);
SplitQuery - split queries;

BenchmarkDotNet=v0.13.5, OS=Windows 10 (10.0.19045.2965/22H2/2022Update)
AMD Ryzen 7 4800HS with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 6.0.11 (6.0.1122.52304), X64 RyuJIT AVX2
  Job-DFXIRM : .NET 6.0.11 (6.0.1122.52304), X64 RyuJIT AVX2

IterationCount=3  LaunchCount=1  WarmupCount=1  
Method Id Count DataSize Mean Error StdDev Ratio Gen0 Gen1 Gen2 Allocated Alloc Ratio
SingleQuery 6A8F3(...)8476E [36] 108 7.71 kB 2,571.3 ms 735.28 ms 40.30 ms 1.00 7000.0000 - - 14.83 MB 1.00
SplitQuery 6A8F3(...)8476E [36] 108 7.71 kB 269.0 ms 4.84 ms 0.27 ms 0.10 500.0000 - - 2.25 MB 0.15
SingleQuery C0EDE(...)29FBD [36] 210 14.81 kB 9,141.0 ms 579.62 ms 31.77 ms 1.00 34000.0000 2000.0000 - 72.47 MB 1.00
SplitQuery C0EDE(...)29FBD [36] 210 14.81 kB 408.3 ms 34.98 ms 1.92 ms 0.04 1000.0000 - - 4.42 MB 0.06
SingleQuery C10E8(...)F4DBE [36] 333 23.38 kB 22,677.3 ms 5,308.82 ms 290.99 ms 1.00 69000.0000 68000.0000 67000.0000 242.66 MB 1.00
SplitQuery C10E8(...)F4DBE [36] 333 23.38 kB 633.7 ms 34.22 ms 1.88 ms 0.03 1000.0000 - - 7.12 MB 0.03
SingleQuery 4B444(...)B85E5 [36] 503 211.38 kB 308,729.0 ms 67,444.26 ms 3,696.85 ms 1.00 4583000.0000 4528000.0000 4453000.0000 22366.06 MB 1.000
SplitQuery 4B444(...)B85E5 [36] 503 211.38 kB 4,273.6 ms 480.21 ms 26.32 ms 0.01 19000.0000 15000.0000 9000.0000 69.83 MB 0.003

Expected behavior

Fetch data without unnecessary degradation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant