
Transaction deadlocks within a simple DB model #14371

Closed
mareklinka opened this issue Jan 10, 2019 · 9 comments · Fixed by #25952
Labels: area-save-changes, closed-fixed (the issue has been fixed and is/will be included in the release indicated by the issue milestone), customer-reported, punted-for-3.0, type-bug
Milestone: 6.0.0

Comments

@mareklinka commented Jan 10, 2019

Hello all. I'm facing an extremely weird issue with EF Core 2.2 (also briefly tested against 2.0). A little bit of backstory:

We received a bug report from production for a project I'm currently working on. After investigating, I discovered that the application sometimes encounters transaction deadlocks under certain specific conditions. I tried to figure out the root cause for over a day, unsuccessfully.

Then I decided to create a minimal repro of the problem, which led me here. Here's the situation:

Environment:
  • Windows 10 1809 x64
  • SQL Server Express 2017 with the latest patches
  • EF Core 2.2 on .NET Core 2.2

Technical details
I have a very simple DB model. It's basically a three-level tree structure, with the individual roots being disjoint (they don't share any entities). We modify the trees in transactions using the ReadCommitted isolation level. The error occurs when our ASP.NET Core web app receives two requests to modify two different trees AND the transactions run for over 600 ms. The deadlock only occurs if we delete some entities from the tree.

Under these conditions, the two operations sometimes cause a deadlock, resulting in an exception and one request failing. But since these trees share no content, operations on them should not, in my opinion, produce any deadlocks.

I managed to reduce the problem to a unit test project, which I'm attaching. I even managed to find a workaround of sorts: it turns out that if I switch one of the foreign keys in the model from Cascade to Restrict (and then delete the child entities manually), the deadlocks stop happening.
This is mentioned in the Solution 2 comment in the file DatabaseContext.cs in my demo.
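
For readers without the attachment, here is a minimal sketch of what the Solution 2 change could look like in OnModelCreating. The entity shapes and navigation names are assumptions based on the table and column names that appear later in this thread, not the exact code from the zip:

using System.Collections.Generic;
using Microsoft.EntityFrameworkCore;

public class StateItem
{
    public long Id { get; set; }
    public ICollection<StateDetail> Details { get; set; } = new List<StateDetail>();
}

public class StateDetail
{
    public long Id { get; set; }
    public StateItem StateItem { get; set; }
}

public class DatabaseContext : DbContext
{
    public DbSet<StateItem> StateItems { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Sketch of Solution 2: map the StateItem -> StateDetail foreign key with
        // Restrict instead of the conventional Cascade; the StateDetail children are
        // then deleted explicitly in code before their parent StateItem is removed.
        modelBuilder.Entity<StateDetail>()
            .HasOne(d => d.StateItem)
            .WithMany(s => s.Details)
            .OnDelete(DeleteBehavior.Restrict);
    }
}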

Another, even weirder solution is to comment out a property of an entity. The code then works without deadlocks, even with Cascade deletes. This is in the file StateItem.cs as Solution 1.

I see no reason why any of the workarounds should prevent a database deadlock. Could someone please have a look at the code and try to explain what's going on in there? As I mentioned, this came from production and I'd like to resolve it sooner rather than later, so any insight is greatly appreciated.

The repro code
DeadlockRepro.zip

---- System.InvalidOperationException : An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure()' to the 'UseSqlServer' call.
-------- Microsoft.EntityFrameworkCore.DbUpdateException : An error occurred while updating the entries. See the inner exception for details.
------------ System.Data.SqlClient.SqlException : Transaction (Process ID 55) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
   at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
   at DeadlockRepro.UnitTest1.Test1() in C:\Users\Marek\source\repos\DeadlockRepro\UnitTest1.cs:line 33
----- Inner Stack Trace -----
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IReadOnlyList`1 entriesToSave, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
   at DeadlockRepro.UnitTest1.CauseDeadlock(Int32 id) in C:\Users\Marek\source\repos\DeadlockRepro\UnitTest1.cs:line 64
----- Inner Stack Trace -----
   at Microsoft.EntityFrameworkCore.Update.AffectedCountModificationCommandBatch.ConsumeAsync(RelationalDataReader reader, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.ReaderModificationCommandBatch.ExecuteAsync(IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(DbContext _, ValueTuple`2 parameters, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
----- Inner Stack Trace -----
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)
   at System.Data.SqlClient.SqlDataReader.TryHasMoreResults(Boolean& moreResults)
   at System.Data.SqlClient.SqlDataReader.TryNextResult(Boolean& more)
   at System.Data.SqlClient.SqlDataReader.<>c__DisplayClass187_0.<NextResultAsync>b__1(Task t)
   at System.Data.SqlClient.SqlDataReader.InvokeRetryable[T](Func`2 moreFunc, TaskCompletionSource`1 source, IDisposable objectToDispose)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.EntityFrameworkCore.Update.AffectedCountModificationCommandBatch.ConsumeAsync(RelationalDataReader reader, CancellationToken cancellationToken)
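
(As an aside, the transient-error resiliency mentioned in the first exception message above is opt-in on the provider. A minimal sketch of where the EnableRetryOnFailure call would go is below; the connection string is only illustrative. Note that this retries the deadlock victim rather than preventing the deadlock, and that with explicitly started transactions the retried work also has to be wrapped in the execution strategy.)

// Sketch only: inside the DbContext, opt into the retrying execution strategy
// so a deadlock victim is retried instead of surfacing as a DbUpdateException.
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder.UseSqlServer(
        @"Server=.\SQLEXPRESS;Database=DeadlockDatabase;Trusted_Connection=True;", // illustrative
        options => options.EnableRetryOnFailure());
}
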
@divega (Contributor) commented Jan 10, 2019

The two workarounds are actually similar: when you remove the FK property StateItem.InspectionId, EF Core creates an "InspectionId" FK property in shadow state. In that case the FK is nullable, and by convention the constraint is created with ON DELETE NO ACTION.

The real property isn't nullable, hence by convention the corresponding database constraint is created with ON DELETE CASCADE.

So this workaround also changes the cascade behavior, like the first one, but on a different constraint.
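
In other words (a sketch, with the class shapes assumed from the repro):

public class Inspection
{
    public long Id { get; set; }
}

// As in the repro: a non-nullable FK property makes the relationship required,
// and by convention the store constraint gets ON DELETE CASCADE.
public class StateItem
{
    public long Id { get; set; }
    public long InspectionId { get; set; }
    public Inspection Inspection { get; set; }
}

// The workaround: remove the FK property. EF Core then adds a nullable
// "InspectionId" property in shadow state, the relationship becomes optional,
// and by convention the constraint is created with ON DELETE NO ACTION.
public class StateItemWithoutFkProperty
{
    public long Id { get; set; }
    public Inspection Inspection { get; set; }
}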

@divega (Contributor) commented Jan 11, 2019

Here are the contents of the xml_deadlock_report with MaxBatchSize(1):

<deadlock>
 <victim-list>
  <victimProcess id="process1d7a8c24ca8" />
 </victim-list>
 <process-list>
  <process id="process1d7a8c24ca8" taskpriority="0" logused="2872" waitresource="KEY: 8:72057594043301888 (9dd6aff0b38d)" waittime="3969" ownerId="31287" transactionname="user_transaction" lasttranstarted="2019-01-10T16:48:02.680" XDES="0x1d7b59d8490" lockMode="RangeS-U" schedulerid="8" kpid="26864" status="suspended" spid="60" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2019-01-10T16:48:02.740" lastbatchcompleted="2019-01-10T16:48:02.740" lastattention="1900-01-01T00:00:00.740" clientapp="Core .Net SqlClient Data Provider" hostname="DIVEGA13" hostpid="1714060944" loginname="REDMOND\divega" isolationlevel="read committed (2)" xactid="31287" currentdb="8" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="2" stmtstart="58" stmtend="138" sqlhandle="0x02000000e0100005ffcfde2f1eba4da32aefa2075a13a1840000000000000000000000000000000000000000">
unknown    </frame>
    <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
(@p0 bigint)SET NOCOUNT ON;
DELETE FROM [StateItem]
WHERE [Id] = @p0;
SELECT @@ROWCOUNT;

   </inputbuf>
  </process>
  <process id="process1d7b5047c28" taskpriority="0" logused="3140" waitresource="KEY: 8:72057594043301888 (ef8398b4b071)" waittime="3969" ownerId="31285" transactionname="user_transaction" lasttranstarted="2019-01-10T16:48:02.680" XDES="0x1d7a5ce8490" lockMode="RangeS-U" schedulerid="6" kpid="26596" status="suspended" spid="59" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2019-01-10T16:48:02.743" lastbatchcompleted="2019-01-10T16:48:02.743" lastattention="1900-01-01T00:00:00.743" clientapp="Core .Net SqlClient Data Provider" hostname="DIVEGA13" hostpid="1714060944" loginname="REDMOND\divega" isolationlevel="read committed (2)" xactid="31285" currentdb="8" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="2" stmtstart="58" stmtend="138" sqlhandle="0x02000000e0100005ffcfde2f1eba4da32aefa2075a13a1840000000000000000000000000000000000000000">
unknown    </frame>
    <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
(@p0 bigint)SET NOCOUNT ON;
DELETE FROM [StateItem]
WHERE [Id] = @p0;
SELECT @@ROWCOUNT;

   </inputbuf>
  </process>
 </process-list>
 <resource-list>
  <keylock hobtid="72057594043301888" dbid="8" objectname="DeadlockDatabase.dbo.StateDetail" indexname="IX_StateDetail_VehicleStateItemId" id="lock1d7b56d1000" mode="RangeX-X" associatedObjectId="72057594043301888">
   <owner-list>
    <owner id="process1d7b5047c28" mode="RangeX-X" />
   </owner-list>
   <waiter-list>
    <waiter id="process1d7a8c24ca8" mode="RangeS-U" requestType="wait" />
   </waiter-list>
  </keylock>
  <keylock hobtid="72057594043301888" dbid="8" objectname="DeadlockDatabase.dbo.StateDetail" indexname="IX_StateDetail_VehicleStateItemId" id="lock1d7b56d1080" mode="X" associatedObjectId="72057594043301888">
   <owner-list>
    <owner id="process1d7a8c24ca8" mode="X" />
   </owner-list>
   <waiter-list>
    <waiter id="process1d7b5047c28" mode="RangeS-U" requestType="wait" />
   </waiter-list>
  </keylock>
 </resource-list>
</deadlock>

@divega (Contributor) commented Jan 11, 2019

Note for triage: I have tried several things that I have learned in the past can help work around deadlocks, and even a few long shots:

  • Different isolation levels
  • Disabling lock escalation on the tables
  • Reducing the size of batches (even down to 1)
  • Reducing the scope of the transaction to just SaveChanges
  • Fixing some minor bugs in the model (e.g. StateDetail.VehicleStateItemId wasn't being used)
  • Changing the type of the keys from bigint to int

Nothing helped: as long as there are two levels of cascade deletes on Inspection --> StateItem --> StateDetail, if two concurrent sessions try to delete two different StateItems, a deadlock ends up occurring at the point where the two sessions try to access StateDetail with different types of locks, presumably to apply the cascade delete.

At this point this looks like a limitation in SQL Server. Here are a few things I would like to discuss in triage:

  1. Why is it that we don't see this more often? The model doesn't seem uncommon, but maybe I am missing something.
  2. If we think this can affect lots of customers, and we just don't hear from them, should we stop doing cascade deletes by convention (at least with SQL Server)? A model-wide opt-out in user code is sketched after this list.
  3. How best to follow up with the SQL Server team to get their opinion, and possibly have them consider a fix.
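
For reference, an application can already opt out of cascade deletes model-wide without waiting for a convention change. A rough sketch (not code from the repro; it needs a using directive for System.Linq):

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Sketch: switch every conventionally-cascading FK to Restrict. Child rows then
    // have to be deleted explicitly before their parents, as in the reporter's Solution 2.
    foreach (var foreignKey in modelBuilder.Model
                 .GetEntityTypes()
                 .SelectMany(entityType => entityType.GetForeignKeys())
                 .Where(fk => fk.DeleteBehavior == DeleteBehavior.Cascade))
    {
        foreignKey.DeleteBehavior = DeleteBehavior.Restrict;
    }
}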

@ajcvickers (Member) commented

@divega to follow up with @AndriySvyryd and to reach out to the SQL Server team.

ajcvickers added this to the 3.0.0 milestone Jan 14, 2019
@divega (Contributor) commented Jan 15, 2019

I changed the repro to simplify it and to add some instrumentation/console output, and I was able to observe that the deadlock occurs with almost perfect regularity every other iteration.

This led me to a new hypothesis: the deadlock could actually involve contention between the currently running task and locks held by the homologous task from the previous iteration. This could be the case if, for example, the transaction was actually completing asynchronously and resources were still being held even after the SqlClient transaction had completed and the connection was closed.

I ran some tests that seem to support (or at least not refute) this hypothesis:

I introduced two new root entities and changed the program so that each even-numbered iteration affects roots 1 and 2 and each odd-numbered iteration affects roots 2 and 3. This reduced the number of deadlocks significantly (1 every 100 iterations), although the number of transactions is the same.

A few more observations:

  1. The behavior is the same between .NET Core and .NET Framework
  2. Iterations 1 to 10 are usually very slow. Things speed up later, and I can't explain why.

Here is the simplified repro (paste this into a console application):

using Microsoft.EntityFrameworkCore;
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.Linq;
using System.Threading.Tasks;

namespace DeadlockRepro
{

    public class Blog
    {
        public int BlogId { get; set; }
        public ICollection<Post> Posts { get; } = new HashSet<Post>();
    }

    public class Post
    {
        public int PostId { get; set; }
        // Workaround 1: comment out this required attribute
        [Required]
        public Blog Blog { get; set; }
        public ICollection<Comment> Comments { get; } = new HashSet<Comment>();
    }

    public class Comment
    {
        public int CommentId { get; set; }
        // Workaround 2: comment out this required attribute
        [Required]
        public Post Post { get; set; }
    }


    public class DeadlockContext : DbContext
    {
        public DbSet<Blog> Blogs { get; set; }

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        {
            // disabling pooling reduces the number of deadlocks significantly
            optionsBuilder.UseSqlServer(
                @"Server=(localdb)\mssqllocaldb;Database=DeadlockDatabase;Trusted_Connection=True;ConnectRetryCount=0;Pooling=true;",
                options => options.MaxBatchSize(1));
        }
    }

    public class Program
    {
        static void Main(string[] args) 
        {
            using (var db = new DeadlockContext())
            {
                db.Database.EnsureDeleted();
                db.Database.EnsureCreated();

                db.Blogs.Add(new Blog());
                db.Blogs.Add(new Blog());
                db.Blogs.Add(new Blog());
                db.Blogs.Add(new Blog());
                db.SaveChanges();
            }

            for (int i = 0; i < 100; i++)
            {
                try
                {
                    var baseline = 0; 
                    // uncomment to alternate between blogs 1, 2 and 2, 3 for each iteration
                    // baseline = i % 2 == 0 ? 0 : 2; 
                    var t1 = CauseDeadlock(baseline + 1);
                    var t2 = CauseDeadlock(baseline + 2);
                                       
                    Task.WaitAll(t1, t2);
                    Console.WriteLine($"Iteration {i} completed successfully.");

                }
                catch (AggregateException ex)
                {
                    Console.WriteLine($"Iteration {i} failed with {ex?.InnerException?.InnerException?.InnerException?.Message}");
                }
            }
        }

        private static async Task CauseDeadlock(int id)
        {
            using (var db = new DeadlockContext())
            {
                var blog =
                    db.Blogs
                        .Include(b => b.Posts)
                        .ThenInclude(p => p.Comments)
                        .First(b => b.BlogId == id);

                foreach (var post in blog.Posts.ToList())
                {
                    blog.Posts.Remove(post);
                }

                blog.Posts.Add(new Post { Comments = { new Comment() } });
                blog.Posts.Add(new Post { Comments = { new Comment() } });
                blog.Posts.Add(new Post { Comments = { new Comment() } });

                await db.SaveChangesAsync();
            }
        }
    }
}

@divega (Contributor) commented Jan 15, 2019

@AndriySvyryd could you check with the latest repro if you can find anything I could be missing?

@AndriySvyryd (Member) commented

I think this is the same issue we had previously, when we didn't have a deterministic order for SaveChanges. When it deadlocks, the new posts and comments are being inserted before the old ones are deleted. I think we don't have more reports of this because the entities need to be queried in this specific way to produce this order of applying the changes.

We can probably fix it by adding an explicit dependency between deletes and inserts to the same table when applying the topological sort.
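
A toy illustration of that idea (this is not the actual EF Core Multigraph code): give every DELETE an edge to every INSERT targeting the same table, then topologically sort, so the deletes for a table can only come out before its inserts.

using System.Collections.Generic;
using System.Linq;

public enum Operation { Delete, Insert }

public class Command
{
    public string Table;
    public Operation Operation;
    public string Description;
}

public static class CommandOrderingSketch
{
    // Toy sketch: add Delete -> Insert edges between commands on the same table,
    // then run Kahn's algorithm so deletes are emitted before inserts for that table.
    public static List<Command> Sort(IReadOnlyList<Command> commands)
    {
        var edges = commands.ToDictionary(c => c, _ => new List<Command>());
        var indegree = commands.ToDictionary(c => c, _ => 0);

        foreach (var delete in commands.Where(c => c.Operation == Operation.Delete))
        {
            foreach (var insert in commands.Where(
                c => c.Operation == Operation.Insert && c.Table == delete.Table))
            {
                edges[delete].Add(insert);
                indegree[insert]++;
            }
        }

        var ready = new Queue<Command>(commands.Where(c => indegree[c] == 0));
        var sorted = new List<Command>();
        while (ready.Count > 0)
        {
            var command = ready.Dequeue();
            sorted.Add(command);
            foreach (var dependent in edges[command])
            {
                if (--indegree[dependent] == 0)
                {
                    ready.Enqueue(dependent);
                }
            }
        }

        return sorted;
    }
}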

@divega (Contributor) commented Jan 15, 2019

We can probably fix it by adding an explicit dependency between deletes and inserts to the same table when applying the topological sort.

Oh, I thought we had already done that! 😮 Thanks @AndriySvyryd!

Clearing the milestone to discuss priority in the next triage.

divega removed this from the 3.0.0 milestone Jan 15, 2019
divega removed their assignment Jan 15, 2019
@AndriySvyryd (Member) commented

@divega Currently we only do that in the second order sort

ajcvickers added this to the 3.0.0 milestone Jan 16, 2019
ajcvickers modified the milestones: 3.0.0, Backlog Jun 28, 2019
ajcvickers modified the milestones: Backlog, 6.0.0 Nov 5, 2020
AndriySvyryd removed their assignment Sep 4, 2021
AndriySvyryd added the closed-fixed label Sep 4, 2021
AndriySvyryd added a commit that referenced this issue Sep 4, 2021
Always sort deletes before updates and updates before inserts for the same table.

Fixes #14371
Fixes #15180
Fixes #25228
AndriySvyryd removed the closed-fixed label Sep 4, 2021
AndriySvyryd self-assigned this Sep 4, 2021
AndriySvyryd added a commit that referenced this issue Sep 9, 2021
…cks.

This dependency isn't hard, so cycle breaking is introduced.
Make the implementations of Multigraph TopologicalSort and BatchingTopologicalSort more alike and faster

Fixes #14371
AndriySvyryd removed their assignment Sep 9, 2021
AndriySvyryd added the closed-fixed label Sep 9, 2021
ajcvickers modified the milestones: 6.0.0, 6.0.0-rc2 Sep 10, 2021
ajcvickers modified the milestones: 6.0.0-rc2, 6.0.0 Nov 8, 2021