
Transaction deadlocks within a simple DB model #14371

Closed
mareklinka opened this issue Jan 10, 2019 · 9 comments · Fixed by #25952
Labels: area-save-changes, closed-fixed (the issue has been fixed and is/will be included in the release indicated by the issue milestone), customer-reported, punted-for-3.0, type-bug
Milestone: 6.0.0

Comments

@mareklinka commented Jan 10, 2019

Hello all. I'm facing an extremely weird issue with EF Core 2.2 (also briefly tested against 2.0). A little bit of backstory:

We received a bug report from production for a project I'm currently working on. After investigating, I discovered that the application sometimes encounters transaction deadlocks under certain specific conditions. I tried to figure out the root cause for over a day, unsuccessfully.

Then I decided to create a minimal repro of the problem, which led me here. Here's the situation:

Environment:
  • Windows 10 1809 x64
  • SQL Server Express 2017 with the latest patches
  • EF Core 2.2 on .NET Core 2.2

Technical details
I have a very simple DB model. It's basically a three-level tree structure, with the individual roots being disjoint (they don't share any entities). We modify the trees in transactions using the ReadCommitted isolation level. The error occurs when our ASP.NET Core web app receives two requests to modify two different trees AND the transactions run for over 600 ms. The deadlock only occurs if we delete some entities from the tree.

Under these conditions, the two operations sometimes cause a deadlock, resulting in an exception and one request failing. But since these trees share no content, operations on them should not, in my opinion, produce any deadlocks.

I managed to reduce the problem to a unit test project, which I'm attaching. I even managed to find a workaround of sorts: it turns out that if I switch one of the foreign keys in the model from Cascade to Restrict (and then delete the child entities manually), the deadlocks stop happening.
This is mentioned in the Solution 2 comment in the file DatabaseContext.cs in my demo.
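
For readers without the attachment, here is a minimal sketch of what the Solution 2 change could look like in OnModelCreating. The entity shapes and navigation names are assumptions based on the table and column names that appear later in this thread, not the exact code from the zip:

using System.Collections.Generic;
using Microsoft.EntityFrameworkCore;

public class StateItem
{
    public long Id { get; set; }
    public ICollection<StateDetail> Details { get; set; } = new List<StateDetail>();
}

public class StateDetail
{
    public long Id { get; set; }
    public StateItem StateItem { get; set; }
}

public class DatabaseContext : DbContext
{
    public DbSet<StateItem> StateItems { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Sketch of Solution 2: map the StateItem -> StateDetail foreign key with
        // Restrict instead of the conventional Cascade; the StateDetail children are
        // then deleted explicitly in code before their parent StateItem is removed.
        modelBuilder.Entity<StateDetail>()
            .HasOne(d => d.StateItem)
            .WithMany(s => s.Details)
            .OnDelete(DeleteBehavior.Restrict);
    }
}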

Another, even weirder solution is to comment out a property of an entity. The code then works without deadlocks, even with Cascade deletes. This is in the file StateItem.cs as Solution 1.

I see no reason why any of the workarounds should prevent a database deadlock. Could someone please have a look at the code and try to explain what's going on in there? As I mentioned, this came from production and I'd like to resolve it sooner rather than later, so any insight is greatly appreciated.

The repro code
DeadlockRepro.zip

---- System.InvalidOperationException : An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure()' to the 'UseSqlServer' call.
-------- Microsoft.EntityFrameworkCore.DbUpdateException : An error occurred while updating the entries. See the inner exception for details.
------------ System.Data.SqlClient.SqlException : Transaction (Process ID 55) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
   at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
   at DeadlockRepro.UnitTest1.Test1() in C:\Users\Marek\source\repos\DeadlockRepro\UnitTest1.cs:line 33
----- Inner Stack Trace -----
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IReadOnlyList`1 entriesToSave, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
   at DeadlockRepro.UnitTest1.CauseDeadlock(Int32 id) in C:\Users\Marek\source\repos\DeadlockRepro\UnitTest1.cs:line 64
----- Inner Stack Trace -----
   at Microsoft.EntityFrameworkCore.Update.AffectedCountModificationCommandBatch.ConsumeAsync(RelationalDataReader reader, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.ReaderModificationCommandBatch.ExecuteAsync(IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(DbContext _, ValueTuple`2 parameters, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
----- Inner Stack Trace -----
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)
   at System.Data.SqlClient.SqlDataReader.TryHasMoreResults(Boolean& moreResults)
   at System.Data.SqlClient.SqlDataReader.TryNextResult(Boolean& more)
   at System.Data.SqlClient.SqlDataReader.<>c__DisplayClass187_0.<NextResultAsync>b__1(Task t)
   at System.Data.SqlClient.SqlDataReader.InvokeRetryable[T](Func`2 moreFunc, TaskCompletionSource`1 source, IDisposable objectToDispose)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.EntityFrameworkCore.Update.AffectedCountModificationCommandBatch.ConsumeAsync(RelationalDataReader reader, CancellationToken cancellationToken)
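
(As an aside, the transient-error resiliency mentioned in the first exception message above is opt-in on the provider. A minimal sketch of where the EnableRetryOnFailure call would go is below; the connection string is only illustrative. Note that this retries the deadlock victim rather than preventing the deadlock, and that with explicitly started transactions the retried work also has to be wrapped in the execution strategy.)

// Sketch only: inside the DbContext, opt into the retrying execution strategy
// so a deadlock victim is retried instead of surfacing as a DbUpdateException.
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder.UseSqlServer(
        @"Server=.\SQLEXPRESS;Database=DeadlockDatabase;Trusted_Connection=True;", // illustrative
        options => options.EnableRetryOnFailure());
}
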
@divega (Contributor) commented Jan 10, 2019

The two workarounds are actually similar: when you remove the FK property StateItem.InspectionId, EF Core creates an "InspectionId" FK property in shadow state. In that case the FK is nullable, and by convention the constraint is created with ON DELETE NO ACTION.

The real property isn't nullable, hence by convention the corresponding database constraint is created with ON DELETE CASCADE.

So this workaround also changes the cascade behavior, like the first one, but on a different constraint.
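
In other words (a sketch, with the class shapes assumed from the repro):

public class Inspection
{
    public long Id { get; set; }
}

// As in the repro: a non-nullable FK property makes the relationship required,
// and by convention the store constraint gets ON DELETE CASCADE.
public class StateItem
{
    public long Id { get; set; }
    public long InspectionId { get; set; }
    public Inspection Inspection { get; set; }
}

// The workaround: remove the FK property. EF Core then adds a nullable
// "InspectionId" property in shadow state, the relationship becomes optional,
// and by convention the constraint is created with ON DELETE NO ACTION.
public class StateItemWithoutFkProperty
{
    public long Id { get; set; }
    public Inspection Inspection { get; set; }
}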

@divega (Contributor) commented Jan 11, 2019

Here are the contents of the xml_deadlock_report with MaxBatchSize(1):

<deadlock>
 <victim-list>
  <victimProcess id="process1d7a8c24ca8" />
 </victim-list>
 <process-list>
  <process id="process1d7a8c24ca8" taskpriority="0" logused="2872" waitresource="KEY: 8:72057594043301888 (9dd6aff0b38d)" waittime="3969" ownerId="31287" transactionname="user_transaction" lasttranstarted="2019-01-10T16:48:02.680" XDES="0x1d7b59d8490" lockMode="RangeS-U" schedulerid="8" kpid="26864" status="suspended" spid="60" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2019-01-10T16:48:02.740" lastbatchcompleted="2019-01-10T16:48:02.740" lastattention="1900-01-01T00:00:00.740" clientapp="Core .Net SqlClient Data Provider" hostname="DIVEGA13" hostpid="1714060944" loginname="REDMOND\divega" isolationlevel="read committed (2)" xactid="31287" currentdb="8" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="2" stmtstart="58" stmtend="138" sqlhandle="0x02000000e0100005ffcfde2f1eba4da32aefa2075a13a1840000000000000000000000000000000000000000">
unknown    </frame>
    <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
(@p0 bigint)SET NOCOUNT ON;
DELETE FROM [StateItem]
WHERE [Id] = @p0;
SELECT @@ROWCOUNT;

   </inputbuf>
  </process>
  <process id="process1d7b5047c28" taskpriority="0" logused="3140" waitresource="KEY: 8:72057594043301888 (ef8398b4b071)" waittime="3969" ownerId="31285" transactionname="user_transaction" lasttranstarted="2019-01-10T16:48:02.680" XDES="0x1d7a5ce8490" lockMode="RangeS-U" schedulerid="6" kpid="26596" status="suspended" spid="59" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2019-01-10T16:48:02.743" lastbatchcompleted="2019-01-10T16:48:02.743" lastattention="1900-01-01T00:00:00.743" clientapp="Core .Net SqlClient Data Provider" hostname="DIVEGA13" hostpid="1714060944" loginname="REDMOND\divega" isolationlevel="read committed (2)" xactid="31285" currentdb="8" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="2" stmtstart="58" stmtend="138" sqlhandle="0x02000000e0100005ffcfde2f1eba4da32aefa2075a13a1840000000000000000000000000000000000000000">
unknown    </frame>
    <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
(@p0 bigint)SET NOCOUNT ON;
DELETE FROM [StateItem]
WHERE [Id] = @p0;
SELECT @@ROWCOUNT;

   </inputbuf>
  </process>
 </process-list>
 <resource-list>
  <keylock hobtid="72057594043301888" dbid="8" objectname="DeadlockDatabase.dbo.StateDetail" indexname="IX_StateDetail_VehicleStateItemId" id="lock1d7b56d1000" mode="RangeX-X" associatedObjectId="72057594043301888">
   <owner-list>
    <owner id="process1d7b5047c28" mode="RangeX-X" />
   </owner-list>
   <waiter-list>
    <waiter id="process1d7a8c24ca8" mode="RangeS-U" requestType="wait" />
   </waiter-list>
  </keylock>
  <keylock hobtid="72057594043301888" dbid="8" objectname="DeadlockDatabase.dbo.StateDetail" indexname="IX_StateDetail_VehicleStateItemId" id="lock1d7b56d1080" mode="X" associatedObjectId="72057594043301888">
   <owner-list>
    <owner id="process1d7a8c24ca8" mode="X" />
   </owner-list>
   <waiter-list>
    <waiter id="process1d7b5047c28" mode="RangeS-U" requestType="wait" />
   </waiter-list>
  </keylock>
 </resource-list>
</deadlock>

@divega (Contributor) commented Jan 11, 2019

Note for triage: I have tried several things that I have learned in the past can help work around deadlocks, and even a few long shots:

  • Different isolation levels
  • Disabling lock escalation on the tables
  • Reducing the size of batches (even down to 1)
  • Reducing the scope of the transaction to just SaveChanges
  • Fixing some minor bugs in the model (e.g. StateDetail.VehicleStateItemId wasn't being used)
  • Changing the type of the keys from bigint to int

Nothing helped: as long as there are two levels of cascade deletes on Inspection --> StateItem --> StateDetail, if two concurrent sessions try to delete two different StateItems, a deadlock ends up occurring at the point where the two sessions try to access StateDetail with different types of locks, presumably to apply the cascade delete.

At this point this looks like a limitation in SQL Server. Here are a few things I would like to discuss in triage:

  1. Why is it that we don't see this more often? The model doesn't seem uncommon, but maybe I am missing something.
  2. If we think this can affect lots of customers, and we just don't hear from them, should we stop doing cascade deletes by convention (at least with SQL Server)? A model-wide opt-out in user code is sketched after this list.
  3. How best to follow up with the SQL Server team to get their opinion, and possibly have them consider a fix.
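
For reference, an application can already opt out of cascade deletes model-wide without waiting for a convention change. A rough sketch (not code from the repro; it needs a using directive for System.Linq):

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Sketch: switch every conventionally-cascading FK to Restrict. Child rows then
    // have to be deleted explicitly before their parents, as in the reporter's Solution 2.
    foreach (var foreignKey in modelBuilder.Model
                 .GetEntityTypes()
                 .SelectMany(entityType => entityType.GetForeignKeys())
                 .Where(fk => fk.DeleteBehavior == DeleteBehavior.Cascade))
    {
        foreignKey.DeleteBehavior = DeleteBehavior.Restrict;
    }
}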

@ajcvickers (Member) commented

@divega to follow up with @AndriySvyryd and to reach out to the SQL Server team.

ajcvickers added this to the 3.0.0 milestone Jan 14, 2019
@divega (Contributor) commented Jan 15, 2019

I changed the repro to simplify it and to add some instrumentation/console output, and I was able to observe that the deadlock occurs with almost perfect regularity every other iteration.

This led me to a new hypothesis: the deadlock could actually involve contention between the currently running task and locks held by the homologous task from the previous iteration. This could be the case if, for example, the transaction was actually completing asynchronously and resources were still being held even after the SqlClient transaction had completed and the connection was closed.

I ran some tests that seem to support (or at least not refute) this hypothesis:

I introduced two new root entities and changed the program so that each even-numbered iteration affects roots 1 and 2 and each odd-numbered iteration affects roots 2 and 3. This reduced the number of deadlocks significantly (1 every 100 iterations), although the number of transactions is the same.

A few more observations:

  1. The behavior is the same between .NET Core and .NET Framework
  2. Iterations 1 to 10 are usually very slow. Things speed up later, and I can't explain why.

Here is the simplified repro (paste this into a console application):

using Microsoft.EntityFrameworkCore;
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.Linq;
using System.Threading.Tasks;

namespace DeadlockRepro
{

    public class Blog
    {
        public int BlogId { get; set; }
        public ICollection<Post> Posts { get; } = new HashSet<Post>();
    }

    public class Post
    {
        public int PostId { get; set; }
        // Workaround 1: comment out this required attribute
        [Required]
        public Blog Blog { get; set; }
        public ICollection<Comment> Comments { get; } = new HashSet<Comment>();
    }

    public class Comment
    {
        public int CommentId { get; set; }
        // Workaround 2: comment out this required attribute
        [Required]
        public Post Post { get; set; }
    }


    public class DeadlockContext : DbContext
    {
        public DbSet<Blog> Blogs { get; set; }

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        {
            // disabling pooling reduces the number of deadlocks significantly
            optionsBuilder.UseSqlServer(
                @"Server=(localdb)\mssqllocaldb;Database=DeadlockDatabase;Trusted_Connection=True;ConnectRetryCount=0;Pooling=true;",
                options => options.MaxBatchSize(1));
        }
    }

    public class Program
    {
        static void Main(string[] args) 
        {
            using (var db = new DeadlockContext())
            {
                db.Database.EnsureDeleted();
                db.Database.EnsureCreated();

                db.Blogs.Add(new Blog());
                db.Blogs.Add(new Blog());
                db.Blogs.Add(new Blog());
                db.Blogs.Add(new Blog());
                db.SaveChanges();
            }

            for (int i = 0; i < 100; i++)
            {
                try
                {
                    var baseline = 0; 
                    // uncomment to alternate between blogs 1, 2 and 2, 3 for each iteration
                    // baseline = i % 2 == 0 ? 0 : 2; 
                    var t1 = CauseDeadlock(baseline + 1);
                    var t2 = CauseDeadlock(baseline + 2);
                                       
                    Task.WaitAll(t1, t2);
                    Console.WriteLine($"Iteration {i} completed successfully.");

                }
                catch (AggregateException ex)
                {
                    Console.WriteLine($"Iteration {i} failed with {ex?.InnerException?.InnerException?.InnerException?.Message}");
                }
            }
        }

        private static async Task CauseDeadlock(int id)
        {
            using (var db = new DeadlockContext())
            {
                var blog =
                    db.Blogs
                        .Include(b => b.Posts)
                        .ThenInclude(p => p.Comments)
                        .First(b => b.BlogId == id);

                foreach (var post in blog.Posts.ToList())
                {
                    blog.Posts.Remove(post);
                }

                blog.Posts.Add(new Post { Comments = { new Comment() } });
                blog.Posts.Add(new Post { Comments = { new Comment() } });
                blog.Posts.Add(new Post { Comments = { new Comment() } });

                await db.SaveChangesAsync();
            }
        }
    }
}

@divega (Contributor) commented Jan 15, 2019

@AndriySvyryd could you check with the latest repro if you can find anything I could be missing?

@AndriySvyryd (Member) commented

I think this is the same issue we had previously, when we didn't have a deterministic order for SaveChanges. When it deadlocks, the new posts and comments are being inserted before the old ones are deleted. I think we don't have more reports of this because the entities need to be queried in this specific way to produce this order of applying the changes.

We can probably fix it by adding an explicit dependency between deletes and inserts to the same table when applying the topological sort.
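
A toy illustration of that idea (this is not the actual EF Core Multigraph code): give every DELETE an edge to every INSERT targeting the same table, then topologically sort, so the deletes for a table can only come out before its inserts.

using System.Collections.Generic;
using System.Linq;

public enum Operation { Delete, Insert }

public class Command
{
    public string Table;
    public Operation Operation;
    public string Description;
}

public static class CommandOrderingSketch
{
    // Toy sketch: add Delete -> Insert edges between commands on the same table,
    // then run Kahn's algorithm so deletes are emitted before inserts for that table.
    public static List<Command> Sort(IReadOnlyList<Command> commands)
    {
        var edges = commands.ToDictionary(c => c, _ => new List<Command>());
        var indegree = commands.ToDictionary(c => c, _ => 0);

        foreach (var delete in commands.Where(c => c.Operation == Operation.Delete))
        {
            foreach (var insert in commands.Where(
                c => c.Operation == Operation.Insert && c.Table == delete.Table))
            {
                edges[delete].Add(insert);
                indegree[insert]++;
            }
        }

        var ready = new Queue<Command>(commands.Where(c => indegree[c] == 0));
        var sorted = new List<Command>();
        while (ready.Count > 0)
        {
            var command = ready.Dequeue();
            sorted.Add(command);
            foreach (var dependent in edges[command])
            {
                if (--indegree[dependent] == 0)
                {
                    ready.Enqueue(dependent);
                }
            }
        }

        return sorted;
    }
}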

@divega (Contributor) commented Jan 15, 2019

We can probably fix it by adding an explicit dependency between deletes and inserts to the same table when applying the topological sort.

Oh, I thought we had already done that! 😮 Thanks @AndriySvyryd!

Clearing the milestone to discuss priority in the next triage.

divega removed this from the 3.0.0 milestone Jan 15, 2019
divega removed their assignment Jan 15, 2019
@AndriySvyryd (Member) commented

@divega Currently we only do that in the second order sort

ajcvickers added this to the 3.0.0 milestone Jan 16, 2019
ajcvickers modified the milestones: 3.0.0, Backlog Jun 28, 2019
ajcvickers modified the milestones: Backlog, 6.0.0 Nov 5, 2020
AndriySvyryd removed their assignment Sep 4, 2021
AndriySvyryd added the closed-fixed label Sep 4, 2021
AndriySvyryd added a commit that referenced this issue Sep 4, 2021
Always sort deletes before updates and updates before inserts for the same table.

Fixes #14371
Fixes #15180
Fixes #25228
AndriySvyryd removed the closed-fixed label Sep 4, 2021
AndriySvyryd self-assigned this Sep 4, 2021
AndriySvyryd added a commit that referenced this issue Sep 9, 2021
…cks.

This dependency isn't hard, so cycle breaking is introduced.
Make the implementations of Multigraph TopologicalSort and BatchingTopologicalSort more alike and faster

Fixes #14371
AndriySvyryd removed their assignment Sep 9, 2021
AndriySvyryd added the closed-fixed label Sep 9, 2021
ajcvickers modified the milestones: 6.0.0, 6.0.0-rc2 Sep 10, 2021
ajcvickers modified the milestones: 6.0.0-rc2, 6.0.0 Nov 8, 2021