Synchronization for non timed-out rollback. #1611 #1616

fredericDelaporte · 2017-06-22T17:31:36Z

Here is a working solution for avoiding the race condition after a distributed rollback. It does not even require the EnlistTransaction(null). But it does not handle the troubles with timeouts.

Note that locally I am unable to compile the new target netstandard 2.0. (Not having preview installed.)

I am going to comment the proposed changes.

fredericDelaporte · 2017-06-22T17:32:28Z

src/Npgsql/NpgsqlConnection.cs

@@ -501,16 +501,17 @@ public new NpgsqlTransaction BeginTransaction(IsolationLevel level)
 /// <summary>
 /// Enlist transation.
 /// </summary>
- public override void EnlistTransaction(Transaction transaction)
+ public override void EnlistTransaction([CanBeNull]Transaction transaction)


For having R# happy. (I frequently end up just removing that beast from my box.)

fredericDelaporte · 2017-06-22T17:33:04Z

src/Npgsql/NpgsqlConnection.cs

 {
- if (EnlistedTransaction != null)
+ var enlistedTransaction = EnlistedTransaction;


Storing in a local variable for avoiding race with ressource manager setting it to null.

fredericDelaporte · 2017-06-22T17:36:22Z

src/Npgsql/NpgsqlConnection.cs

@@ -521,6 +522,7 @@ public override void EnlistTransaction(Transaction transaction)
 }

 var connector = CheckReadyAndGetConnector();
+ connector.EnlistedResource?.WaitIfRequired();


Required for avoiding replacing the RM on the connector while the current RM still needs the connector (transaction no more active but second phase not yet ended).

fredericDelaporte · 2017-06-22T17:38:45Z

src/Npgsql/NpgsqlConnection.cs

@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
 // Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
 // distributed transactions aren't supported.


This comment looks not completely right to me. A durable resource does not enforce promotion to distributed, provided it supports single phase and is the only one enlisted. Unless Mono does not support durable resources at all, better use durable resource with Mono too, once implemented. (Confirmed by #1592 discussion.)

I agree, I've learned some things since I wrote that comment :)

fredericDelaporte · 2017-06-22T17:41:05Z

src/Npgsql/NpgsqlConnector.cs

@@ -1506,6 +1515,10 @@ internal UserAction StartUserAction(NpgsqlCommand command)

 internal UserAction StartUserAction(ConnectorState newState=ConnectorState.Executing, NpgsqlCommand command=null)
 {
+#if !NETSTANDARD1_3
+ EnlistedResource?.WaitIfRequired();


Main point protecting from concurrent use when the RM has not finished its work. Of course EnlistedResource needs to be still there when the RM has not finished its work, thus the sync check in Connection.EnlistTransaction.

fredericDelaporte · 2017-06-22T17:55:35Z

src/Npgsql/VolatileResourceManager.cs

+ {
+ if (_secondPhaseEndedOrNotNeedingWait || _isRMCall.Value ||
+ _transaction?.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
+ return;


No wait if the second phase does not mandate it, if we are the thread executing the second phase, or if the transaction is still on-going.

Is this an optimization to avoid waiting on the semaphore if we don't need to (even when it's not taken)? If so, I do understand the motivation for this optimization - WaitIfRequired() gets called from StartUserAction, which makes it a hot path - just trying to make sure I'm not missing something else.

EDIT: on second thought, if this is purely an optimization, then it only optimizes StartUserAction in cases where the connector's EnlistedResource is set - which is perhaps a bit limited.

It is not an optimization, but I should simplify this by using a ManualResetEventSlim instead.

fredericDelaporte · 2017-06-22T17:56:14Z

src/Npgsql/VolatileResourceManager.cs

+ catch (ObjectDisposedException)
+ {
+ // Ignore. The transaction is already disposed, but the second phase is not ended.
+ }


We go there in the rollback case: the transaction is disposed while the rollback not yet done.

fredericDelaporte · 2017-06-22T17:57:19Z

src/Npgsql/VolatileResourceManager.cs

+ bool waitSuccess;
+ try
+ {
+ waitSuccess = _secondPhaseLock.Wait(5000);


Second phase should not take long. Better not freeze the app if this lock proves buggy someday. But could be a bit annoying when debugging.

fredericDelaporte · 2017-06-22T18:03:42Z

src/Npgsql/VolatileResourceManager.cs

+ // Avoid un-enlisting another RM in case the connection has been concurrently re-used in a new
+ // transaction.
+ if (_connector.EnlistedResource == this)
+ _connector.EnlistedResource = null;


In the case a prepare phase has been run, we have already set it to null, so with the test, no risk, even with concurrency, to override another RM.
In the rollback case, concurrency is still locked at that point.

fredericDelaporte · 2017-06-22T18:08:45Z

src/Npgsql/VolatileResourceManager.cs

@@ -326,6 +375,8 @@ void Dispose()
 else
 _connector.Close();
 }
+ _isRMCall.Dispose();
+ _secondPhaseLock.Dispose();


Dispose should be used only if all operation on the semaphore have completed. We have un-referenced the RM for avoiding new waiting threads, then released the lock. But does it immediately cause the waiting threads to no more use the semaphore? I am a bit unsure here, so I have put that try catch around the wait on the semaphore, in case there is a race condition between the signalling with Release, the actual end of thread waits, and the disposal here.

roji · 2017-06-28T09:18:25Z

@fredericDelaporte - thanks for submitting this, I promise to take a look very soon, am currently involved in a pretty huge refactor.

fredericDelaporte · 2017-06-28T10:54:07Z

There is no hurry on my side. For NHibernate issues with Npgsql, we only need #1611.

fredericDelaporte · 2017-07-04T13:21:41Z

src/Npgsql/NpgsqlConnection.cs

- if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
- throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})");
+ if (enlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
+ throw new InvalidOperationException($"Already enlisted to transaction (localid={enlistedTransaction.TransactionInformation.LocalIdentifier})");


New bug discovered: after scope disposal in distributed case, the transaction status may still be Active! This happens when the supplied transaction was not the original one but a .Clone() of it... Illustrated in this NHibernate test case. Waiting a bit allows to get the correct status...

I do not think we can do much against that later case. The main usual case (enlisting with the original transaction, not with a clone) will at least be protected by that code.

New bug discovered: after scope disposal in distributed case, the transaction status may still be Active!

I actually thought that could be the case simply because of the asynchronous nature of the DTC implementation...

roji

@fredericDelaporte, thanks for this PR. Sorry it took me a month to get around to it, I redid the entire Npgsql type handling and writing system in the meantime :)

I have written some comments, but before looking at those please look at my larger comment outside of this review for an alternative approach to solving this problem.

roji · 2017-07-24T15:37:31Z

src/Npgsql/NpgsqlConnection.cs

@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
 // Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
 // distributed transactions aren't supported.


I agree, I've learned some things since I wrote that comment :)

roji · 2017-07-24T15:39:11Z

src/Npgsql/NpgsqlConnection.cs

@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
 // Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
 // distributed transactions aren't supported.

- transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None);
+ var ressourceManager = new VolatileResourceManager(this, transaction);
+ connector.EnlistedResource = ressourceManager;


Nit: one s for resource (I'm half-French so I understand)

roji · 2017-07-24T15:39:39Z

src/Npgsql/NpgsqlConnector.cs

@@ -139,6 +139,15 @@ sealed partial class NpgsqlConnector : IDisposable
 [CanBeNull]
 internal NpgsqlConnection Connection { get; set; }

+#if !NETSTANDARD1_3
+ /// <summary>
+ /// The ressource manager actually using this connector. <see langword="null" /> if the


nit: "resource" as above

roji · 2017-07-24T15:40:13Z

src/Npgsql/VolatileResourceManager.cs

@@ -40,7 +40,7 @@ namespace Npgsql
 /// Note that a connection may be closed before its TransactionScope completes. In this case we close the NpgsqlConnection
 /// as usual but the connector in a special list in the pool; it will be closed only when the scope completes.
 /// </remarks>
- class VolatileResourceManager : ISinglePhaseNotification
+ sealed class VolatileResourceManager : ISinglePhaseNotification, IDisposable


roji · 2017-07-24T15:56:20Z

src/Npgsql/NpgsqlConnection.cs

@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
 // Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
 // distributed transactions aren't supported.

- transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None);
+ var ressourceManager = new VolatileResourceManager(this, transaction);
+ connector.EnlistedResource = ressourceManager;


Agreed, seems better to set EnlistedResource inside VolatileResourceManager.

roji · 2017-07-24T16:15:24Z

src/Npgsql/VolatileResourceManager.cs

@@ -138,6 +143,7 @@ public void Commit(Enlistment enlistment)

 try
 {
+ _isRMCall.Value = true;


Wouldn't it be easier to simply record Thread Thread.CurrentThread.ManagedThreadId on the RM and then check in WaitIfRequired if the current thread ID matches that? Seems simpler than a ThreadLocal unless I'm missing something...

roji · 2017-07-24T16:18:34Z

src/Npgsql/VolatileResourceManager.cs

+ {
+ if (_secondPhaseEndedOrNotNeedingWait || _isRMCall.Value ||
+ _transaction?.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
+ return;


Is this an optimization to avoid waiting on the semaphore if we don't need to (even when it's not taken)? If so, I do understand the motivation for this optimization - WaitIfRequired() gets called from StartUserAction, which makes it a hot path - just trying to make sure I'm not missing something else.

EDIT: on second thought, if this is purely an optimization, then it only optimizes StartUserAction in cases where the connector's EnlistedResource is set - which is perhaps a bit limited.

roji · 2017-07-24T16:21:08Z

src/Npgsql/NpgsqlConnection.cs

- if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
- throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})");
+ if (enlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
+ throw new InvalidOperationException($"Already enlisted to transaction (localid={enlistedTransaction.TransactionInformation.LocalIdentifier})");


New bug discovered: after scope disposal in distributed case, the transaction status may still be Active!

I actually thought that could be the case simply because of the asynchronous nature of the DTC implementation...

roji · 2017-07-24T16:40:59Z

Here are some thought I had about a different approach to this problem - this may be an advantage of disconnecting for a few weeks.

It occurred to me that Npgsql already has a mechanism for synchronizing access to a connection - the so-called user action. Rather than adding an additional semaphore (_secondPhaseLock), we could view the RM as, well, another user of the connection - we could simply start a user action when Prepare() is called, and end the user action when the RM is disposed. NpgsqlConnection.EnlistTransaction() would also start a user action, ending it right away - its sole purpose would be to make use block until any existing 2nd-phase commit is over.

IMHO this approach would have the advantage of leveraging the existing general Npgsql synchronization, resulting in significant simplification; for example, we'd no longer need the special WaitIfRequired(), simply relying on the existing StartUserAction() mechanism to make us block as needed.

The only small issue I can see with this approach is that the current user action mechanism simply throws on concurrent use, whereas if a 2-phase commit is in progress we'd like the concurrent attempt to wait until it's done. It should be pretty trivial to work around this: we could add a new ConnectorState.DistributedTransactionCommit state, and check for it specifically in StartUserAction(), waiting instead of throwing.

One side advantage would be to protect calls to NpgsqlConnection.EnlistTransaction() from (mistaken) concurrent use, just like we do with concurrent commands - a clear, informative exception would be thrown instead of the undefined behavior we have today.

What are your thoughts about this approach? I'd be glad to get your comments in case I'm missing something with the above. I could also try to put together a PR and see where it leads.

On an unrelated note, unfortunately tomorrow I'm leaving for a 3-month project that will probably not permit me to spend much time on this. But in any case this PR (as well as my alternative approach) seem well outside a 3.2 patch release, so they'd have to wait for a 3.3 release.

fredericDelaporte · 2017-07-24T17:33:55Z

I had seen that anti-concurrency mechanism, which throws. Changing it to a synchronization mechanism is quite a serious semantic change, that is why I have not attempted to use it.

Why not doing that, but it should probably be an unconditional change. Doing it only when detecting a distributed transaction may cause behavior changes for other cases depending on whether they are in a distributed transaction or not. This could be quite cryptic for user.
(Unless the change you think about would make sure that the ongoing user action is from the commit phase, otherwise will still throw.)

roji · 2017-07-24T18:18:16Z

I had seen that anti-concurrency mechanism, which throws. Changing it to a synchronization mechanism is quite a serious semantic change, that is why I have not attempted to use it.

Sure, it's definitely a significant change. On the other hand the user action mechanism already serves as a synchronization mechanism in a partial way, to make sure that keepalives and user actions don't interfere with one another. In that sense it seems natural to extend it to handle distributed transaction asynchronicity as well.

Why not doing that, but it should probably be an unconditional change. Doing it only when detecting a distributed transaction may cause behavior changes for other cases depending on whether they are in a distributed transaction or not. This could be quite cryptic for user.

I might have not been totally clear on this... I propose to block (as opposed to throw) only during the 2-phase commit. In other words, if concurrent use of the connection (e.g. two simultaneous executions) occur, an exception will be thrown as today regardless of whether we're in the a distributed transaction or not. From the user's perspective, the connection will be in this "blocking" state only from the moment their TransactionScope is disposed, and until the transaction is actually committed an unknown amount of time later.

So to be clear - I'm not proposing that the entire transaction represent a user action. When first enlisting there would be a very short-lived user action to a) block in case a previous 2-phase commit is still in progress, and b) to throw in case of any other concurrent use (e.g. command execution and Enlist() in parallel). Then, another user action starting at the 1st phase (prepare) and terminating once the second phase completes.

Hope that makes it clearer. I will try to prepare a PR to show and test this.

fredericDelaporte · 2017-07-24T18:39:40Z

There will probably be one concurrency issue left if using the user action check: the rollback case. There is potentially no first phase in rollback cases. It could go directly to second phase. When distributed, this second phase will then run concurrently to code following the scope disposal. So the call to user action from the rollback may fail due to the code following the scope having already started another user action.

The PR here avoids that, with the condition testing for transaction status. I will probably rewrite a bit all that to make it a bit more clear.

roji · 2017-07-24T19:22:10Z

You're right about the rollback skipping first phase - that would be a problem. I'll spend a bit more time on this.

Thanks again for all the help.

roji · 2017-07-24T21:01:39Z

In theory it's possible for the rollback to again start a user action if one isn't already in progress - this would prevent rollback-without-1st-phase from interfering with user queries. However, it makes it technically possible for a user query to happen before the rollback happens, making a total mess of everything. The only real option seems to be to prevent any user queries from happening until the resource manager completes, and that doesn't seem possible with the user action approach I proposed.

fredericDelaporte · 2017-07-24T22:06:21Z

Rebased, two commits added: one for switching to ManualResetEventSlim and changing the logic for it to be more explicit, the other for simplifying second phases commit and rollback by always locking the connection instead of using another one when possible. It has the additional benefit of avoiding user not being able of reading its data right after scope disposal: the user call will be waiting till it is actually committed.

YohDeadfall · 2018-05-09T22:51:13Z

@roji Could you review it to cleanup pull requests?

fredericDelaporte · 2018-11-15T15:41:31Z

In case this may help, this is rebased.

roji · 2018-11-15T16:32:25Z

@fredericDelaporte thanks for rebasing!

Will try to get around to reviewing this ASAP - I'm back to having lots of spare time for Npgsql. However, my focus is currently the EF Core provider since the 2.2 release is right around the corner. Once that happens I'll come back to this.

YohDeadfall · 2019-10-28T13:24:14Z

@roji When will you be able to review this? How soon? (:

roji · 2019-10-28T13:40:04Z

Good question... I admit that since distributed transactions haven't been ported over to .NET Core (https://github.com/dotnet/corefx/issues/13532) I haven't given the implementation any attention lately...

roji · 2020-06-27T22:02:42Z

@fredericDelaporte sorry I didn't give this more attention back when you submitted it. Since for now there's no distributed transaction support in .NET Core (and unfortunately I'm not sure it's going to arrive anytime soon, dotnet/runtime#715), I'm going to go ahead and close it. Let's revisit if/when distributed transaction support is brought over.

fredericDelaporte force-pushed the RollbackLock branch 3 times, most recently from 5504afc to 15e7110 Compare June 22, 2017 17:59

fredericDelaporte commented Jun 22, 2017

View reviewed changes

roji force-pushed the dev branch from f31cd09 to 2297c8f Compare July 2, 2017 21:53

fredericDelaporte commented Jul 4, 2017

View reviewed changes

roji force-pushed the dev branch 2 times, most recently from 123bd92 to 1b02912 Compare July 7, 2017 10:33

roji force-pushed the dev branch 2 times, most recently from a16fe5e to 942090d Compare July 18, 2017 13:11

roji reviewed Jul 24, 2017

View reviewed changes

fredericDelaporte force-pushed the RollbackLock branch from 15e7110 to c75ad7e Compare July 24, 2017 22:02

fredericDelaporte force-pushed the RollbackLock branch from c75ad7e to a62b748 Compare July 24, 2017 22:59

roji force-pushed the dev branch 2 times, most recently from 93a7abc to cb3e6b6 Compare November 26, 2017 08:55

roji force-pushed the dev branch from 3734e73 to a93670e Compare February 11, 2018 07:48

roji force-pushed the dev branch 3 times, most recently from 5151389 to 8d35b8a Compare February 24, 2018 06:40

roji force-pushed the dev branch 2 times, most recently from 349e870 to 0fcf927 Compare March 7, 2018 16:25

roji force-pushed the dev branch from 0799697 to e6ae764 Compare April 27, 2018 12:54

roji force-pushed the dev branch from ba6d66a to 6d5c753 Compare May 27, 2018 18:18

roji force-pushed the dev branch from 5dc6fb4 to 1970723 Compare June 26, 2018 21:08

roji force-pushed the dev branch 2 times, most recently from fe46200 to 1f156c8 Compare July 20, 2018 12:57

fredericDelaporte added 3 commits November 15, 2018 16:34

Synchronization for non timedout rollback. npgsql#1611

5c58997

Synchronization timeout for rollback case refactored. npgsql#1611

9c6002a

Simplifying second phases by always locking them. npgsql#1611

bce504c

fredericDelaporte force-pushed the RollbackLock branch from a62b748 to bce504c Compare November 15, 2018 15:35

roji force-pushed the dev branch from a3bb792 to 4e700d5 Compare February 5, 2020 10:35

roji force-pushed the dev branch from dccaced to 8eb495f Compare May 13, 2020 13:40

roji closed this Jun 27, 2020

		@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
		// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
		// distributed transactions aren't supported.

Synchronization for non timed-out rollback. #1611 #1616

Synchronization for non timed-out rollback. #1611 #1616

Conversation

fredericDelaporte commented Jun 22, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fredericDelaporte Jun 22, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

roji Jul 24, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

roji commented Jun 28, 2017

fredericDelaporte commented Jun 28, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

roji left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

roji Jul 24, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

roji commented Jul 24, 2017

fredericDelaporte commented Jul 24, 2017

roji commented Jul 24, 2017

fredericDelaporte commented Jul 24, 2017

roji commented Jul 24, 2017

roji commented Jul 24, 2017

fredericDelaporte commented Jul 24, 2017

YohDeadfall commented May 9, 2018

fredericDelaporte commented Nov 15, 2018

roji commented Nov 15, 2018

YohDeadfall commented Oct 28, 2019

roji commented Oct 28, 2019

roji commented Jun 27, 2020

fredericDelaporte commented Jun 22, 2017 •

edited

Loading

fredericDelaporte Jun 22, 2017 •

edited

Loading

roji Jul 24, 2017 •

edited

Loading

roji Jul 24, 2017 •

edited

Loading