-
Notifications
You must be signed in to change notification settings - Fork 822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Synchronization for non timed-out rollback. #1611 #1616
Conversation
5504afc
to
15e7110
Compare
@@ -501,16 +501,17 @@ public new NpgsqlTransaction BeginTransaction(IsolationLevel level) | |||
/// <summary> | |||
/// Enlist transation. | |||
/// </summary> | |||
public override void EnlistTransaction(Transaction transaction) | |||
public override void EnlistTransaction([CanBeNull]Transaction transaction) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For having R# happy. (I frequently end up just removing that beast from my box.)
{ | ||
if (EnlistedTransaction != null) | ||
var enlistedTransaction = EnlistedTransaction; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing in a local variable for avoiding race with ressource manager setting it to null.
@@ -521,6 +522,7 @@ public override void EnlistTransaction(Transaction transaction) | |||
} | |||
|
|||
var connector = CheckReadyAndGetConnector(); | |||
connector.EnlistedResource?.WaitIfRequired(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Required for avoiding replacing the RM on the connector while the current RM still needs the connector (transaction no more active but second phase not yet ended).
@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction) | |||
// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case - | |||
// distributed transactions aren't supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment looks not completely right to me. A durable resource does not enforce promotion to distributed, provided it supports single phase and is the only one enlisted. Unless Mono does not support durable resources at all, better use durable resource with Mono too, once implemented. (Confirmed by #1592 discussion.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I've learned some things since I wrote that comment :)
@@ -1506,6 +1515,10 @@ internal UserAction StartUserAction(NpgsqlCommand command) | |||
|
|||
internal UserAction StartUserAction(ConnectorState newState=ConnectorState.Executing, NpgsqlCommand command=null) | |||
{ | |||
#if !NETSTANDARD1_3 | |||
EnlistedResource?.WaitIfRequired(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main point protecting from concurrent use when the RM has not finished its work. Of course EnlistedResource
needs to be still there when the RM has not finished its work, thus the sync check in Connection.EnlistTransaction
.
{ | ||
if (_secondPhaseEndedOrNotNeedingWait || _isRMCall.Value || | ||
_transaction?.TransactionInformation.Status == System.Transactions.TransactionStatus.Active) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No wait if the second phase does not mandate it, if we are the thread executing the second phase, or if the transaction is still on-going.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an optimization to avoid waiting on the semaphore if we don't need to (even when it's not taken)? If so, I do understand the motivation for this optimization - WaitIfRequired()
gets called from StartUserAction, which makes it a hot path - just trying to make sure I'm not missing something else.
EDIT: on second thought, if this is purely an optimization, then it only optimizes StartUserAction
in cases where the connector's EnlistedResource
is set - which is perhaps a bit limited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not an optimization, but I should simplify this by using a ManualResetEventSlim
instead.
catch (ObjectDisposedException) | ||
{ | ||
// Ignore. The transaction is already disposed, but the second phase is not ended. | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We go there in the rollback case: the transaction is disposed while the rollback not yet done.
bool waitSuccess; | ||
try | ||
{ | ||
waitSuccess = _secondPhaseLock.Wait(5000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second phase should not take long. Better not freeze the app if this lock proves buggy someday. But could be a bit annoying when debugging.
// Avoid un-enlisting another RM in case the connection has been concurrently re-used in a new | ||
// transaction. | ||
if (_connector.EnlistedResource == this) | ||
_connector.EnlistedResource = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case a prepare phase has been run, we have already set it to null, so with the test, no risk, even with concurrency, to override another RM.
In the rollback case, concurrency is still locked at that point.
@@ -326,6 +375,8 @@ void Dispose() | |||
else | |||
_connector.Close(); | |||
} | |||
_isRMCall.Dispose(); | |||
_secondPhaseLock.Dispose(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dispose
should be used only if all operation on the semaphore have completed. We have un-referenced the RM for avoiding new waiting threads, then released the lock. But does it immediately cause the waiting threads to no more use the semaphore? I am a bit unsure here, so I have put that try catch around the wait on the semaphore, in case there is a race condition between the signalling with Release
, the actual end of thread waits, and the disposal here.
@fredericDelaporte - thanks for submitting this, I promise to take a look very soon, am currently involved in a pretty huge refactor. |
There is no hurry on my side. For NHibernate issues with Npgsql, we only need #1611. |
if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active) | ||
throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})"); | ||
if (enlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active) | ||
throw new InvalidOperationException($"Already enlisted to transaction (localid={enlistedTransaction.TransactionInformation.LocalIdentifier})"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New bug discovered: after scope disposal in distributed case, the transaction status may still be Active
! This happens when the supplied transaction was not the original one but a .Clone()
of it... Illustrated in this NHibernate test case. Waiting a bit allows to get the correct status...
I do not think we can do much against that later case. The main usual case (enlisting with the original transaction, not with a clone) will at least be protected by that code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New bug discovered: after scope disposal in distributed case, the transaction status may still be Active!
I actually thought that could be the case simply because of the asynchronous nature of the DTC implementation...
123bd92
to
1b02912
Compare
a16fe5e
to
942090d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fredericDelaporte, thanks for this PR. Sorry it took me a month to get around to it, I redid the entire Npgsql type handling and writing system in the meantime :)
I have written some comments, but before looking at those please look at my larger comment outside of this review for an alternative approach to solving this problem.
@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction) | |||
// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case - | |||
// distributed transactions aren't supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I've learned some things since I wrote that comment :)
src/Npgsql/NpgsqlConnection.cs
Outdated
@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction) | |||
// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case - | |||
// distributed transactions aren't supported. | |||
|
|||
transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None); | |||
var ressourceManager = new VolatileResourceManager(this, transaction); | |||
connector.EnlistedResource = ressourceManager; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: one s for resource (I'm half-French so I understand)
src/Npgsql/NpgsqlConnector.cs
Outdated
@@ -139,6 +139,15 @@ sealed partial class NpgsqlConnector : IDisposable | |||
[CanBeNull] | |||
internal NpgsqlConnection Connection { get; set; } | |||
|
|||
#if !NETSTANDARD1_3 | |||
/// <summary> | |||
/// The ressource manager actually using this connector. <see langword="null" /> if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "resource" as above
@@ -40,7 +40,7 @@ namespace Npgsql | |||
/// Note that a connection may be closed before its TransactionScope completes. In this case we close the NpgsqlConnection | |||
/// as usual but the connector in a special list in the pool; it will be closed only when the scope completes. | |||
/// </remarks> | |||
class VolatileResourceManager : ISinglePhaseNotification | |||
sealed class VolatileResourceManager : ISinglePhaseNotification, IDisposable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not
src/Npgsql/NpgsqlConnection.cs
Outdated
@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction) | |||
// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case - | |||
// distributed transactions aren't supported. | |||
|
|||
transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None); | |||
var ressourceManager = new VolatileResourceManager(this, transaction); | |||
connector.EnlistedResource = ressourceManager; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, seems better to set EnlistedResource inside VolatileResourceManager
.
@@ -138,6 +143,7 @@ public void Commit(Enlistment enlistment) | |||
|
|||
try | |||
{ | |||
_isRMCall.Value = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be easier to simply record Thread Thread.CurrentThread.ManagedThreadId
on the RM and then check in WaitIfRequired if the current thread ID matches that? Seems simpler than a ThreadLocal unless I'm missing something...
{ | ||
if (_secondPhaseEndedOrNotNeedingWait || _isRMCall.Value || | ||
_transaction?.TransactionInformation.Status == System.Transactions.TransactionStatus.Active) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an optimization to avoid waiting on the semaphore if we don't need to (even when it's not taken)? If so, I do understand the motivation for this optimization - WaitIfRequired()
gets called from StartUserAction, which makes it a hot path - just trying to make sure I'm not missing something else.
EDIT: on second thought, if this is purely an optimization, then it only optimizes StartUserAction
in cases where the connector's EnlistedResource
is set - which is perhaps a bit limited.
if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active) | ||
throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})"); | ||
if (enlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active) | ||
throw new InvalidOperationException($"Already enlisted to transaction (localid={enlistedTransaction.TransactionInformation.LocalIdentifier})"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New bug discovered: after scope disposal in distributed case, the transaction status may still be Active!
I actually thought that could be the case simply because of the asynchronous nature of the DTC implementation...
Here are some thought I had about a different approach to this problem - this may be an advantage of disconnecting for a few weeks. It occurred to me that Npgsql already has a mechanism for synchronizing access to a connection - the so-called user action. Rather than adding an additional semaphore ( IMHO this approach would have the advantage of leveraging the existing general Npgsql synchronization, resulting in significant simplification; for example, we'd no longer need the special The only small issue I can see with this approach is that the current user action mechanism simply throws on concurrent use, whereas if a 2-phase commit is in progress we'd like the concurrent attempt to wait until it's done. It should be pretty trivial to work around this: we could add a new One side advantage would be to protect calls to What are your thoughts about this approach? I'd be glad to get your comments in case I'm missing something with the above. I could also try to put together a PR and see where it leads. On an unrelated note, unfortunately tomorrow I'm leaving for a 3-month project that will probably not permit me to spend much time on this. But in any case this PR (as well as my alternative approach) seem well outside a 3.2 patch release, so they'd have to wait for a 3.3 release. |
I had seen that anti-concurrency mechanism, which throws. Changing it to a synchronization mechanism is quite a serious semantic change, that is why I have not attempted to use it. Why not doing that, but it should probably be an unconditional change. Doing it only when detecting a distributed transaction may cause behavior changes for other cases depending on whether they are in a distributed transaction or not. This could be quite cryptic for user. |
Sure, it's definitely a significant change. On the other hand the user action mechanism already serves as a synchronization mechanism in a partial way, to make sure that keepalives and user actions don't interfere with one another. In that sense it seems natural to extend it to handle distributed transaction asynchronicity as well.
I might have not been totally clear on this... I propose to block (as opposed to throw) only during the 2-phase commit. In other words, if concurrent use of the connection (e.g. two simultaneous executions) occur, an exception will be thrown as today regardless of whether we're in the a distributed transaction or not. From the user's perspective, the connection will be in this "blocking" state only from the moment their TransactionScope is disposed, and until the transaction is actually committed an unknown amount of time later. So to be clear - I'm not proposing that the entire transaction represent a user action. When first enlisting there would be a very short-lived user action to a) block in case a previous 2-phase commit is still in progress, and b) to throw in case of any other concurrent use (e.g. command execution and Enlist() in parallel). Then, another user action starting at the 1st phase (prepare) and terminating once the second phase completes. Hope that makes it clearer. I will try to prepare a PR to show and test this. |
There will probably be one concurrency issue left if using the user action check: the rollback case. There is potentially no first phase in rollback cases. It could go directly to second phase. When distributed, this second phase will then run concurrently to code following the scope disposal. So the call to user action from the rollback may fail due to the code following the scope having already started another user action. The PR here avoids that, with the condition testing for transaction status. I will probably rewrite a bit all that to make it a bit more clear. |
You're right about the rollback skipping first phase - that would be a problem. I'll spend a bit more time on this. Thanks again for all the help. |
In theory it's possible for the rollback to again start a user action if one isn't already in progress - this would prevent rollback-without-1st-phase from interfering with user queries. However, it makes it technically possible for a user query to happen before the rollback happens, making a total mess of everything. The only real option seems to be to prevent any user queries from happening until the resource manager completes, and that doesn't seem possible with the user action approach I proposed. |
15e7110
to
c75ad7e
Compare
Rebased, two commits added: one for switching to |
c75ad7e
to
a62b748
Compare
93a7abc
to
cb3e6b6
Compare
5151389
to
8d35b8a
Compare
349e870
to
0fcf927
Compare
@roji Could you review it to cleanup pull requests? |
fe46200
to
1f156c8
Compare
a62b748
to
bce504c
Compare
In case this may help, this is rebased. |
@fredericDelaporte thanks for rebasing! Will try to get around to reviewing this ASAP - I'm back to having lots of spare time for Npgsql. However, my focus is currently the EF Core provider since the 2.2 release is right around the corner. Once that happens I'll come back to this. |
@roji When will you be able to review this? How soon? (: |
Good question... I admit that since distributed transactions haven't been ported over to .NET Core (https://github.com/dotnet/corefx/issues/13532) I haven't given the implementation any attention lately... |
@fredericDelaporte sorry I didn't give this more attention back when you submitted it. Since for now there's no distributed transaction support in .NET Core (and unfortunately I'm not sure it's going to arrive anytime soon, dotnet/runtime#715), I'm going to go ahead and close it. Let's revisit if/when distributed transaction support is brought over. |
Here is a working solution for avoiding the race condition after a distributed rollback. It does not even require the
EnlistTransaction(null)
. But it does not handle the troubles with timeouts.Note that locally I am unable to compile the new target netstandard 2.0. (Not having preview installed.)
I am going to comment the proposed changes.