Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronization for non timed-out rollback. #1611 #1616

Closed
wants to merge 3 commits into from

Conversation

fredericDelaporte
Copy link
Contributor

@fredericDelaporte fredericDelaporte commented Jun 22, 2017

Here is a working solution for avoiding the race condition after a distributed rollback. It does not even require the EnlistTransaction(null). But it does not handle the troubles with timeouts.

Note that locally I am unable to compile the new target netstandard 2.0. (Not having preview installed.)

I am going to comment the proposed changes.

@fredericDelaporte fredericDelaporte force-pushed the RollbackLock branch 3 times, most recently from 5504afc to 15e7110 Compare June 22, 2017 17:59
@@ -501,16 +501,17 @@ public new NpgsqlTransaction BeginTransaction(IsolationLevel level)
/// <summary>
/// Enlist transation.
/// </summary>
public override void EnlistTransaction(Transaction transaction)
public override void EnlistTransaction([CanBeNull]Transaction transaction)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For having R# happy. (I frequently end up just removing that beast from my box.)

{
if (EnlistedTransaction != null)
var enlistedTransaction = EnlistedTransaction;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing in a local variable for avoiding race with ressource manager setting it to null.

@@ -521,6 +522,7 @@ public override void EnlistTransaction(Transaction transaction)
}

var connector = CheckReadyAndGetConnector();
connector.EnlistedResource?.WaitIfRequired();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required for avoiding replacing the RM on the connector while the current RM still needs the connector (transaction no more active but second phase not yet ended).

@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
// distributed transactions aren't supported.
Copy link
Contributor Author

@fredericDelaporte fredericDelaporte Jun 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment looks not completely right to me. A durable resource does not enforce promotion to distributed, provided it supports single phase and is the only one enlisted. Unless Mono does not support durable resources at all, better use durable resource with Mono too, once implemented. (Confirmed by #1592 discussion.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I've learned some things since I wrote that comment :)

@@ -1506,6 +1515,10 @@ internal UserAction StartUserAction(NpgsqlCommand command)

internal UserAction StartUserAction(ConnectorState newState=ConnectorState.Executing, NpgsqlCommand command=null)
{
#if !NETSTANDARD1_3
EnlistedResource?.WaitIfRequired();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main point protecting from concurrent use when the RM has not finished its work. Of course EnlistedResource needs to be still there when the RM has not finished its work, thus the sync check in Connection.EnlistTransaction.

{
if (_secondPhaseEndedOrNotNeedingWait || _isRMCall.Value ||
_transaction?.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
return;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No wait if the second phase does not mandate it, if we are the thread executing the second phase, or if the transaction is still on-going.

Copy link
Member

@roji roji Jul 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an optimization to avoid waiting on the semaphore if we don't need to (even when it's not taken)? If so, I do understand the motivation for this optimization - WaitIfRequired() gets called from StartUserAction, which makes it a hot path - just trying to make sure I'm not missing something else.

EDIT: on second thought, if this is purely an optimization, then it only optimizes StartUserAction in cases where the connector's EnlistedResource is set - which is perhaps a bit limited.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not an optimization, but I should simplify this by using a ManualResetEventSlim instead.

catch (ObjectDisposedException)
{
// Ignore. The transaction is already disposed, but the second phase is not ended.
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We go there in the rollback case: the transaction is disposed while the rollback not yet done.

bool waitSuccess;
try
{
waitSuccess = _secondPhaseLock.Wait(5000);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second phase should not take long. Better not freeze the app if this lock proves buggy someday. But could be a bit annoying when debugging.

// Avoid un-enlisting another RM in case the connection has been concurrently re-used in a new
// transaction.
if (_connector.EnlistedResource == this)
_connector.EnlistedResource = null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case a prepare phase has been run, we have already set it to null, so with the test, no risk, even with concurrency, to override another RM.
In the rollback case, concurrency is still locked at that point.

@@ -326,6 +375,8 @@ void Dispose()
else
_connector.Close();
}
_isRMCall.Dispose();
_secondPhaseLock.Dispose();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dispose should be used only if all operation on the semaphore have completed. We have un-referenced the RM for avoiding new waiting threads, then released the lock. But does it immediately cause the waiting threads to no more use the semaphore? I am a bit unsure here, so I have put that try catch around the wait on the semaphore, in case there is a race condition between the signalling with Release, the actual end of thread waits, and the disposal here.

@roji
Copy link
Member

roji commented Jun 28, 2017

@fredericDelaporte - thanks for submitting this, I promise to take a look very soon, am currently involved in a pretty huge refactor.

@fredericDelaporte
Copy link
Contributor Author

There is no hurry on my side. For NHibernate issues with Npgsql, we only need #1611.

if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})");
if (enlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
throw new InvalidOperationException($"Already enlisted to transaction (localid={enlistedTransaction.TransactionInformation.LocalIdentifier})");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New bug discovered: after scope disposal in distributed case, the transaction status may still be Active! This happens when the supplied transaction was not the original one but a .Clone() of it... Illustrated in this NHibernate test case. Waiting a bit allows to get the correct status...

I do not think we can do much against that later case. The main usual case (enlisting with the original transaction, not with a clone) will at least be protected by that code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New bug discovered: after scope disposal in distributed case, the transaction status may still be Active!

I actually thought that could be the case simply because of the asynchronous nature of the DTC implementation...

@roji roji force-pushed the dev branch 2 times, most recently from 123bd92 to 1b02912 Compare July 7, 2017 10:33
@roji roji force-pushed the dev branch 2 times, most recently from a16fe5e to 942090d Compare July 18, 2017 13:11
Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fredericDelaporte, thanks for this PR. Sorry it took me a month to get around to it, I redid the entire Npgsql type handling and writing system in the meantime :)

I have written some comments, but before looking at those please look at my larger comment outside of this review for an alternative approach to solving this problem.

@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
// distributed transactions aren't supported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I've learned some things since I wrote that comment :)

@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
// distributed transactions aren't supported.

transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None);
var ressourceManager = new VolatileResourceManager(this, transaction);
connector.EnlistedResource = ressourceManager;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: one s for resource (I'm half-French so I understand)

@@ -139,6 +139,15 @@ sealed partial class NpgsqlConnector : IDisposable
[CanBeNull]
internal NpgsqlConnection Connection { get; set; }

#if !NETSTANDARD1_3
/// <summary>
/// The ressource manager actually using this connector. <see langword="null" /> if the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "resource" as above

@@ -40,7 +40,7 @@ namespace Npgsql
/// Note that a connection may be closed before its TransactionScope completes. In this case we close the NpgsqlConnection
/// as usual but the connector in a special list in the pool; it will be closed only when the scope completes.
/// </remarks>
class VolatileResourceManager : ISinglePhaseNotification
sealed class VolatileResourceManager : ISinglePhaseNotification, IDisposable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not

@@ -532,7 +534,9 @@ public override void EnlistTransaction(Transaction transaction)
// Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
// distributed transactions aren't supported.

transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None);
var ressourceManager = new VolatileResourceManager(this, transaction);
connector.EnlistedResource = ressourceManager;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, seems better to set EnlistedResource inside VolatileResourceManager.

@@ -138,6 +143,7 @@ public void Commit(Enlistment enlistment)

try
{
_isRMCall.Value = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be easier to simply record Thread Thread.CurrentThread.ManagedThreadId on the RM and then check in WaitIfRequired if the current thread ID matches that? Seems simpler than a ThreadLocal unless I'm missing something...

{
if (_secondPhaseEndedOrNotNeedingWait || _isRMCall.Value ||
_transaction?.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
return;
Copy link
Member

@roji roji Jul 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an optimization to avoid waiting on the semaphore if we don't need to (even when it's not taken)? If so, I do understand the motivation for this optimization - WaitIfRequired() gets called from StartUserAction, which makes it a hot path - just trying to make sure I'm not missing something else.

EDIT: on second thought, if this is purely an optimization, then it only optimizes StartUserAction in cases where the connector's EnlistedResource is set - which is perhaps a bit limited.

if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})");
if (enlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
throw new InvalidOperationException($"Already enlisted to transaction (localid={enlistedTransaction.TransactionInformation.LocalIdentifier})");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New bug discovered: after scope disposal in distributed case, the transaction status may still be Active!

I actually thought that could be the case simply because of the asynchronous nature of the DTC implementation...

@roji
Copy link
Member

roji commented Jul 24, 2017

Here are some thought I had about a different approach to this problem - this may be an advantage of disconnecting for a few weeks.

It occurred to me that Npgsql already has a mechanism for synchronizing access to a connection - the so-called user action. Rather than adding an additional semaphore (_secondPhaseLock), we could view the RM as, well, another user of the connection - we could simply start a user action when Prepare() is called, and end the user action when the RM is disposed. NpgsqlConnection.EnlistTransaction() would also start a user action, ending it right away - its sole purpose would be to make use block until any existing 2nd-phase commit is over.

IMHO this approach would have the advantage of leveraging the existing general Npgsql synchronization, resulting in significant simplification; for example, we'd no longer need the special WaitIfRequired(), simply relying on the existing StartUserAction() mechanism to make us block as needed.

The only small issue I can see with this approach is that the current user action mechanism simply throws on concurrent use, whereas if a 2-phase commit is in progress we'd like the concurrent attempt to wait until it's done. It should be pretty trivial to work around this: we could add a new ConnectorState.DistributedTransactionCommit state, and check for it specifically in StartUserAction(), waiting instead of throwing.

One side advantage would be to protect calls to NpgsqlConnection.EnlistTransaction() from (mistaken) concurrent use, just like we do with concurrent commands - a clear, informative exception would be thrown instead of the undefined behavior we have today.

What are your thoughts about this approach? I'd be glad to get your comments in case I'm missing something with the above. I could also try to put together a PR and see where it leads.

On an unrelated note, unfortunately tomorrow I'm leaving for a 3-month project that will probably not permit me to spend much time on this. But in any case this PR (as well as my alternative approach) seem well outside a 3.2 patch release, so they'd have to wait for a 3.3 release.

@fredericDelaporte
Copy link
Contributor Author

I had seen that anti-concurrency mechanism, which throws. Changing it to a synchronization mechanism is quite a serious semantic change, that is why I have not attempted to use it.

Why not doing that, but it should probably be an unconditional change. Doing it only when detecting a distributed transaction may cause behavior changes for other cases depending on whether they are in a distributed transaction or not. This could be quite cryptic for user.
(Unless the change you think about would make sure that the ongoing user action is from the commit phase, otherwise will still throw.)

@roji
Copy link
Member

roji commented Jul 24, 2017

I had seen that anti-concurrency mechanism, which throws. Changing it to a synchronization mechanism is quite a serious semantic change, that is why I have not attempted to use it.

Sure, it's definitely a significant change. On the other hand the user action mechanism already serves as a synchronization mechanism in a partial way, to make sure that keepalives and user actions don't interfere with one another. In that sense it seems natural to extend it to handle distributed transaction asynchronicity as well.

Why not doing that, but it should probably be an unconditional change. Doing it only when detecting a distributed transaction may cause behavior changes for other cases depending on whether they are in a distributed transaction or not. This could be quite cryptic for user.

I might have not been totally clear on this... I propose to block (as opposed to throw) only during the 2-phase commit. In other words, if concurrent use of the connection (e.g. two simultaneous executions) occur, an exception will be thrown as today regardless of whether we're in the a distributed transaction or not. From the user's perspective, the connection will be in this "blocking" state only from the moment their TransactionScope is disposed, and until the transaction is actually committed an unknown amount of time later.

So to be clear - I'm not proposing that the entire transaction represent a user action. When first enlisting there would be a very short-lived user action to a) block in case a previous 2-phase commit is still in progress, and b) to throw in case of any other concurrent use (e.g. command execution and Enlist() in parallel). Then, another user action starting at the 1st phase (prepare) and terminating once the second phase completes.

Hope that makes it clearer. I will try to prepare a PR to show and test this.

@fredericDelaporte
Copy link
Contributor Author

There will probably be one concurrency issue left if using the user action check: the rollback case. There is potentially no first phase in rollback cases. It could go directly to second phase. When distributed, this second phase will then run concurrently to code following the scope disposal. So the call to user action from the rollback may fail due to the code following the scope having already started another user action.

The PR here avoids that, with the condition testing for transaction status. I will probably rewrite a bit all that to make it a bit more clear.

@roji
Copy link
Member

roji commented Jul 24, 2017

You're right about the rollback skipping first phase - that would be a problem. I'll spend a bit more time on this.

Thanks again for all the help.

@roji
Copy link
Member

roji commented Jul 24, 2017

In theory it's possible for the rollback to again start a user action if one isn't already in progress - this would prevent rollback-without-1st-phase from interfering with user queries. However, it makes it technically possible for a user query to happen before the rollback happens, making a total mess of everything. The only real option seems to be to prevent any user queries from happening until the resource manager completes, and that doesn't seem possible with the user action approach I proposed.

@fredericDelaporte
Copy link
Contributor Author

Rebased, two commits added: one for switching to ManualResetEventSlim and changing the logic for it to be more explicit, the other for simplifying second phases commit and rollback by always locking the connection instead of using another one when possible. It has the additional benefit of avoiding user not being able of reading its data right after scope disposal: the user call will be waiting till it is actually committed.

@YohDeadfall
Copy link
Contributor

@roji Could you review it to cleanup pull requests?

@fredericDelaporte
Copy link
Contributor Author

In case this may help, this is rebased.

@roji
Copy link
Member

roji commented Nov 15, 2018

@fredericDelaporte thanks for rebasing!

Will try to get around to reviewing this ASAP - I'm back to having lots of spare time for Npgsql. However, my focus is currently the EF Core provider since the 2.2 release is right around the corner. Once that happens I'll come back to this.

@YohDeadfall
Copy link
Contributor

@roji When will you be able to review this? How soon? (:

@roji
Copy link
Member

roji commented Oct 28, 2019

Good question... I admit that since distributed transactions haven't been ported over to .NET Core (https://github.com/dotnet/corefx/issues/13532) I haven't given the implementation any attention lately...

@roji
Copy link
Member

roji commented Jun 27, 2020

@fredericDelaporte sorry I didn't give this more attention back when you submitted it. Since for now there's no distributed transaction support in .NET Core (and unfortunately I'm not sure it's going to arrive anytime soon, dotnet/runtime#715), I'm going to go ahead and close it. Let's revisit if/when distributed transaction support is brought over.

@roji roji closed this Jun 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants