SqlException: The PROMOTE TRANSACTION request failed because there is no local transaction active #237
Comments
Hi @scale-tone Just to let you know, I was able to reproduce the issue from the provided repro and am currently investigating it.
Hi @scale-tone Could you give the latest v1.1.0-preview1 version we released today a try in your repro app? I don't see the error occurring again with this release version. I'm wondering if @Wraith2's PR #232 did the trick? Let me know how that goes!
Hi, I've been working with @scale-tone on this issue, and I am still able to reproduce it with the new package. However, it took a few minutes longer to reproduce.
I'm currently running your reproduction. How long should it run before it occurs?
Originally I was able to reproduce it after anywhere between 20 and 3000 iterations. With the new package, I had to wait until 5000 iterations. Edit: query the ImportantBusinessData table with NOLOCK and check the Id column to get the current iteration count.
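For anyone else monitoring a run, a small check along these lines works. This is only a sketch: the table and Id column are taken from this thread, while the connection string, database name, and class name are placeholders for your own setup.

```csharp
// Hypothetical progress check for the repro; adjust the connection string to your environment.
using System;
using System.Data.SqlClient;

static class IterationCheck
{
    static void Main()
    {
        var connectionString = "Server=.;Database=InconsistencyDemo;Integrated Security=true"; // placeholder
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT MAX(Id) FROM dbo.ImportantBusinessData WITH (NOLOCK)", connection))
        {
            connection.Open();
            Console.WriteLine($"Current iteration: {command.ExecuteScalar()}");
        }
    }
}
```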
My test runs in Release. I added the duration since start to the exception logging and log only the message property. Just after a minute it happens for the first time, and after 14 minutes 15 exceptions had been logged.
@ramonsmits The repro uses package reference 1.0.19269.1, which reproduces the issue quickly.
After a few more minutes (nearly 45 minutes from the beginning), I get this:
But here I see TranCount = 2; is that expected? Interestingly, the database table [ImportantBusinessData] contains an entry:
Running it with
I got 6 inserts in 2 hours. Looking into it.
@cheenamalhotra, If the insert into the
@cheenamalhotra, the reason you see different
About the time it takes to reproduce: I have noticed that if I leave my computer while the setup is running, it takes noticeably longer to reproduce. When left running overnight, it might run for several hours without any incorrect behavior, only for a burst of inconsistencies to occur the moment I wiggle the mouse and wake the screen. Running the setup on a "busy" computer seems to be the best way to consistently reproduce the problem.
I do not see any difference in behavior with the PR you mentioned. The same assert is hit immediately, and if the assert is disabled, the same exception occurs almost instantly. As I mentioned, I expect you to compile the repro against the latest sources, not against NuGet packages, and the quickest way to repro is to start and stop LoopHammer.exe multiple times. Please pay attention.
Hi @scale-tone Yes. I looked deeply into the repro later and figured out it's targeting .NET Framework, so the PR has no link to it. Also, the NuGet packages are showing weird behaviors just due to the randomness of the issue, because I know nothing has changed on .NET Framework between these versions and the issue does occur with the current source too. I will continue to investigate; I just wanted to clarify that I'm on the same page: the issue is reproduced, its frequency is intermittent, and it mostly happens on a busy machine.
Please validate whether the WCF client requires to be disposed. Have the catch followed by a finally:
finally
{
    ((IDisposable)serviceClient).Dispose();
}
Please also make sure that you use a new connection pool string, so that you will not be reusing stale, dirty connections.
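For reference, the usual guidance for WCF clients is not to rely on Dispose() alone, because Close() (which Dispose() calls) can itself throw on a faulted channel. A rough sketch of that conventional pattern follows; SomeServiceClient and DoWork are placeholders, not the repro's actual names.

```csharp
// Sketch of the conventional WCF client cleanup pattern (System.ServiceModel client);
// all names here are placeholders.
var serviceClient = new SomeServiceClient();
try
{
    serviceClient.DoWork();   // the transactional call made inside the TransactionScope
    serviceClient.Close();    // graceful close on a healthy channel
}
catch (Exception)
{
    serviceClient.Abort();    // Close()/Dispose() can throw if the channel is faulted
    throw;
}
```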
Good point, @ramonsmits, but no, this extra explicit Dispose() doesn't help; I still get the same exceptions. And can you please elaborate on "new connection pool string"? What's that?
Two months have passed. Do we have any progress?
@scale-tone A SQL connection can be 'invalid' and still be returned to the connection pool. As we are dealing with strange transaction behavior, I made that comment to ensure that this is not happening. By adding, for example
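(The concrete example appears to have been lost in formatting here. As an illustration of the idea rather than a quote: giving each test run a unique, otherwise-ignored connection-string value such as Application Name produces a distinct pool key, so the run cannot pick up connections left over from a previous session.)

```csharp
// Illustrative sketch only: a unique Application Name per run yields a distinct connection pool,
// so no connections from a previous test session can be reused.
// baseConnectionString is a placeholder; requires System.Data.SqlClient (or Microsoft.Data.SqlClient).
var builder = new SqlConnectionStringBuilder(baseConnectionString)
{
    ApplicationName = $"InconsistencyDemo-{Guid.NewGuid():N}"
};
using (var connection = new SqlConnection(builder.ConnectionString))
{
    connection.Open();
    // ... run the workload as usual ...
}
```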
@ramonsmits, as I understand it, what you're proposing is to get rid of the connection pool as such (by just using a unique connection string every time). A simpler way to achieve that is described in the WORKAROUND section above: adding "Pooling=false" to the connection string. That helps, but it certainly hurts performance severely. And I need to underscore again that the root bug has all the signs of a security issue (think of some sort of multi-tenant scenario, where tenants might be re-using connections from a pool), therefore I do expect much more attention to this from the team.
@scale-tone I'm aware of the workaround you mentioned. I made my comment about the connection pool in relation to my "explicit Dispose()" comment, to ensure that it would be tested with a fresh new pool not containing any bad connections from a previous test session. I agree that this issue seems severe enough to get more attention, as it leads to data corruption.
Although I can confirm that with your proposed change the issue stops being reproducible, I have great doubts that it is that simple. Because:
Which is quite understandable, because what you're now doing is sending something into a transaction that doesn't exist. Are you saying this is expected (or at least was expected by the author of that line)? I really doubt that.
So I would rather say that your proposed change introduces some side effect (probably extra delays due to lots of exceptions being thrown, or connections being reset) which makes the issue less visible. But the root cause should be something different.
Yes, you're right. The usage pattern in other places doesn't suggest this needs any change here. The change somehow got me confused and amused, as the Debug Assert stopped failing and so did the repro. Also, I didn't come across new exceptions; I too was running Debug and will recheck. Thanks for the quick testing! I'll continue to look further.
I think, as reported above, this issue is reproduced in earlier versions of .NET Framework too. We will be backporting the fix to the latest .NET Framework version once the issue is identified and confirmed. Are you aware of any working version of SqlClient or .NET Framework that doesn't reproduce this issue? That would make investigations a bit easier.
There's something more I'd like to share: adding the below line after creating the TransactionScope in the repro also fixes the issue, as it promotes the transaction to an MSDTC transaction. It's mainly used in the context of flowing transactions across app domains, but maybe it's working out in this case of parallelism?
cc @roji @saurabh500 |
Hi @scale-tone Interestingly, the stack trace above shows the trigger point of this error to be this method:
This suggests that, by the time this line gets executed on busy machines, the transaction has already been rolled back/aborted. So making this call is safe, as the token is generated immediately when the TransactionScope is instantiated and can be used internally by System.Transactions when needed. I also found official documentation here for this scenario, which discusses how the promotion of a transaction to a full distributed two-phase-commit transaction occurs; step 5 confirms the need for the propagation token. I've also found another blog post discussing similar problems with TransactionScope and multi-threading, where a propagation token was needed to promote the transaction to DTC. I would recommend making this call right after the transaction scope is initiated; that also appears to resolve all the issues:
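(The code block itself appears to have been dropped in this view. Reconstructed from the rest of this thread, the suggested placement would be roughly as follows; the scope options and the body comments are illustrative.)

```csharp
// Reconstructed from this thread (namespace: System.Transactions): requesting a propagation
// token right after the scope is created promotes the transaction to MSDTC immediately.
using (var scope = new TransactionScope(TransactionScopeOption.Required))
{
    TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);

    // ... SqlCommands and the transactional WCF call go here ...

    scope.Complete();
}
```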
Hi @scale-tone Waiting for your response on whether the above comment resolves the issue.
No it doesn't, @cheenamalhotra. What you're proposing is yet another workaround. Effectively, your proposed line promotes the transaction into a distributed one early, while normally that transaction is automatically promoted to a distributed one inside the MakeTransactionalWcfCallThatThrows() call (if it were not promoted at that point, the whole scenario wouldn't work). SqlClient is expected to do that automatic promotion for us, and it is expected to do it correctly; in particular, it should never lead to data loss, no matter what internal threading issues it experiences. Besides, OK, we could add that line to this particular place in our code. Would you then expect us to pepper the entire codebase with that line (because how do we know when this early promotion is needed and when it is not)? We do not need another workaround, we already have one. We need an ultimate fix, which is now less important for us, but far more important for other customers, who might not even know that they're losing data due to this!
@cheenamalhotra So far, this issue is reproduced with all .NET Framework versions since (and including) 3.0. That is the first framework version with WCF, which is a key component in this setup. I have pushed a branch to my repo with the needed changes for that version. The exception thrown in the .NET 3.0 example:
Hi @jimcarley @corivera @geleems Could you take a look and provide some insights into this issue reported by the customer? It seems to be reproducible in System.Data.SqlClient with all .NET Framework versions starting from 3.0, as mentioned above.
I tried the workaround mentioned above. This means that this workaround requires the ability to control the creation of the TransactionScope.
I have only skimmed over this thread, but when I saw the comment about "early" promotion of the transaction, it made me think of something. We had a situation where one of the enlistments, either volatile or durable, was an "EnlistDuringPrepareRequired" enlistment. This means that after commit processing was started, a new enlistment might be attempted during "Phase 0" of the commit. When the transaction commit was delegated to SQL Server through the EnlistPromotableSinglePhase enlistment, there was a problem: SQL Server would start the commit processing by issuing the commit request to MSDTC, and MSDTC would discover the Phase0 enlistment and tell that enlistment to prepare first. As part of the prepare processing, that Phase0 enlistment would try to create a new enlistment with SQL Server, but SQL Server would reject it because it thought the transaction was already in the process of committing. In other words, SQL Server didn't support the idea of a Phase0 enlistment.
But it doesn't sound like this current issue is related to that. This definitely sounds like some sort of issue with the connection pooling. It also sounds related to the situation where the transaction aborts before the first SQL operation, so a separate SQL transaction is used for the SQL operations, but when the WCF work is attempted, the aborted System.Transactions transaction is discovered. The SQL transaction(s) aren't affected and therefore commit.
Can someone modify the repro to have the application "register" for the Transaction.TransactionCompleted event and then log the time and outcome when that event is fired? This might help narrow down why the transaction might be aborting early, before the SQL enlistments are made. Also, I don't recall if there is some option to specify in the connection string that disables the automatic "switch" to using a SQL transaction if the System.Transactions transaction is not usable at the time of the enlistment.
@jimcarley
I added the following, and the issue is still happening:
Transaction.Current.TransactionCompleted += (s, e) => Console.Out.WriteLineAsync($"{Start.Elapsed}: {e.Transaction.TransactionInformation.Status}");
The reported status is always the same. The only way to currently solve this is by adding:
TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);
Hi @jimcarley I've been debugging this issue over and over, and I've noticed a strange behaviour with pooled connections. I captured SQL profiles for one round of the cycle, for the three scenarios below:
All further loops lead to the same pattern over and over again. I looked into the ServiceClient code and found the tag used on
Upon removing this tag (only for investigation purposes), I see a different pattern where propagation does not happen and the transaction is rolled back when the exception occurs: This makes sense, since the WCF service forces transaction promotion in this case. I've also attached traces below that can aid in the investigation. At this point I'm not sure what the expected behaviour is, since this issue has occurred only in WCF applications like the repro, and it has always been observed with all versions of SqlClient. To me it seems like the transactions should be promoted to DTC in the first place, but since these are pooled connections, their behaviour is different here. I'd appreciate your insights on the same!
Another note I'd like to add specifically for pooled connections: [This happens in the second round and onwards, after the Promote Transaction triggered by the WCF call]
The driver somehow doesn't know how to handle this state and marks it as "Unknown" here: After which the local copy of the pending transaction is re-used for the Begin transaction. This activation of a pooled connection also brings a "Warning" from SQL Server:
@scale-tone Did you verify whether 2.0.0-preview3 resolved your issue?
@ramonsmits, I don't see any data loss happening anymore, yes.
@cheenamalhotra, as usual, you can easily see this yourself if you compile the repro against SqlClient 2.0.0-preview3 and run/stop/run it several times. So the bug still isn't fully fixed.
We found it was introduced by a fix related to logging information; it is now fixed with #563. Please give the current 'master' branch a try and let us know if you find any other concerns.
@cheenamalhotra Do you know if this change will be back-ported to
@saurabh500 Could you provide an update?
Under certain circumstances, when a combination of SqlCommands and transactional WCF calls is run in a massively parallel manner, the current TransactionScope is occasionally "left over": SqlCommands do not get attached to it, and therefore their updates are successfully committed, while the TransactionScope itself is rolled back. This leads to severe data corruption in a Line-of-Business application of one of our corporate customers.
A stable repro is outlined in this test project: https://github.com/scale-tone/InconsistencyDemo. See the more detailed instructions on how to run it there. To reveal the issue more quickly, start and stop the LoopHammer console app several times.
The LoopHammer console app emulates message processing, consisting of the following steps:
This algorithm is executed multiple times, in parallel (just the way the real message-processing service operates). Note that the repro code creates only one TransactionScope and doesn't explicitly mix it with SqlTransactions or any other nested transactions. A condensed sketch of the pattern is shown below.
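The following is only a rough reconstruction of the per-iteration algorithm from the descriptions in this thread (MakeTransactionalWcfCallThatThrows, the ImportantBusinessData table, and MaxDegreeOfParallelism = 4 are mentioned above; the INSERT statement and everything else are illustrative). See the linked repository for the actual code.

```csharp
// Rough sketch of the repro's loop; requires System.Threading.Tasks, System.Transactions,
// and System.Data.SqlClient (or Microsoft.Data.SqlClient). connectionString is a placeholder.
Parallel.For(0, int.MaxValue, new ParallelOptions { MaxDegreeOfParallelism = 4 }, i =>
{
    try
    {
        using (var scope = new TransactionScope())
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "INSERT INTO dbo.ImportantBusinessData DEFAULT VALUES", connection))
            {
                connection.Open();          // enlists in the ambient transaction
                command.ExecuteNonQuery();  // this write must roll back together with the scope
            }

            MakeTransactionalWcfCallThatThrows(); // promotes the transaction to MSDTC, then throws

            scope.Complete();               // never reached, so nothing should ever be committed
        }
    }
    catch (Exception)
    {
        // Expected path: the WCF call throws, the scope rolls back, and the insert above is discarded.
    }
});
```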
Expected behavior
The distributed transaction is rolled back entirely; nothing is written to the DB.
Actual behavior
After seconds to minutes of normal execution, the WCF call fails with the following (unexpected!) exception:
This happens presumably because, at the moment of the promotion attempt, the TransactionScope (not the scope itself, but some underlying internal transaction object) is already in an aborted state. Because it is in that state, the above SQL statements do not get attached to it and get committed immediately. So the TransactionPromotionException is just the visible part of the problem, while the actual issue occurs somewhere between the TransactionScope creation and the execution of the SqlCommands. We're able to detect that faulty state of the TransactionScope beforehand by executing "select @@trancount" against the DB: in the failure case it turns out to be 0 (normally it should be 1).
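The @@trancount check described above can be expressed roughly like this (a sketch; the connection string is a placeholder):

```csharp
// Sketch of the detection described above: inside a healthy TransactionScope, a freshly
// opened and enlisted connection should report exactly one open server-side transaction.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("SELECT @@TRANCOUNT", connection))
{
    connection.Open(); // enlists in the ambient System.Transactions transaction, if there is one
    var tranCount = (int)command.ExecuteScalar();
    if (tranCount != 1)
    {
        // Faulty state: the connection never attached to the TransactionScope,
        // so any subsequent writes on it would commit on their own.
        throw new InvalidOperationException($"Unexpected @@TRANCOUNT = {tranCount}");
    }
}
```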
WORKAROUND: if we disable SqlConnection pooling (add "Pooling=false" to the connection string), the problem disappears. This makes us think that the bug is somewhere in the connection-sharing mechanism. Probably a connection is sometimes not fully detached from one thread and occasionally allows that thread to influence a TransactionScope belonging to another thread. If that is the case, then it can not only lead to data corruption but also introduce serious security issues.
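In connection-string form, the workaround looks like this (only the Pooling=false part comes from the report above; the rest of the snippet is illustrative):

```csharp
// Workaround from above: disable connection pooling entirely. Correctness is restored,
// but every Open() now creates a new physical connection, which is costly.
var builder = new SqlConnectionStringBuilder(connectionString) { Pooling = false };
using (var connection = new SqlConnection(builder.ConnectionString))
{
    connection.Open();
    // ... SqlCommands as usual ...
}
```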
The problem is reproducible with the latest Microsoft.Data.SqlClient v1.0.19269.1. Originally it was detected in System.Data.SqlClient coming from .NET Framework; see this variant of the same repro project: https://github.com/samegutt/InconsistencyDemo.
The .NET Framework version doesn't matter (tried with every version from 4.5.1 to 4.8).
When compiled against sources from this repo, the problem is also reproducible.
If we compile and start LoopHammer.exe in Debug mode, we immediately hit this debug assert. If we disable it, further execution leads either to the same TransactionPromotionException or to some others:
or even to:
A transactional WCF call is an essential part of the scenario. Surprisingly, forcing a transaction promotion by other means (e.g. by accessing another DB) does not lead to the same problem.
For some reason, setting MaxDegreeOfParallelism to 4 maximizes the chances of the issue appearing. Experiments with the connection pool size didn't show any difference.
As mentioned above, the issue causes serious damage to the customer's large LoB system. Furthermore, there are concerns that it might indicate a severe security issue (multiple threads being able to influence each other's transaction flow).