Async reader may return wrong results on MacOS, Linux, WSL, Docker #659
Hi @Nercury, thank you for producing the sample app and describing the issue well. Do you have any comments?
Yes, this matches my experience. As I mentioned, the issue seems to be related to timeouts, and they cannot happen if the DB server and the code are on the same machine and the query is doing basically nothing. Try to introduce some latency between the DB server and the application. I reproduced it using a somewhat crude approach, by running the DB server in the cloud and the application on my laptop with a router in another room (so there is considerable latency between the two). Maybe there are tools to do it a bit more cleanly. In a production environment, AWS starts throttling read operations when they go above the allowed IOPS limit; that's how it originally happened when running on AWS.
By replacing the data source with a slower connection over the network, I am able to reproduce the wrong result.
Thanks for checking it out. I am not sure I completely understand what you mean by "added a local index inside the QueryAndVerify method to keep the index parameter value and used this local variable instead of that parameter inside the function body", because the
While I was preparing the requested changes, I found out it is happening again. Maybe the network conditions had coincidentally changed at the same time as my last change, affecting the test.
In between investigations, I found these mitigations besides the workaround already mentioned, disabling the MARS feature.
@Nercury Can you give them a try one by one?
Set Pooling = false
Removing the connection that encountered an exception
Got wrong results on a third attempt.

Attempting to execute a SQL command without attaching a transaction to the command
This was attempted in production when it was still unclear whether this was a MARS issue. It makes wrong results rarer, but they still happen. I was not able to reproduce it in this test project with a simplistic select statement.

Using ExecuteScalarAsync instead of ExecuteReaderAsync
Tried it 5 times and wrong results did not happen.

@DavoudEshtehari I hope this helps.
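For illustration, a minimal sketch of the ExecuteScalarAsync variant discussed above (the method shape and names are assumptions, not code from the issue):

```cs
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class ScalarMitigation
{
    // Runs the same "select @Id as Id" query via ExecuteScalarAsync,
    // the variant that did not produce wrong results in the tests above.
    public static async Task<int> QueryIdAsync(SqlConnection connection, int id)
    {
        using (var command = new SqlCommand("select @Id as Id", connection))
        {
            command.Parameters.AddWithValue("@Id", id);
            return (int)await command.ExecuteScalarAsync();
        }
    }
}
```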
Same bug experienced when MARS is disabled: We are experiencing a similar problem in our code base, although we have MARS disabled explicitly in the connection string and still see the bug. We run this method across multiple incoming requests from an API controller:
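The code block from this comment did not survive extraction; the following is a hedged reconstruction of the shape described (names, types, and the use of Microsoft.Data.SqlClient are assumptions), based on the later mentions of Dapper's QueryAsync and a connectionFactory:

```cs
using System;
using System.Linq;
using System.Threading.Tasks;
using Dapper;
using Microsoft.Data.SqlClient;

public class LookupRepository
{
    private readonly Func<SqlConnection> connectionFactory;

    public LookupRepository(Func<SqlConnection> connectionFactory)
        => this.connectionFactory = connectionFactory;

    // Called concurrently from an API controller; each call opens its own
    // pooled connection and runs a parameterised query via Dapper.
    public async Task<T> GetValueAsync<T>(string sql, object parameters)
    {
        using (var connection = connectionFactory())
        {
            await connection.OpenAsync();
            var rows = await connection.QueryAsync<T>(sql, parameters);
            return rows.FirstOrDefault();
        }
    }
}
```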
The connectionFactory simply returns a new SqlConnection:
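Again reconstructed under the same assumptions (the connection string shown is a placeholder):

```cs
// Hypothetical stand-in for the factory described above: a new connection
// per call, with MARS explicitly disabled in the connection string.
Func<SqlConnection> connectionFactory = () =>
    new SqlConnection("Server=...;Database=...;MultipleActiveResultSets=False");
```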
Things will run normally, then a single instance will experience a SQL exception - previously there was an unexpected (and denied) request to perform a SHOWPLAN, sometimes it's a SQL timeout exception. Following that exception, serialisation bugs will continue to appear on that instance until it is taken out of service. The source of the serialisation failure is that the value stored in

Once the instance has been out of service for a while, subsequent attempts to access the DB via this method complete successfully, so it seems to have the ability to resolve itself. Even when the instance is misbehaving, it appears that not all requests get jumbled up, as some manage to succeed.

Previously this service was hosted on Windows without any issues. Since the port to Linux this problem crops up every 7-30 days or so. We've also seen it on more than one service (the above code is in a shared library). This bug seems to be the closest thing to what we are experiencing, except that we have explicitly disabled MARS in the connection string and are still seeing it from time to time.

Libraries/Infrastructure involved: this is being run on an AWS T3 instance hosting Ubuntu 18.04.

We are going to attempt changing the call from QueryAsync to ExecuteScalarAsync and are attempting to capture some more information to help us debug the issue. If the change to ExecuteScalarAsync resolves the problem for at least a month or so then I'll add an extra comment here indicating that it's a possible fix.
Could you please try your use cases with PR #796? I ran the provided repro and the issue appears fixed. Kindly confirm. Edit: This issue remains unfixed for now, due to colliding changes in different async paths.
@cheenamalhotra All right, it makes sense to wait until all fixes are in.
Following on from @gyphysics (same company, different team):
We have also seen some problems on INSERT, where no exception was raised, but on checking later the data had not been written to the table. There were gaps in the IDENTITY column, suggesting a possible silent transaction rollback (the exception didn't come back to the originating async context?). Possibly unrelated, but it seems close enough to note here.
Hello! I've created a new repro putting together items from this discussion: https://github.com/trainline/SqlClient659. I can confirm the problem can be reproduced even without MARS, although it is less frequent.
@cheenamalhotra Any suggestion on how to try to pinpoint the origin of the problem? Is there a diagnostic source available in the SqlClient? |
Hi; I see a mention of Dapper here. It is entirely possible that this is a Dapper issue, using an unintended and unsupported feature that only kinda works by accident (and only when MARS is enabled). Ultimately, "MARS enabled" is not the same as "performing evilness via overlapped unawaited operations", and I will be the first to say: any time you don't immediately await something, you're effectively doing the same thing as multi-threading, and race conditions become obvious. So; if there's a problem, and this API doesn't work (or even claim to work) the way in which Dapper has used (abused) it: maybe we should just deprecate that feature from Dapper - or at least see if we can re-work it to be less evil. (I'm one of the Dapper maintainers)
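To make the pattern concrete, a sketch (not from the thread) of the kind of "overlapped unawaited operations" being described, where two commands are started on one connection before either is awaited:

```cs
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class OverlapIllustration
{
    static async Task<object[]> OverlappedQueriesAsync(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            await connection.OpenAsync();
            using (var cmd1 = new SqlCommand("select 1", connection))
            using (var cmd2 = new SqlCommand("select 2", connection))
            {
                // Neither task is awaited before the next one starts, so both
                // operations overlap on the same session. With MARS this may
                // appear to work; without it, it is unsupported - and either
                // way it behaves like unsynchronised multi-threading.
                Task<object> pending1 = cmd1.ExecuteScalarAsync();
                Task<object> pending2 = cmd2.ExecuteScalarAsync();
                return await Task.WhenAll(pending1, pending2);
            }
        }
    }
}
```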
The system where the bug was discovered was using Dapper too. But the bug remained after the code was rewritten without Dapper. We will know for sure once the underlying issue is fixed.
@Nercury is it worth updating the issue title, as it is not only happening with MARS enabled?
This fix applies to the SqlCommand layer and to the Begin and End methods of ExecuteReaderAsync, so MARS was not a factor, and neither is the Managed SNI. We also know ExecuteXmlReaderAsync has the same issue; I will apply the same fix there on Monday.
That's great news, I was really struggling to find a good repro for the non-MARS scenario! I will start testing this build next week.
I now have a consistent repro of this issue to rule out inconsistencies, as it's always been difficult to reproduce this consistently. The issue is exaggerated with threadpool exhaustion; instead of exhausting resources every time, we are able to put a small delay on one of the delegates we know is causing the problem here, and with that we can reproduce the wrong-data case. To run this repro, you'd need to reference the driver source project and update the OnTimeout function with a small delay period of 10 seconds, so it looks like this:

```cs
private void OnTimeout(object state)
{
    Thread.Sleep(10000); // Include this line
    if (!_internalTimeout)
    {
        ...
```

The output captured is as under:

While we're trying our best to fix this by various methods, things are very flaky and the design is highly complex. Being able to reproduce consistently does give some hope. But there may be more cases where wrong data could be possible. Will provide more updates as we identify more scenarios and solve this one.
Hi Everyone! We'd like to request community validation of the fix from PR #906. Please give the below NuGet package a try and send us your feedback:
Anyone?
We're very keen @Wraith2, we just need to find a moment, having spent engineering time moving Linux services back to Windows to mitigate the issue.
@Wraith2, I've spent some time testing the pull request with my own repro case. Earlier on, I was able to get the incorrect results with the general version 2.1.1: consistent wrong results over the different attempts. For the sake of completeness: the same wrong results with the general version 2.1.0. Next, I tried Microsoft.Data.SqlClient.2.1.0-pull-3bdbb71.21048.9. This version never gave any incorrect results, so it looks like this fix solves the problem. :)
Hi @cheenamalhotra, @Wraith2, as you know, reproducing the issue on real-world applications, especially without MARS, is extremely challenging, so it is hard to prove the fix. Thanks for the great effort on this!
Some more background information about my earlier repro case, for reference. The repro case ran as a Docker container in a Microsoft Azure environment, on a Linux App Service Plan (P1V2 and P2V2, both used for testing) and Azure SQL Server (S0).

The base situation is a fully async operation: OpenAsync, ExecuteReaderAsync, and ReadAsync. I've never used transactions in the scenarios, but used a low CommandTimeout (1 second) to force problems as much as possible. The SQL statement I used is "select @id as Id".

The issue of the wrong results only occurred when a "Timeout expired" exception appeared. Scaling the database up/down also causes exceptions, but didn't trigger any wrong results (or at least, I couldn't manage to do so).

The results:

Mitigations:
Attempt 1: Use ConfigureAwait(false) for OpenAsync, ExecuteReaderAsync, ReadAsync.
Attempt 2: Use .Open() instead of OpenAsync().
Attempt 3: Change the Max Connection Pool Size. Performance was, however, reduced significantly.

Results with the patch:

In summary:
The patch solves the situation of wrong results appearing. So looking forward to this hotfix being released... ;)
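A rough sketch of what mitigation attempts 1 and 2 above look like in code (the surrounding method and names are assumptions; the real repro case is linked earlier in the thread):

```cs
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class MitigationSketch
{
    static async Task<int> QueryWithMitigationsAsync(string connectionString, int id)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open(); // attempt 2: synchronous Open() instead of OpenAsync()
            using (var command = new SqlCommand("select @id as Id", connection))
            {
                command.Parameters.AddWithValue("@id", id);
                command.CommandTimeout = 1; // low timeout, as in the repro, to force the error path
                // attempt 1: ConfigureAwait(false) on every await in the chain
                using (var reader = await command.ExecuteReaderAsync().ConfigureAwait(false))
                {
                    await reader.ReadAsync().ConfigureAwait(false);
                    return reader.GetInt32(0);
                }
            }
        }
    }
}
```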
The underlying bug that was found was caused by threadpool exhaustion. It is a bug in the TdsParser layer, which is above the network layer, so it was possible to hit the problem in any version of the driver, but it was much more likely in managed (Linux, macOS) mode and with MARS, due to the way they use threads. The issue itself came down to the timing of internal cancellation messages and how they were implemented, which made it possible for a timeout to arrive after a result had returned, or even worse, after another query had begun. What the PR does is harden the timeout handling by adding synchronous timeout checks and carefully tracking the origin of the async timeout, so that if it doesn't match the executing query it is discarded. This results in more accurate timeouts in high-contention scenarios and prevents the cross-query timeouts that were causing strange mismatches in the driver.
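A conceptual sketch of the hardening described above (illustrative only, not the driver's actual code; ScheduleTimeout and AttemptCancellation are hypothetical stand-ins): each execution gets an identity, and a late timeout callback carrying a stale identity is discarded instead of cancelling whichever query happens to be running.

```cs
using System.Threading;

class TimeoutOwnershipSketch
{
    private int _currentRequestId;

    public void StartExecution(int timeoutMilliseconds)
    {
        // Tag the pending timeout with the id of the execution that armed it.
        int requestId = Interlocked.Increment(ref _currentRequestId);
        ScheduleTimeout(timeoutMilliseconds, requestId); // hypothetical timer hookup
    }

    private void OnTimeout(object state)
    {
        // A timeout raised for an earlier execution must be ignored:
        // acting on it could cancel (or corrupt) the query running now.
        if ((int)state != Volatile.Read(ref _currentRequestId))
            return;

        AttemptCancellation(); // hypothetical cancellation path
    }

    private void ScheduleTimeout(int ms, int requestId) =>
        new Timer(OnTimeout, requestId, ms, Timeout.Infinite);

    private void AttemptCancellation() { /* send attention/cancel to the server */ }
}
```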
The latest Microsoft.Data.SqlClient v2.1.2 has been released, containing the fix for this issue.
Thank you all for the brilliant work to track down this elusive bug!
@cheenamalhotra / @Wraith2: We are experiencing a similar issue using versions 1.1.x. Is this issue applicable to v2.1 only?
It could occur in all versions of SqlClient prior to the fix in 2.1.2. Correction: the fix has also been released in v1.1.4.
@Wraith2 Thank you.
Hi @andrehaupt, v1.1.4 has been released and contains the fix too. Please upgrade if you're still using the v1.1 series.
Hi @cheenamalhotra, @Wraith2, would you be able to clarify the following points, please?
I'm trying to analyze a similar issue observed with a .NET 5 app running on Windows with System.Data.SqlClient (and .NET runtime 5.0.2), so these clarifications would greatly help, thanks!
The problem could affect all versions of the library running on any version of .NET. It was more likely on the managed driver, but the root cause was shared. It was a timing issue.
Yes, every version of SqlClient prior to the fix could be affected.
…ng to intermittent wrong data

Backport of dotnet/SqlClient#659 to 3.1

Cherry-picks:
- dotnet/SqlClient@717ceda
- dotnet/SqlClient@2c2f100
- dotnet/SqlClient@81055cf

Resolves CVE-2022-41064.
Describe the bug
Sometimes a query may return wrong results under high load on non-Windows clients.
To reproduce
The bug is seen more often when `MultipleActiveResultSets=True` (MARS) is in the connection string. It might be prevented by disabling MARS (unclear at this point). The issue is hard to reproduce and rare; therefore, this project was created with code that attempts to simulate the situation.
This program starts 2000 concurrent connections that run a `select @Id as Id` statement. Each `Id` is different, and the query result should always return the id that was queried, yet under certain circumstances that's not the case. It looks like the reader sometimes returns stale data from a random previous connection that experienced a timeout.
There are more details in the linked project.
Network latency can be more easily simulated using this method here: https://github.com/trainline/SqlClient659
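A condensed sketch of the repro described above (structure and names are assumptions; the full projects are in the links):

```cs
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

class Repro
{
    // Assumption: a connection string pointing at a server with some network
    // latency, ideally with MARS enabled to hit the bug faster.
    const string ConnectionString =
        "Server=remote-host;Database=master;User Id=...;Password=...;MultipleActiveResultSets=True";

    static Task Main() =>
        // 2000 concurrent queries, each verifying its own result.
        Task.WhenAll(Enumerable.Range(0, 2000).Select(QueryAndVerifyAsync));

    static async Task QueryAndVerifyAsync(int id)
    {
        try
        {
            using (var connection = new SqlConnection(ConnectionString))
            {
                await connection.OpenAsync();
                using (var command = new SqlCommand("select @Id as Id", connection))
                {
                    command.Parameters.AddWithValue("@Id", id);
                    using (var reader = await command.ExecuteReaderAsync())
                    {
                        while (await reader.ReadAsync())
                        {
                            int returned = reader.GetInt32(0);
                            if (returned != id)
                                Console.WriteLine($"WRONG RESULT: sent {id}, got {returned}");
                        }
                    }
                }
            }
        }
        catch (SqlException ex)
        {
            // Timeouts are expected under load; the bug is wrong data, not errors.
            Console.WriteLine($"query {id} failed: {ex.Message}");
        }
    }
}
```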
Expected behavior
Under no circumstances should a select statement return a different result.
Further technical details
Microsoft.Data.SqlClient version: 4.8.1 or earlier
.NET target: netcoreapp2.0 - netcoreapp3.1
SQL Server version: SQL Server Web Edition 13.00.5426.0.v1, SQL Server Express Edition 14.00.3281.6.v1
Operating system: macOS 10.15.5, Docker container
Workaround
Do not use this SQL client on Linux/Docker.
Always check the id of the returned result and retry or crash if it is wrong.
To decrease the likelihood of this, or maybe prevent it entirely (unclear at this point), do not use the MARS feature: if you are using .NET Core with an MSSQL database, ensure that the connection string does not contain `MultipleActiveResultSets=True`.