Async reader may return wrong results on MacOS, Linux, WSL, Docker #659

Closed
Nercury opened this issue Jul 21, 2020 · 44 comments
Labels
Area\Managed SNI Issues that are targeted to the Managed SNI codebase. 🐛 Bug! Issues that are bugs in the drivers we maintain.

Comments

@Nercury

Nercury commented Jul 21, 2020

Describe the bug

Under high load, a query may sometimes return wrong results on non-Windows clients.

To reproduce

  • .NET Core
  • MSSQL instance
  • Client is not running on Windows (reproduced on MacOS or Linux)
  • Query uses async reader methods
  • High load that saturates the connection pool and causes timeouts.

The bug is seen more often when MultipleActiveResultSets=True (MARS) is in the connection string. It might be prevented by disabling MARS (unclear at this point).

The issue is rare and hard to reproduce; therefore this project was created with code that attempts to simulate the situation.

This program starts 2000 concurrent connections that each run a select @Id as Id
statement. Each Id is different, and the query should always return
the id that was queried, yet under certain circumstances that's not the case. It looks like
the reader sometimes returns stale data from a random previous connection that experienced a timeout.

There are more details in the linked project.
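For reference, the per-connection check is roughly of this shape (an illustrative sketch with assumed names, not a copy of the linked project):

using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static async Task<bool> QueryAndVerify(string connectionString, int id)
{
    using (var connection = new SqlConnection(connectionString))
    {
        await connection.OpenAsync();

        using (var command = new SqlCommand("select @Id as Id", connection))
        {
            command.Parameters.AddWithValue("@Id", id);

            using (var reader = await command.ExecuteReaderAsync())
            {
                await reader.ReadAsync();

                // Under the bug, the value read here can belong to a different,
                // previously timed-out connection instead of the @Id just sent.
                return reader.GetInt32(0) == id;
            }
        }
    }
}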

Network latency can be more easily simulated using this method here: https://github.com/trainline/SqlClient659

Expected behavior

Under no circumstances should a select statement return a different result.

Further technical details

Microsoft.Data.SqlClient version: 4.8.1 or earlier
.NET target: netcoreapp2.0 - netcoreapp3.1
SQL Server version: SQL Server Web Edition 13.00.5426.0.v1, SQL Server Express Edition 14.00.3281.6.v1
Operating system: macOS 10.15.5, Docker container

Workaround

Do not use this SQL client on Linux/Docker.

Always check the id of the returned result and retry or crash if it is wrong.

To decrease the likelihood of this, or maybe prevent it (unclear at this point), do not use the MARS feature: if you are using .NET Core with an MSSQL database, ensure that the connection string does not contain MultipleActiveResultSets=True.
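As a rough illustration of the id-check workaround (hypothetical names; adapt to the actual query):

// Hypothetical guard around the read: if the returned id does not match the one
// that was queried, treat the result as poisoned and fail fast (or retry from scratch).
var returnedId = reader.GetInt32(reader.GetOrdinal("Id"));
if (returnedId != expectedId)
{
    // Silently using the value risks processing data that belongs to another request.
    throw new InvalidOperationException(
        $"Queried id {expectedId} but the reader returned {returnedId}.");
}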

@JRahnama
Contributor

@Nercury thanks for reaching out. Issue #422 is similar to the problem you have and is under investigation.

@JRahnama JRahnama added 2️⃣ Duplicate Issue/PR that is a duplicate and already exists. 🐛 Bug! Issues that are bugs in the drivers we maintain. Area\Managed SNI Issues that are targeted to the Managed SNI codebase. and removed 2️⃣ Duplicate Issue/PR that is a duplicate and already exists. labels Jul 21, 2020
@DavoudEshtehari
Contributor

Hi @Nercury,

Thank you for producing the sample app and describing the issue well.
I ran your sample app multiple times on a virtual machine running Ubuntu 18.0, .NET Core 3.1, and SQL Server 19, with MARS enabled and a connection pool size of 100. I didn't get any wrong results as you expected.

Do you have any comments?

@Nercury
Author

Nercury commented Aug 18, 2020

I didn't get any wrong results as you expected.

Do you have any comments?

Yes, this matches my experience. As I mentioned, the issue seems to be related to timeouts, and they cannot happen if the DB server and the code are on the same machine and the query is doing basically nothing.

Try to introduce some latency between the db server and the application.

I reproduced it using a somewhat crude approach: running the DB server in the cloud and the application on my laptop, with the router in another room (so there is considerable latency between the two). Maybe there are tools to do it more cleanly.

In the production environment, AWS starts throttling read operations when they go above the allowed IOPS limit; that's how it originally happened when running on AWS.

@DavoudEshtehari
Contributor

By replacing the data source with a slower connection over the network, I am able to reproduce the wrong result.
I checked the code and added a local index inside the QueryAndVerify method to keep the index parameter's value, then used this local variable instead of the parameter inside the function body; this change solved the issue. Can I ask you to give it a try and let me know?

@Nercury
Author

Nercury commented Sep 7, 2020

Thanks for checking it out. I am not sure I completely understand what you mean by "added a local index inside the QueryAndVerify method to keep the index parameter value and used this local variable instead of that parameter inside the function body", because the int index is already local to the QueryAndVerify function. Can you provide a changeset (in the form of a PR) so that I am sure I am doing the same thing?

@DavoudEshtehari
Contributor

While I was preparing the requested changes, I found out it is happening again. Maybe the network conditions had coincidentally affected my test alongside the last change.
The issue intermittently happens with both System.Data.SqlClient and Microsoft.Data.SqlClient when ExecuteReaderAsync is mixed with exceptions.

@DavoudEshtehari DavoudEshtehari removed the Area\Managed SNI Issues that are targeted to the Managed SNI codebase. label Sep 8, 2020
@DavoudEshtehari DavoudEshtehari added the Area\Managed SNI Issues that are targeted to the Managed SNI codebase. label Sep 15, 2020
@DavoudEshtehari
Contributor

During the investigation, I found these mitigations, in addition to the workaround already mentioned (disabling MARS):

  • Set Pooling = false
  • Removing a connection that encountered an exception from the connection pool: SqlConnection.ClearPool(cn)
  • Attempting to execute a SQL command without attaching a transaction to the command (remove the BeginTransaction & CommitAsync) :

var command = new SqlCommand(sql, cn);
command.Transaction = null;

  • Using ExecuteScalarAsync instead of ExecuteReaderAsync

@Nercury Can you give them a try one by one?
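For the last item, a rough sketch of the ExecuteScalarAsync variant, assuming the cn and sql variables from the snippet above and the repro's @Id parameter:

// ExecuteScalarAsync returns only the first column of the first row,
// avoiding the ExecuteReaderAsync code path entirely.
var command = new SqlCommand(sql, cn);
command.Parameters.AddWithValue("@Id", id);
var result = (int)await command.ExecuteScalarAsync();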

@Nercury
Author

Nercury commented Sep 17, 2020

Set Pooling = false

With MultipleActiveResultSets=True;Pooling=false; it is painfully slow with many timeouts, but there are no wrong results, which is good.

Removing the connection that encountered an exception

SqlConnection cnp = null;
try
{
    using (var cn = new SqlConnection(connectionString))
    {
        cnp = cn;

        // ...
    }
}
catch (Exception e)
{
    if (cnp != null)
    {
        SqlConnection.ClearPool(cnp);
    }

    // ...
}

Got wrong results on a third attempt.

Attempting to execute a SQL command without attaching a transaction to the command

This was attempted in production when it was still unclear whether this was a MARS issue. It makes wrong results rarer, but they still happen. I was not able to reproduce it in this test project with a simplistic select statement.

Using ExecuteScalarAsync instead of ExecuteReaderAsync

Tried it 5 times and wrong results did not happen.

@DavoudEshtehari I hope this helps.

@gyphysics

Same bug experienced when MARS is disabled:

We are experiencing a similar problem in our code base, although we have MARS disabled explicitly in the connection string and still see the bug.

We run this method across multiple incoming requests from an API Controller:

public async Task<T> GetAsync(string id)
{
    var sql = $"SELECT Data FROM { _configuration.TableName } {(_configuration.InMemory ? "WITH (SNAPSHOT)" : "")} WHERE Id=@id";

    using (var connection = _connectionFactory.Create())
    {
        var record = await connection.QueryAsync<string>(sql, new { id }).ConfigureAwait(false);

        var data = record.FirstOrDefault();

        return data == null ? default(T) : _serializer.Deserialize<T>(data);
    }
}

The connectionFactory simply returns a new SqlConnection:
return new SqlConnection(_connectionString);

id is a GUID
data will be an encrypted blob of data that decrypts to a JSON payload.
_serializer handles the decryption (if needed) then deserialisation.

Things will run normally, then a single instance will experience a SQL exception: previously there was an unexpected (and denied) request to perform a SHOWPLAN; sometimes it's a SQL timeout exception. Following that exception, serialisation bugs continue to appear on that instance until it is taken out of service.

The source of the serialisation failure is that the value stored in data is an id from an earlier request, rather than the expected value stored on the SQL DB. We've had this happen both on the in-memory DB and without it.

Once the instance has been out of service for a while, subsequent attempts to access the DB via this method complete successfully, so it seems to have the ability to resolve itself. Even when the instance is misbehaving, it appears that not all requests get jumbled up, as some manage to succeed.

Previously this service was hosted on Windows without any issues. Since the port to Linux this problem crops up every 7-30 days or so. We've also seen it on more than one service (the above code is in a shared library). This bug seems to be the closest thing to what we are experiencing, except that we have explicitly disabled MARS in the connection string and are still seeing it from time to time.

Libraries/Infrastructure involved:
NetCore 3.1
Dapper 2.0.35
Microsoft.Data.SqlClient 2.0.1

This is being run on an AWS T3 instance hosting Ubuntu 18.04

We are going to attempt changing the call from QueryAsync to ExecuteScalarAsync and are attempting to capture some more information to help us debug the issue. If the change to ExecuteScalarAsync resolves the problem for at least a month or so then I'll add an extra comment here indicating that it's a possible fix.
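For reference, the change inside GetAsync would look roughly like this in Dapper terms (a sketch, not the actual diff):

// Dapper's ExecuteScalarAsync<T> returns the first column of the first row
// directly, instead of materialising results through a reader.
var data = await connection.ExecuteScalarAsync<string>(sql, new { id }).ConfigureAwait(false);

return data == null ? default(T) : _serializer.Deserialize<T>(data);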

@cheenamalhotra
Member

cheenamalhotra commented Nov 13, 2020

Hi @Nercury @gyphysics

Could you please try your use-cases with PR #796? I ran the provided repro and am able to see the issue fixed.
For me, 10k iterations now also complete successfully, while the same run failed with the current driver.


Kindly confirm.


Edit: This issue remained unfixed for now, due to colliding changes in different async paths.

@Nercury
Author

Nercury commented Nov 24, 2020

This issue remained unfixed for now, due to colliding changes in different async paths.

@cheenamalhotra All right, it makes sense to wait until all fixes are in.

@simonjduff

simonjduff commented Dec 2, 2020

Following on from @gyphysics (some company, different team):
Upgraded to 2.1.0 and still seeing this issue in multiple services, all on Linux.
Definitely no MARS in connection string (though we haven't actively disabled it).
We are using Dapper and will try to remove it from one project to prove it's not related.
In the meantime we're using guard clauses to prevent wrong data being used:

if (!thingId.Equals(thing.Id))
{
    throw new ThingConflictException($"Asked for thing [{thingId}], got thing [{thing.Id}]");
}

We have also seen some problems on INSERT, where no exception was raised but, on checking later, the data was not written to the table. There were gaps in the IDENTITY column, suggesting a possible silent transaction rollback (the exception didn't come back to the originating async context?). Possibly unrelated, but it seems close enough to note here.

@alefranz

alefranz commented Dec 7, 2020

Hello!

I've created a new repro, putting together items from this discussion, using docker-compose so it can be easily executed locally while also simulating network failures.

https://github.com/trainline/SqlClient659

I can confirm the problem can be reproduced even without MARS, although it is less frequent.

| .NET version | SQL Server version | MARS | Failure rate |
|---|---|---|---|
| .NET Core 3.1 | 2017 Enterprise | Yes | High |
| .NET Core 3.1 | 2017 Enterprise | No | Low |
| .NET 5.0 | 2017 Enterprise | Yes | High |
| .NET 5.0 | 2017 Enterprise | No | Low |
| .NET Core 3.1 | 2019 Enterprise | Yes | High |
| .NET Core 3.1 | 2019 Enterprise | No | Low |
| .NET 5.0 | 2019 Enterprise | Yes | High |
| .NET 5.0 | 2019 Enterprise | No | Low |

@cheenamalhotra Any suggestion on how to try to pinpoint the origin of the problem? Is there a diagnostic source available in the SqlClient?

@mgravell
Member

mgravell commented Dec 7, 2020

Hi; I see a mention of Dapper here. It is entirely possible that this is a Dapper issue, using an unintended and unsupported feature that only kinda works by accident (and only when MARS is enabled). Ultimately, "MARS enabled" is not the same as "performing evilness via overlapped unawaited operations", and I will be the first to say: any time you don't immediately await something, you're effectively doing the same thing as multi-threading, and race conditions become obvious.
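To illustrate the kind of pattern meant by overlapped unawaited operations (a contrived sketch, not Dapper's actual code):

// Two commands started on the same connection without awaiting the first:
// effectively concurrent use of a single connection, which only works at all
// when MARS is enabled and behaves like unsynchronised multi-threading.
// (command1 and command2 are two SqlCommands sharing one connection.)
var firstTask = command1.ExecuteReaderAsync();   // not awaited yet
var secondTask = command2.ExecuteReaderAsync();  // overlaps the first on the same connection

using (var reader1 = await firstTask)
using (var reader2 = await secondTask)
{
    // ...
}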

So; if there's a problem, and this API doesn't work (or even claim to work) the way in which Dapper has used (abused) it: maybe we should just deprecate that feature from Dapper - or at least see if we can re-work it to be less evil.

(I'm one of the Dapper maintainers)

@Nercury
Author

Nercury commented Dec 7, 2020

The system where the bug was discovered was using Dapper too. But the bug remained after the code was rewritten without Dapper. We will know for sure once the underlying issue is fixed.

@alefranz

alefranz commented Dec 8, 2020

@Nercury is it worth updating the issue title as it is not only happening with MARS enabled?

@Nercury Nercury changed the title Multiple parallel transactions with enabled MARS may return wrong results instead of throwing an exception Multiple parallel transactions may return wrong results Dec 8, 2020
@Nercury Nercury changed the title Multiple parallel transactions may return wrong results Async reader may return wrong results Dec 8, 2020
@Nercury Nercury changed the title Async reader may return wrong results Async reader may return wrong results on MacOS, Linux, Windows/WSL, Docker Dec 8, 2020
@cheenamalhotra
Member

This fix applies to the SqlCommand layer and to the Begin and End methods of ExecuteReaderAsync, so MARS was not a factor, and neither is Managed SNI.

We also know ExecuteXmlReaderAsync has the same issue; I will apply the same fix there on Monday.

@alefranz

That's great news; I was really struggling to find a good repro for the non-MARS scenario! I will start testing this build next week.

@cheenamalhotra
Member

cheenamalhotra commented Feb 10, 2021

I now have a consistent repro of this issue, which rules out inconsistencies, as it has always been difficult to reproduce reliably. The issue is exaggerated by thread-pool exhaustion; instead of exhausting resources every time, we put a small delay in one of the delegates we know is causing the problem here, and are able to reproduce the wrong-data case.

Link to Repro

To run this repro, you'd need to reference the driver source project and update the OnTimeout function with a small delay of 10 seconds, so it looks like this:

private void OnTimeout(object state)
{
    Thread.Sleep(10000); // Include this line
    if (!_internalTimeout)
    {
        ...
The captured output showing the reproduction was attached as a screenshot.


While we're trying our best to fix this by various methods, things are very flaky and the design is highly complex. Being able to reproduce it consistently does give some hope, but there may be more cases where wrong data is possible. We will provide more updates as we identify more scenarios and solve this one.

@cheenamalhotra
Member

cheenamalhotra commented Feb 17, 2021

Hi Everyone!

We'd like to request community validations on the fix from PR #906.
We've captured the root cause and fixed the issue in the core parser layer.

Please give the below NuGet package a try and send us your feedback:
Microsoft.Data.SqlClient.2.1.0-pull-89d6662.21043.7 from Pipeline Artifacts

@Wraith2
Contributor

Wraith2 commented Feb 18, 2021

Anyone?

@simonjduff

We're very keen, @Wraith2; we just need to find a moment, having spent engineering time moving Linux services back to Windows to mitigate the issue.

@patrickbouwens1978

patrickbouwens1978 commented Feb 22, 2021

@Wraith2, I've spent some time testing the pull request with my own repro case.

Earlier on, I was able to get the incorrect results with the generally available version 2.1.1: consistent wrong results across the different attempts. For the sake of completeness: the same wrong results occurred with version 2.1.0.

Next, I've tried Microsoft.Data.SqlClient.2.1.0-pull-3bdbb71.21048.9. This version never gave any incorrect results, so it looks like this fix solves the problem. :)

@alefranz

Hi @cheenamalhotra , @Wraith2 ,

As you know, reproducing the issue on real-world applications, especially without MARS, is extremely challenging, so it is hard to prove the fix.
I can confirm, however, that it doesn't introduce any appreciable regression.

Thanks for the great effort on this!

@patrickbouwens1978

patrickbouwens1978 commented Feb 22, 2021

Some more background information about my earlier repro case, for reference.

The repro case ran as a Docker container in a Microsoft Azure environment, on a Linux App Service Plan (P1V2 and P2V2, both used for testing) and Azure SQL (S0).
The base dockerfile is mcr.microsoft.com/dotnet/core/aspnet:3.1.8-bionic, so a non-Windows environment on .NET Core 3.1.
The connection string didn't include any MARS references, so everything ran in a non-MARS environment.

The base situation is a fully async operation: OpenAsync, ExecuteReaderAsync, and ReadAsync. I never used transactions in these scenarios, but I used a low command timeout (1 second) to provoke problems as much as possible.

The SQL Statement I used is “select @id as Id”

The wrong results only occurred when a “Timeout expired” exception appeared. Scaling the database up or down also caused exceptions, but didn't trigger any wrong results (or at least, I couldn't manage to make it do so).

The results:
Base scenario: use of OpenAsync, ExecuteReaderAsync, and ReadAsync in a high-load situation with multiple threads and multiple requests.
When things went wrong, they kept going wrong in a consistent manner. For instance, suppose there are three users:
User 1: Select * FROM User where UserId = 1
User 2: Select * FROM User where UserId = 2
User 3: Select * FROM User where UserId = 3
User 1 caused an exception. User 2 received the results from User 1, User 3 received the results from User 2, and so on.
Different requests got each other’s results. More and more SQL statements returned wrong results, even after some time. The connection pool became corrupted, and stayed corrupted.
The only solution was to restart the web app.

Mitigations:
I tried a number of mitigations.

Attempt 1: Use ConfigureAwait(false) for OpenAsync, ExecuteReaderAsync, and ReadAsync.
No effect; wrong results kept appearing as in the base scenario.

Attempt 2: Use .Open() instead of OpenAsync()
Exceptions occurred more often (due to Timeout expired), but there were only a few wrong results (10 out of 5000 statements went wrong, instead of 4000+ out of 5000 in the base scenario).
Furthermore, the wrong results weren't consistently wrong. New requests could be OK, so no restart of the application was required. So it looked like some connections became corrupt, but somehow it didn't influence the connection pool for later connections.

Attempt 3: Change the Max Connection Pool Size
No noticeable effect.

Performance was, however, reduced significantly.

Results with the patch:
In all situations, no wrong results occurred.

In summary:
The only mitigations I could find were to:

  • Reduce load, so the chance of a Timeout Expired exception was reduced
  • Use Open instead of OpenAsync, so the chance of wrong results was reduced as well, and wrong results could ‘disappear’ in future requests. I was lucky to be using an O/R mapper, so I could change to Open instead of OpenAsync with just a couple of lines of code.

The patch solves the problem of wrong results appearing, so I'm looking forward to this hotfix being released... ;)

@Wraith2
Contributor

Wraith2 commented Feb 22, 2021

The underlying bug that was found was caused by thread-pool exhaustion. It is a bug in the TdsParser layer, which sits above the network layer, so it was possible to trigger the problem in any version of the driver, but it was much more likely in managed (Linux, macOS) mode and with MARS due to the way they use threads.

The issue itself was down to the timing of internal cancellation messages and how they were implemented, which made it possible for a timeout to arrive after a result had returned, or even worse, after another query had begun. What the PR does is harden the timeout handling by adding synchronous timeout checks and carefully tracking the origin of the async timeout so that if it doesn't match the executing query it is discarded. This results in more accurate timeouts in high-contention scenarios and the prevention of cross-query timeouts, which were causing the strange mismatches in the driver.
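Conceptually, the origin-tracking part can be pictured like this (an illustrative sketch only; the names and fields are invented, not the actual TdsParser code):

// Each execution is stamped with an identity; the timeout callback carries the
// identity of the execution that scheduled it. If the identities no longer match,
// the timeout belongs to an earlier query and is discarded instead of cancelling
// whatever is currently running on the connection.
private int _currentExecutionId;

private void OnTimeout(object state)
{
    int timeoutSourceId = (int)state;
    if (timeoutSourceId != _currentExecutionId)
    {
        return; // stale timeout from a previous execution: ignore it
    }

    // ... handle a genuine timeout for the currently executing query ...
}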

@cheenamalhotra
Member

The latest Microsoft.Data.SqlClient v2.1.2 has been released, containing the fix for this issue.
Closing as resolved.

@alefranz

alefranz commented Mar 6, 2021

Thank you all for the brilliant work to track down this elusive bug!

@andrehaupt

@cheenamalhotra / @Wraith2: We are experiencing a similar issue using versions 1.1.x. Is this issue applicable to v2.1 only?

@Wraith2
Contributor

Wraith2 commented Mar 22, 2021

It could occur in all versions of SqlClient prior to the fix in 2.1.2

Correction: Fix has also been released in v1.1.4

@andrehaupt

@Wraith2 Thank you.

@cheenamalhotra
Member

Hi @andrehaupt

v1.1.4 has been released containing the fix too. Please upgrade if you're still using the v1.1 series.

@tlecomte

tlecomte commented Oct 5, 2021

Hi @cheenamalhotra, @Wraith2,

Would you be able to clarify the following points, please?

  • Was this bug also affecting .NET 5 apps running on Windows? The title of this issue seems to say that it should not happen on Windows, but looking at the fix in "Fix 659: sync and async timer state rework" #906, it seems like it also affects Windows, and even .NET Framework apps, doesn't it?

  • Is System.Data.SqlClient also affected, again in the context of a .NET 5 app running on Windows?

I'm trying to analyze a similar issue observed with a .NET 5 app running on Windows with System.Data.SqlClient (and .NET runtime 5.0.2), so these clarifications would greatly help, thanks!

@Wraith2
Contributor

Wraith2 commented Oct 5, 2021

was this bug also affecting .NET 5 apps running on Windows ?

The problem could affect all versions of the library running on any version of .NET. It was more likely with the managed driver, but the root cause was shared. It was a timing issue.

is System.Data.SqlClient also affected

Yes, every version of SqlClient prior to the fix could be affected.
It doesn't look like the fix was backported to corefx, so the System version retains the ability to encounter the problem (https://github.com/dotnet/corefx/pulls?q=is%3Apr+is%3Aclosed+label%3Aarea-System.Data.SqlClient). I don't know whether servicing was considered, but given the rarity of the problem and the difficulty of the fix, I can see why it wouldn't be an obvious choice to push back to an LTS release.

vseanreesermsft pushed a commit to vseanreesermsft/corefx that referenced this issue Jan 10, 2023
Fasjeit pushed a commit to CryptoPro/corefx that referenced this issue Sep 21, 2023
cheenamalhotra added a commit to dotnet/maintenance-packages that referenced this issue Mar 26, 2024