Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken connection on unix systems when executing multiple concurrent statements against a single connection #1620

Closed
Mike737377 opened this issue May 24, 2022 · 13 comments

Comments

@Mike737377
Copy link

Describe the bug

On unix based systems the following exception is thrown when a single connection is being shared by multiple tasks/threads. The number of threads does not seem to be too relevant as long as there are 2+ threads involved.

Exception message: Microsoft.Data.SqlClient.SqlException (0x80131904): The connection is broken and recovery is not possible.  The connection is marked by the client driver as unrecoverable.  No attempt was made to restore the connection.
Stack trace:
at Task<DbDataReader> Microsoft.Data.SqlClient.SqlCommand.ExecuteDbDataReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken)+(Task<SqlDataReader> result) => { } 
at void System.Threading.Tasks.ContinuationResultTaskFromResultTask<TAntecedentResult, TResult>.InnerInvoke()

Other threads throw:

  • System.InvalidOperationException: Invalid operation. The connection is closed.
  • System.InvalidOperationException: BeginExecuteReader requires an open and available Connection. The connection's current state is open.
  • System.InvalidOperationException: BeginExecuteReader requires an open and available Connection. The connection's current state is closed.

It will typically take somewhere between 10,000 to 10,000,000 sql statements to occur over the connection before the exception happens. By increasing the minimum number of threads into the hundreds the thread pool via ThreadPool.SetMinThreads the exception will usually present itself earlier on.

This exception is problematic under our use case as this is a long running transactional batch import operation, otherwise we would just reconnect again and continue.

Running the same version of the code on windows executes successfully.

To reproduce

The following is a harsh simulation of what we are doing however I'm unable to reproduce the exact error as I encounter previously reported errors (#422, #826) before I can reproduce this one.

using System.Data;
using System.Linq;
using Microsoft.Data.SqlClient;

System.Threading.ThreadPool.SetMinThreads(500, 500);

using var sqlConnection = new SqlConnection("Server=tcp:*******,1433;Database=****;User Id=****;Password=****;MultipleActiveResultSets=true;TrustServerCertificate=true;Encrypt=true;");
sqlConnection.Open();
using var tran = sqlConnection.BeginTransaction();

Task.WaitAll(Enumerable.Range(0, Environment.ProcessorCount * 10).Select(i =>
{
    return Task.Run(() =>
    {
        for (var j = 0; j < 100000; j++)
        {
            DoThing(sqlConnection, tran, i);
        }

        Console.WriteLine(i);
    });        
}).ToArray());

tran.Commit();

static void DoThing(SqlConnection conn, SqlTransaction tran, int i)
{
    var sql1 = "insert into aaa(gid, txt, num, dt) values(@0, @1, @2, @3); select @@IDENTITY";
    var sql2 = "select * from aaa where id=@0";

    using var command = conn.CreateCommand();
    command.CommandText = sql1;
    var param1 = command.CreateParameter();
    param1.ParameterName = "0";
    param1.Value = Guid.NewGuid();
    param1.DbType = DbType.Guid;
    var param2 = command.CreateParameter();
    param2.ParameterName = "1";
    param2.Value = "Some random text with: " + i;
    param2.DbType = DbType.String;
    param2.Size = 100;
    var param3 = command.CreateParameter();
    param3.ParameterName = "2";
    param3.Value = i;
    param3.DbType = DbType.Int64;
    var param4 = command.CreateParameter();
    param4.ParameterName = "3";
    param4.Value = DateTime.Now;
    param4.DbType = DbType.DateTime;
    command.Parameters.Add(param1);
    command.Parameters.Add(param2);
    command.Parameters.Add(param3);
    command.Parameters.Add(param4);
    command.Transaction = tran;
    var x = command.ExecuteScalar();

    using var command1 = conn.CreateCommand();
    command1.CommandText = sql2;

    var param5 = command1.CreateParameter();
    param5.ParameterName = "0";
    param5.DbType = DbType.Int32;
    param5.Value = x;

    command1.Transaction = tran;
    command1.Parameters.Add(param5);
    using var reader =  command1.ExecuteReader();

    while (reader.Read()) { }
}
create table aaa
(
	id int identity primary key,
	gid uniqueidentifier,
	txt nvarchar(max),
	num integer, 
	dt datetime
)

Expected behavior

SqlConnection stays open

Further technical details

.NET target: .net 5 & .net 6
SQL Server version: SQL Server 14.0.2037.2, Azure SQL instance (general purpose serverless gen 5)
Operating system: 5.4.0-110-generic #124-Ubuntu, Ubuntu Focal 20.04 inside a docker container
Reproducable against:

  • System.Data.SqlClient 4.700.21.41603
  • Microsoft.Data.SqlClient 4.0.1 & 5.0.0-preview2.22096.2

Additional context
I've grabbed a copy of the code (commit #dfa62a1746) and added some extra tracing and can see that SNIHandle.CheckConnection() in the following stack trace is returning 1.

   at Microsoft.Data.SqlClient.SNI.TdsParserStateObjectManaged.CheckConnection() in    at Microsoft.Data.SqlClient.TdsParserStateObject.ValidateSNIConnection() in \src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 2663
   at Microsoft.Data.SqlClient.SqlConnection.ValidateAndReconnect(Action beforeDisconnect, Int32 timeout) in \src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlConnection.cs:line 1448
   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest) in \src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 4678
   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method) in \src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 4561
   at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String methodName) in \src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 1644
   at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryInternal(CommandBehavior behavior, AsyncCallback callback, Object stateObject, Int32 timeout, Boolean inRetry, Boolean asyncWrite) in \src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 1247
   at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryAsync(AsyncCallback callback, Object stateObject) in \src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 1216
   at System.Threading.Tasks.TaskFactory`1.FromAsyncImpl(Func`3 beginMethod, Func`2 endFunction, Action`1 endAction, Object state, TaskCreationOptions creationOptions)
   at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQueryAsync(CancellationToken cancellationToken)
   at Microsoft.Data.SqlClient.SqlCommand.ExecuteNonQueryAsync(CancellationToken cancellationToken) in \src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 2516

Using dotnet trace to collect the event source I can see the following
image
Note that the ValidateSNIConnection is one of the tracing points I added to determine that CheckConnection was returning something other than success.

I'm happy to try anything that will either help pinpoint the issue or attempt a work around.

@roji
Copy link
Member

roji commented May 24, 2022

You seem to be trying to use the same SqlConnection instance concurrently from multiple threads - that is not supported. Multiple active result sets (MARS) does not change this - connections are not thread-safe (see docs).

@Mike737377
Copy link
Author

Mike737377 commented May 24, 2022

@roji understand, but why are we getting an SNI connectivity error? I'm curious as to what that has to do with MARS.
Does that mean that any async execute function on SqlCommand has to be awaited before you execute another command with MARS enabled when using a single open SqlConnection?

@roji
Copy link
Member

roji commented May 24, 2022

@Mike737377 once you do concurrent access to an API that isn't thread-safe (SqlConnection in this case), usually all bets are off and the behavior is completely undefined; in this case, you're effectively having multiple threads doing I/O on the same underlying socket, which causes everything to break down.

MARS is about allowing multiple open readers, and does not provide thread-safety. That means that you must await e.g. ExecuteReaderAsync, but once that's completed and you have the returned reader, you can execute another query without first consuming the first reader.

@schotime
Copy link

Does it just magically work on windows then? I would've at least thought the behaviour should be similar in this regard.

@Wraith2
Copy link
Contributor

Wraith2 commented May 25, 2022

Does it just magically work on windows then?

Unless you're forcing the windows code to use the managed network implementation then you're seeing a difference between the managed and unmanaged implementations. So yes, it does sort of magically work on windows.

As Roji said multithreaded use of connections is not safe. You've provided a replication though so I'm sure someone will take a look and see if there's a way we can prevent this problem.

@Mike737377
Copy link
Author

I've been looking into how the connection is checked and I've noticed that there seems to be 2 functions; TdsParserStateObject.IsConnectionAlive() and TdsParserStateObject.ValidateSNIConnection() which do almost the same check but slightly differently given one is concerned about if connections are alive in the pool and the later about if the connection is up for executing the sql.

In the TdsParserStateObject.IsConnectionAlive() there is a check to see if there are pendingCallbacks with the interesting note around SNICheckConnection not being supported whilst connections are in use. TdsParserStateObject.ValidateSNIConnection() does not have this check and is where the failure is originating from.

else if ((_pendingCallbacks > 1) || ((_parser.Connection != null) && (!_parser.Connection.IsInPool)))

I also spotted that the logic in TdsParserStateObject.IsConnectionAlive() is flawed as the returned SNI code cannot be both not SNI_SUCCESS and not SNI_WAIT_TIMEOUT at the same time, unless I'm reading that wrong.

if ((error != TdsEnums.SNI_SUCCESS) && (error != TdsEnums.SNI_WAIT_TIMEOUT))

Out of curiousty adding a call to AddError(_parser.ProcessSNIError(this)); in TdsParserStateObject.ValidateSNIConnection() after the CheckConnection() call did not yield any real details about the SNI error.
image

However if I add some quick retry logic then I don't ever seem to loose my the connection which means the error is transient.

internal override uint CheckConnection()
{
    SNIHandle handle = Handle;
    uint result = TdsEnums.SNI_SUCCESS;

    if (handle != null)
    {
        for (var attempt = 1; attempt <= MaxConnectionCheckAttempts; attempt++)
        {
            result = handle.CheckConnection();
            if (result == TdsEnums.SNI_SUCCESS)
            {
                return result;
            }

            SqlClientEventSource.Log.TryTraceEvent("TdsParserStateObjectManaged.CheckConnection | Info | Connection check failed result {0}, Attempts: {1}/{2}", result, attempt, MaxConnectionCheckAttempts);
        }
    }

    return result;
}

@roji
Copy link
Member

roji commented May 25, 2022

Does it just magically work on windows then?

I highly suspect that given the right timing and usage patterns, things would fail on Windows as well. That's the thing with concurrency bugs - they're timing sensitive, and exactly when they cause failures (and which type) is undefined and random. Whatever you happen to be seeing on Windows with your code, concurrent usage of SqlConnection is not supported.

@JRahnama
Copy link
Contributor

@roji and @Wraith2 thanks for providing accurate responses. @Mike737377 I also agree on this matter with @roji and @Wraith2. MARS does not change anything in this scenario. As Roji has mentioned:

concurrent usage of SqlConnection is not supported

@Mike737377
Copy link
Author

@roji @Wraith2 @JRahnama thanks for your comments so far and I understand that you're saying it's not officially supported. However reading between the lines, does this mean that this is a "won't fix" issue?

Do you want me to raise the impossible logic statement that I raised in my previous comment as a seperate issue?

I also spotted that the logic in TdsParserStateObject.IsConnectionAlive() is flawed as the returned SNI code cannot be both not SNI_SUCCESS and not SNI_WAIT_TIMEOUT at the same time, unless I'm reading that wrong.

if ((error != TdsEnums.SNI_SUCCESS) && (error != TdsEnums.SNI_WAIT_TIMEOUT))

@Wraith2
Copy link
Contributor

Wraith2 commented May 26, 2022

@roji @Wraith2 @JRahnama thanks for your comments so far and I understand that you're saying it's not officially supported. However reading between the lines, does this mean that this is a "won't fix" issue?

As an community contributor I have more freedom to speak bluntly than others. This is a complex issue which is going to be a nightmare to debug and isn't an issue that I would give much priority because it's skirting around unsupported behaviour so even if I fix it there's a decent chance that you'll immediately hit another issue with a similar cause.

I wouldn't say "won't fix" it's just not a good value proposition to spend time on. I will say that I've spent time on more obscure issues in this repo in the past.

if ((error != TdsEnums.SNI_SUCCESS) && (error != TdsEnums.SNI_WAIT_TIMEOUT))
Looks fine to me, if the error is not success and also not timeout then it's an unexpected return code so throw construct an error and place it in the error collection for throwing from the main run function.

@Mike737377
Copy link
Author

if ((error != TdsEnums.SNI_SUCCESS) && (error != TdsEnums.SNI_WAIT_TIMEOUT)) Looks fine to me, if the error is not success and also not timeout then it's an unexpected return code so throw construct an error and place it in the error collection for throwing from the main run function.

You're right. The double negation got me on this one.

Thanks for your comments @Wraith2. I've been rewriting the code to be non concurrent and with MARS disabled so hopefully I can put the issue to bed in our code base.

@roji
Copy link
Member

roji commented Jun 7, 2022

Note that you don't have to disable MARS, which allows multiple resultsets (readers) to be open at the same time. That's unrelated to the above discussion, which is about using SqlClient APIs concurrently from multiple threads.

Having said that, MARS has its own issues, and it's not a bad idea to move away from it - but that's orthogonal to the discussion above.

@JRahnama
Copy link
Contributor

Closing this as it is not an issue in the library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants