Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

WIP SqlClient handle socket network error gracefully #37068

Merged
merged 1 commit into from
Apr 25, 2019

Conversation

Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented Apr 21, 2019

Fixes https://github.com/dotnet/corefx/issues/33930

Then a running the SqlClient test suite in managed mode (Linux, osx, uap) and Debug build, an assert fires indicating that an invalid state has been reached. The assertion indicates that a packet has reached the processing stage and has no contents or that the packet has been received in a state where none was expected. The test failure was caused by an assertion but could lead to silent unexpected non-crashing behaviour in release builds leading to unreliability.

After investigation and debugging I've arrived at three related changes.

  • plumb the error message from the packet read function all the way through the handler chain so that the recipients can identify what the error was
  • change the read function so that socket disconnections use an existing but unused error code to identify that the connection has been reset and add specific error catching for disconnections.
  • detect disconnections and disposed errors using the above changes and close the connection gracefully, and change a check

The functional and manual tests pass over multiple repeated runs correctly with the affected tests re-enabled.
/cc @afsanehr, @tarikulsabbir, @Gary-Zh , @David-Engel @saurabh500

@saurabh500
Copy link
Contributor

A quick look shows that you removed TdsParserStateObject from the change and rest of the stuff looks close to original PR.
Anything else changed from previous PR?

@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 21, 2019

Sorry, I should have highlighted the specific change from the previous, it was late.

The important change is in SNIMarsHandle.ReceiveAsync where I'd already added an if to check for disconnected I've added another condition which checks for TdsEnums.SNI_ERROR. The reason for this is that occasionally (once every two full manual test runs on average) an attempt to read on a disposed socket was made and the error was flagged but not checked, so you'd end up here:

if (packet.DataLeft == 0)
{
packet.Dispose();
packet = null;
sniErrorCode = ReceiveAsync(ref packet);
if (sniErrorCode == TdsEnums.SNI_SUCCESS_IO_PENDING)
{
return;
}
HandleReceiveError(packet);

sniErrorCode would be set to SNI_ERROR at 276 but the HandleReceiveError wasn't checking for error and would try to forward the packet up the chain and because it was empty (no socket to attempt to get data from) it would trip the assert.

With this set of changes in place it got through 10 back to back runs of the manual test suite under varying amounts of system load. If there's still a problem with it then I'll need a very specific replication to find it.


((TdsParserStateObject)_callbackObject).ReadAsyncCallback(PacketHandle.FromManagedPacket(packet), 1);
if (sniErrorCode == TdsEnums.SNI_WSAECONNRESET || sniErrorCode == TdsEnums.SNI_ERROR)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit concerned about this change in terms of general design. This makes SNI aware of the connection handling mechanisms of TdsParserStateObject now. And now SNI is managing the connection (connection pool) based on the error.
Earlier, HandleReceiveError was calling the ReadAsyncCallback with 1 viz SNI_ERROR.
In theory, HandleReceiveError has already signalled to the callback, that an error occurred, and it looks like the error should be handled in TdsParserStateObject. But now we are shortcircuiting the job of what TdsParserStateObject should be doing, in this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the point when we've got an error and we know that the packet contains no data should we be calling the receive callback when we haven't received anything? If we do then the problem is that the ReadAsyncCallback checks the error too late (or asserts too early, depends on your point of view) to be useful, I think the assert is valid because at that point it's acting as if there's a successful read and checking to make sure that's the case which leads to the conclusion that we shouldn't get there at all.

Are you proposing to move the logic handling the specific error code out of the mars handle and into the top of the callback?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read through your findings in the issue itself, where Decrement RefCount was asserting.

The part which I am finding hard to wrap my head around is, that in this fix, we are calling BreakConnection of the common code and making SNI cognizant of the upper layers, which means that we have skipped a bunch of error handling logic, and that's not how this layering with SNI was designed in the first place.

My understanding is that receive callback should be called with an indication that an error occured and the CheckPacket shouldn't check for Data being null. This is because the Debug assert for check packet is used for both successful and unsuccessful cases. And there is no reason why data will be available in unsuccessful cases.

The next thing that I want to understand, is that why the decrement call back is asserting and why its going to a -2 in these cases. (as you have mentioned in the issue). The idea is not to make changes to common code. As you correctly pointed out in the issue, the common code changing much is a big violation of the design.

I feel like this is uncovering problems with the way the common code and managed SNI are interacting. I had fixed one such issue for MARS where I found errors in Managed SNI, where the ref count was being incremented in sync MARS cases, where it shouldn't https://github.com/dotnet/corefx/pull/28427/files

I think there is another bug lurking around in Managed SNI which when coupled with fixing the CheckPacket condition should magically make things work. However I haven't looked deeply enough to say what it is.

Copy link
Contributor Author

@Wraith2 Wraith2 Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's at least one other bug in the SNI stuff that I'm aware of at the moment. As I work on it I find I'm gradually simplifying and streamlining it which fits with the spirit of change you made in the PR you linked.

While I agree I think fixing the issue this way is better than not fixing it at the moment. I'm still trying to spend time on improving the performance and reliability of the managed implementation but this bug blocks a lot of tests that I could really do with having working.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I totally understand why you want to fix this.

Here's something I found. I changed CheckPacket to return true for managed SNI.
If I run just the MARSTest where the Debug.Assert is always hit, /p:XunitClassName=System.Data.SqlClient.ManualTesting.Tests.MARSTest I find that the SqlConnection.Dispose() is called before the call back is invoked, which causes the DecrementPendingCallbacks to fail. The SqlConnection.Close() when called, decrements the pending callback to 0 and then the receipt of failure message from SNIpacket again decrements the pending callback to -1



at System.Data.SqlClient.TdsParserStateObject.DecrementPendingCallbacks(Boolean release, String method) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 861
     at System.Data.SqlClient.TdsParserSessionPool.Dispose() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserSessionPool.cs:line 107
     at System.Data.SqlClient.TdsParser.Disconnect() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParser.cs:line 939
     at System.Data.SqlClient.SqlInternalConnectionTds.Dispose() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlInternalConnectionTds.cs:line 599
     at System.Data.ProviderBase.DbConnectionPool.DestroyObject(DbConnectionInternal obj) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\ProviderBase\DbConnectionPool.cs:line 910
     at System.Data.ProviderBase.DbConnectionPool.DeactivateObject(DbConnectionInternal obj) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\ProviderBase\DbConnectionPool.cs:line 882
     at System.Data.ProviderBase.DbConnectionPool.PutObject(DbConnectionInternal obj, Object owningObject) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\ProviderBase\DbConnectionPool.cs:line 1460
     at System.Data.ProviderBase.DbConnectionInternal.CloseConnection(DbConnection owningObject, DbConnectionFactory connectionFactory) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\ProviderBase\DbConnectionInternal.cs:lin
  e 270
     at System.Data.SqlClient.SqlConnection.CloseInnerConnection() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlConnection.cs:line 637
     at System.Data.SqlClient.SqlConnection.Close() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlConnection.cs:line 683
     at System.Data.SqlClient.SqlInternalConnectionTds.BreakConnection() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlInternalConnectionTds.cs:line 1747
     at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 2573
     at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 2128
     at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 2053
     at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 1089
     at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 1320
     at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) in D:\work\corefx\src\S
  ystem.Data.SqlClient\src\System\Data\SqlClient\TdsParser.cs:line 1624
     at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlDataReader.cs:line 1075
     at System.Data.SqlClient.SqlDataReader.get_MetaData() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlDataReader.cs:line 249
     at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlCommand.cs:line 2845
     at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds) in D:\work\corefx\s
  rc\System.Data.SqlClient\src\System\Data\SqlClient\SqlCommand.cs:line 2684
     at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite, String method) in D:
  \work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlCommand.cs:line 2515
     at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlCommand.cs:lin
  e 2473

This SqlConnection.Dispose() shouldn't happen before this callback is received. This should change the problem statement a little bit, to figure out why this Dispose() is happening before SNI triggers the callback.

If I run the tests against Native SNI, I don't see Dispose being called before the callback invocation.

Copy link
Contributor

@saurabh500 saurabh500 Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I know where the issue is.
When SNIMarsHandle.Receive is called, then the calling thread is sent an error from

if (!_packetEvent.Wait(timeoutInMilliseconds))
{
SNILoadHandle.SingletonInstance.LastError = new SNIError(SNIProviders.SMUX_PROV, 0, SNICommon.ConnTimeoutError, string.Empty);
return TdsEnums.SNI_WAIT_TIMEOUT;
}

As a result of this issue, the calling thread disposes the connection instead of handling the timeout via the the callback invocation.

The TdsParserStateObject which receives the timeout in Native SNI, is not the one on which callback are being invoked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole singleton LastError thing makes me nervous. Too much bad experience with globals being used for state in vb6.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wraith2 Let me do a bit more digging around this area to provide more pointers and have a concrete explanation. Don't burn your cycles on this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's already WIP but i can close if that's easier.

@saurabh500
Copy link
Contributor

@Wraith2 I am reading your comments from https://github.com/dotnet/corefx/issues/33930
Is it possible that the fix is way simpler and our IsValid check for Managed SNI is wrong ?

The native CheckPacket doesn't check the contents of the packet, only that it is a pointer to a valid packet structure which is a slightly different check to the managed version

This makes sense and should be extrapolated to managed SNI. In case of managed SNI, we should only check that Packet is not null and not check for the data inside the packet.
https://github.com/dotnet/corefx/blob/master/src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserStateObjectManaged.cs#L36-L40

@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 22, 2019

Is it possible that the fix is way simpler and our IsValid check for Managed SNI is wrong ?

I did try that and it just blew up somewhere else, following through that chain was leading me to make lots of changes to shared code and that didn't feel right since native works fine. I think the check though different is correct.

@saurabh500 saurabh500 changed the title SqlClient handle socket network error gracefully WIP SqlClient handle socket network error gracefully Apr 22, 2019
@AfsanehR-zz AfsanehR-zz merged commit 21e1b5d into dotnet:master Apr 25, 2019
@AfsanehR-zz
Copy link
Contributor

There is something wrong with Github I merged your other PR and this one shows as merged as well. Need to revert this.

@AfsanehR-zz
Copy link
Contributor

@Wraith2 There is something strange going on with this PR. It shows the file contents the same as your other PR that got merged. Could you open a new PR again? I also checked your branch(sqlfix-33930) on your remote repo and it seems like it has the same contents as PR 36678

@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 25, 2019

I think this is a casualty of the problem I had with my repo trying to rebase on the UDT bytes PR. Since @saurabh is handling this I'll close and delete.

@Wraith2 Wraith2 deleted the sqlfix-33930 branch April 25, 2019 21:17
@saurabh
Copy link

saurabh commented Apr 25, 2019

You tagged the wrong guy. @Wraith2

@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 25, 2019

oh, darn, well would you like to deal with this one or shall I leave it with @saurabh500 😁

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SqlClient ManualTest MARSSyncTimeoutTest fails in managed mode
5 participants