Fix | Fixes wrong data errors due to exceptions not handled in time #894
Conversation
(force-pushed 0dbb3a9 to eed1be1)
@cheenamalhotra it will be hard to confirm this purely by code inspection.
Alternatively, I think we should add a check for managed SNI being enabled, and not enable the fix for native SNI unless we have a confirmed repro.
```
@@ -2060,6 +2060,10 @@ private IAsyncResult BeginExecuteReaderInternal(CommandBehavior behavior, AsyncC

if (!inRetry)
{
    if (asyncWrite)
    {
        _activeConnection.GetOpenTdsConnection().IncrementAsyncCount();
```
On line 2088, RunExecuteReader is called, which calls RunExecuteReaderTds, which already does something similar to what you are doing here for async cases. Why is that not working? Basically, we are ref-counting the connection before the execution of the command, and that seems to be happening for the code path you are changing as well. So why do we need this extra increment here?
```cs
// If we are in retry, don't increment the Async count. This should have already been set.
if (async && !inRetry)
{
    _activeConnection.GetOpenTdsConnection().IncrementAsyncCount();
    decrementAsyncCountOnFailure = true;
}
```
I think the code is already structured to do what you are doing here; the question is why the async count is getting messed up with the current code.
We'll need traces to confirm that for this area, so we can track the connection's activity.
I will work on that next week and will update the findings here.
So I captured traces for the failing scenario. We are setting these counters, but something goes terribly wrong when an exception from a previous query's timeout is thrown to the user. Take this case, for example:
The logs have this:
```
e414c538-cfba-41a2-b26e-b7495e5dc215  82 secs  44   Success
e414c538-cfba-41a2-b26e-b7495e5dc215  82 secs  529  Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
```
When "44" was queried, timeout occurred but User got data as both threads were blocked that handle callbacks "HandleReceiveError" v/s completion callback > and completion callback finished first > that led to Commit being performed, but also Attention signal was sent once "HandleReceiveError" did it's job, about which user is unaware and error stayed in Connection's state when it was pushed to pool.
When "529" was queried, driver queried the database, results came on time, and "async command counter was decremented" as a normal flow, but a pending exception was also found in this connection, due to which the data that was fetched, was not consumed and was pushed back to pool as-is. This is where the wrong data flow begins. The async counter seemed to have no relation to whether exception were to be handled before concluding, and by the time exception was handled, it was too late.
So it seems that with my changes I'm introducing another counter here on the same method, but it also saved the connection from going back to the pool when "529" was queried.
> and the error stayed in the Connection's state when it was pushed to the pool.
I dumped the state of both the good and bad objects from the pool and, apart from ids and muxing counters, saw no difference. So where is the state while it's in the pool?
@cheenamalhotra what is the async count in failure cases? You mentioned that it is neither incremented nor decremented without the fix, but I am trying to understand why that is the case.
Is it possible to modify the code and print the callers of SqlInternalConnectionTds.IncrementAsyncCount() and DecrementAsyncCount(), to see whether they got called and what the subsequent results were?
I do want to acknowledge that getting to the point where you can say the async count is messed up, and that that is what needs to be investigated, is a huge achievement, especially for a non-deterministic repro. I think we need to bridge the gap around "why", since the existing code is already structured to take care of ref-counting the asyncCount.
I've been working on debugging this issue as well. I added some code to identify internal connection instances that were in two states: known broken, because the connection had returned a clearly incorrect result, and known good, because it had returned a known good result after the broken connection had been identified. I then added code to put the spids in environment variables and dump the details of those state objects as they left the pool, immediately before being connected to the external command object. The state, apart from ids, was identical. I also added checks for data remaining in the network stream and the socket buffers on return from the pool; in both cases there was none. So I know that the poisoned state retained through the pool is not observable as it exits the pool, and the poisoned state wasn't observable as data either. So there was a state hole somewhere: the client state wasn't accurately reflecting something about the operations, representing everything as clean when it wasn't. This seems a reasonable mitigation for that to me.

The SNI layer is where I started investigating, and @cheenamalhotra noted that she was seeing SNIPackets being finalized, which pointed to there being a flaw in it somewhere. I spent a couple of weeks tracking down the recycling and packet flows. I've rewritten the MARS demuxer and debugged through a lot of paths to get all packets correctly recycled, even in the repro app, which is a massive-overload scenario with tasks that don't complete all over the place. There are definitely areas of the managed SNI layer that I can improve, but there is no clear flaw and, as noted above, the stored state remains consistent between known good and bad connections.

The other big difference between the managed and native implementations is how threading is used. Managed uses the thread pool, and the reproduction causes thread pool exhaustion along with all the non-deterministic behaviours that come with it. I strongly suspect that the native component uses IOCP internally, so it won't suffer from the same timing issues, and even if it has problems they're limited to the native layer and can only escape as error codes; in managed mode the failures can occur absolutely everywhere.
@Wraith2 you are right about the native implementation using IOCP and leveraging IO threads rather than application thread pool threads. That is the biggest problem in the managed SNI implementation, and it causes applications to suffer under stress, where timeouts may occur in scenarios where the Windows implementation works fine.
Yes, the screenshots above were run on Windows with managed SNI enabled via the AppContext switch, so this is not a platform-specific bug. The repro is also reliable, as we immediately see the invalid packets in a single run.

The count is always 0 for corrupted connections where exceptions occur or the end is not complete.

This could also be a reason. The issue is equally reproducible with Debug builds, and, yes, it's not MARS-specific. Our main focus was always the managed SNI, MARS, and TDS parser layers, as that was what we were reproducing, so that's why we never captured enough traces for the SqlCommand counters. I will also work on adding EventSource traces for this area, to be absolutely sure we are making the right fix.

Yes, absolutely! I will gather more details to confirm this is the right fix. This is the first change we made in the driver that stops wrong results from happening, so there's definitely a possibility of polishing it further, and we may find something more.
I spent about half an hour trying to repro this, but was not able to.
Thanks @cheenamalhotra for the response; help me with the conditions for a reliable repro.
Please use this repro; we don't need a very low-memory machine, I think. I'm able to reproduce it on a 32 GB RAM laptop. cc @Wraith2 (for your ease, all major test cases are here)

OK. I will give this a shot.
It can take a few seconds to get an invalid result. It's highly timing-dependent, and the more you add to try and debug the issue, the less likely it is to occur, because you stay on one thread for longer. I've got mine set up for 4000 threads and have set the thread pool max to that as well, to get things rolling with high concurrency quickly.
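For reference, a sketch of that thread pool setup; the 4000 figure comes from the comment above, and everything else is an assumption. Raising the minimum skips the pool's slow thread-injection ramp, so the repro reaches high concurrency quickly.

```cs
using System.Threading;

static class ReproThreadPoolSetup
{
    public static void Configure()
    {
        const int threads = 4000; // figure from the comment above; tune per machine

        // Pre-grow the pool so work items don't wait on gradual thread injection.
        ThreadPool.SetMinThreads(threads, threads);
        ThreadPool.SetMaxThreads(threads, threads);
    }
}
```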
@cheenamalhotra I see
You get an invalid count. I can push my repro if you want, but it's derived from the same codebase. Start it and let it run for a while, and you should get an invalid result before it hits a debug assert, most of the time.

I have the repro running for 10 minutes without an error right now, connecting to SQL Server on a reasonably latent network connection.
Sorry, I updated the repro as I was testing with ExecuteReader; please update the 659_2.cs class on line 96:
Got it. Trying again now. |
And I have it!!
No fundamental differences, just minor rearrangements and additional output. Mine identifies incorrect results with a (spid, expected, received) entry when it finds one, and there's some code for setting environment variables, which were then picked up by code inside the client library. I ran it for 5 minutes and got 1 incorrect result. Not unusual; there are some small changes in #889 which generally speed it up, so you get higher throughput and a better chance of hitting the error. It's non-deterministic, though, so it may just not fail for you.
@cheenamalhotra is there a dev doc on how to build the NuGet package and consume it in a local project for MDS? Or any quick instructions to refresh my memory?

The default build packages the NuGet. You don't need it, though, since you can add the project to the sln, add a project reference, and run the library directly.
I also updated the Case Study.zip file above (in the description); it contains an Excel sheet with all the details you may find helpful.
Just run |
How do I consume the package in the repro after building it? |
wooohhooo
It's rather disappointing that the thing that took us so much effort and time to track down is fixed with a simple call to a method that already exists. On the other hand, it's safer than writing lots of new code, so yay!
```diff
@@ -1973,6 +1975,7 @@ private SqlDataReader EndExecuteReaderInternal(IAsyncResult asyncResult)
     int? sqlExceptionNumber = null;
     try
     {
+        CheckThrowSNIException();
```
I'd love for this to be the final fix for this issue. But how come adding a call to CheckThrowSNIException() at these locations makes a difference, when the same method is already being called not much later in the flow?
From here: -> InternalEndExecuteReader (about 4 statements into it) -> CheckThrowSNIException().
I wonder if VerifyEndExecuteState is stopping the flow before it gets to CheckThrowSNIException.
That's true. But this flow actually completes, and the reader is provided to the user.
I'm looking more into why CheckThrow is swallowing that exception and not throwing it in that particular case.
Manual stepping does not reproduce the error, and not all cases that time out are bypassed, as we have good exception cases too. I'll continue scanning and tracking. It's far more dense than it seems 🙈
In InternalEndExecuteReader, before it checks for errors, it calls WaitForAsyncResults, which resets the _networkPacketSource, which is where the OnTimeout function puts the exception that we want to find and throw. It certainly changes the behaviour in one of the downstream callees of the poorly named CheckThrowSNIException (it doesn't throw...).

If the original author of all this isn't in an asylum at the moment, I expect they're currently located in a hidden base at the foot of a volcano, planning world domination.
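A simplified model of that hazard: the member names echo the ones above, but the bodies are illustrative stand-ins, not the real TdsParserStateObject code. If the source that OnTimeout faulted is replaced before anyone observes it, the exception is silently lost and the later check finds nothing to throw.

```cs
using System;
using System.Threading.Tasks;

class NetworkPacketSourceDemo
{
    private TaskCompletionSource<object> _networkPacketSource =
        new TaskCompletionSource<object>();

    // Timeout callback stores the exception for later observation.
    public void OnTimeout() =>
        _networkPacketSource.TrySetException(
            new TimeoutException("Execution Timeout Expired."));

    // Modeled after the reset described above: discards any stored fault.
    public void WaitForAsyncResults() =>
        _networkPacketSource = new TaskCompletionSource<object>();

    // Modeled after CheckThrowSNIException: the faulted source is already gone,
    // so this finds nothing to throw.
    public void CheckThrowSNIException()
    {
        if (_networkPacketSource.Task.IsFaulted)
        {
            throw _networkPacketSource.Task.Exception.InnerException;
        }
    }
}
```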
Unfortunately, after more digging, this change just got lucky and didn't capture any wrong data when we tested, but the use-case stays possible. So Attempt 3 is discarded too. 😞 The check we added here and the existing one have no state change in between; it's basically a matter of time before the timeout event fires.

To also confirm this is NOT a managed SNI/Linux issue, I was able to reproduce it with native SNI on Windows; I just had to blow up the counter from 4000 to 12M and wait around 30 minutes. I'm not trying to say this is a use-case, or that it may ever even occur in client applications, but it proves the theory that there is a basic driver design issue here: the driver lacks synchronization between command completion and the timeout event.

So we have a design problem to solve, irrespective of thread starvation, and the options we've tried so far are not going to fix it, since it doesn't seem like we can do anything from the SqlCommand layer or the main thread where the Reader is being prepared. We most probably need to double-check, before sending attention, whether the command for which the timeout event is going to raise attention is even active anymore! Or stop the timeout event that's already scheduled on the ThreadPool?

Now, because we have features like MARS, it's going to be extremely challenging to come up with a better design to handle this scenario. I don't know how other applications/drivers handle this case, but it seems a pretty dirty problem to solve, as we never "expect", in any generic design, that a timeout event could take a further 60-ish seconds to even get started! So whatever we were doing if the timeout didn't occur could eventually complete, while the things we do when the "timeout" event occurs could cause instability and serious consequences, as it's already scheduled in the ThreadPool - and can we even stop it from happening!!!???

Closing for now, we need some brainstorming here!
Is it worth considering the same approach used with statistics and some of the other timeouts? Take a timestamp, calculate the expiry time, keep that value around, and later check whether it's been exceeded. Since DateTime.UtcNow is synchronous and doesn't rely on a thread running, you can know whether you should have timed out and change state appropriately. Perhaps @roji has some insight about how this would work in the postgres driver?
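A minimal sketch of that timestamp idea under stated assumptions: DeadlineTracker and its members are hypothetical, not existing SqlClient API. The expiry check is a plain synchronous read, so it cannot be starved the way a timer callback can.

```cs
using System;

class DeadlineTracker
{
    private DateTime _deadlineUtc;

    // Record the expiry time up front, when the operation starts.
    public void Start(int timeoutSeconds) =>
        _deadlineUtc = DateTime.UtcNow.AddSeconds(timeoutSeconds);

    // Synchronous check: needs no timer thread, so it cannot be starved.
    public bool HasExpired => DateTime.UtcNow > _deadlineUtc;

    public void OnReadCompleted()
    {
        if (HasExpired)
        {
            // Data arrived late: surface the timeout even though the timer
            // callback never got a thread to run on.
            throw new TimeoutException("Execution Timeout Expired.");
        }
        // Otherwise, hand the result to the caller as usual.
    }
}
```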
I've read the above, but am lacking some context here :) If you guys can give a quick/short description I can try to help with ideas... Is the problem that some timer code is firing too late, at a point where it causes some sort of corruption? If so, common practice is to cancel the timer (e.g. when the command ends). For example, you can dispose a System.Threading.Timer in a way which guarantees that all callbacks have completed when DisposeAsync returns (but watch out for the different sync Dispose overloads). Another option would be for the timer callback to check some counter before doing its thing, and to increment that counter when the command completes; this effectively does the same thing, since the callback no longer does anything after the command is over. Hopefully I've understood more or less what's going on...
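A sketch of the disposal option, assuming a made-up CommandTimeoutGuard wrapper. The key call is the Timer.Dispose(WaitHandle) overload, which signals the handle once any queued callbacks have finished, so Complete can block until the timeout callback can no longer run.

```cs
using System.Threading;

class CommandTimeoutGuard
{
    private Timer _timer;
    private int _completed; // 0 = command running, 1 = command finished

    public void Start(int timeoutMs) =>
        _timer = new Timer(OnTimeout, null, timeoutMs, Timeout.Infinite);

    private void OnTimeout(object state)
    {
        // Second line of defence: do nothing if the command already completed.
        if (Volatile.Read(ref _completed) == 1)
        {
            return;
        }
        // ... send attention / mark the operation as timed out ...
    }

    public void Complete()
    {
        Interlocked.Exchange(ref _completed, 1);
        using (var callbacksDone = new ManualResetEvent(false))
        {
            // Dispose(WaitHandle) signals the handle once callbacks are done.
            if (_timer.Dispose(callbacksDone))
            {
                callbacksDone.WaitOne();
            }
        }
    }
}
```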
The problem is that the timer never fires, because the process is in a state of massive thread contention on the thread pool. The timer is set, work continues, and data arrives late, but because the timer hasn't been able to fire, the fact that it's late isn't seen; as a result we end up with the timer firing to set the expiry flag after we've already proceeded to return data to the user. So essentially we can't rely on anything that requires any form of thread pool or thread scheduling in general, since they may happen in a totally unreliable order. Are there any techniques you've used or seen in the postgres driver for dealing with thread pool unreliability?
OK - it seems like what I wrote above is probably relevant then. Timers - like other types of concurrent operations - indeed cannot be relied upon to happen at particular points in time; the only reliable way to handle this is either to disable the timer when the command completes - and to wait until all callbacks are guaranteed to have completed (the Timer class provides this) - or to make sure that if the timer fires after that point, it does nothing. Is there any reason why one of these options isn't feasible in this case?
Hi @roji, thanks for chiming in! In the current design, we don't dispose the packet timer for MARS [flow: PutSession > SessionPool.PutSession does not call DisposeCounters (non-MARS)], as I think multiple reads are possible for MARS in the same TdsParserStateObject: there are multiple sessions, and multiple events could have been triggered from different queries. We simply reset it to -1 (Infinite) when done with the read, to allow callbacks to complete. So even though we could dispose the timer with a WaitHandle to wait for the timer to complete, we are not able to do that here. It doesn't seem a very clean design, and we'll have to revalidate this flow.

I was thinking about whether the timer could be owned and controlled by the SqlCommand async APIs (assuming performance does not degrade for multi-packet reads) or by a MARS session, which could wait for its timer events to complete before completing the reader, and could also dispose the timer safely. We'll need to redesign this part, as currently the flow from SqlCommand to TdsParserStateObject is not clearly defined, and "ReadSni" is called from various places. If we have to redesign this, we could ensure the owning objects own the timer and dispose it safely, without leaving TdsParserStateObject with things unhandled.

Secondly, if we were to consider not doing the task on timeout after a flag/counter is reset, it will again interfere with MARS, I think, but we'll have to look at its feasibility.
Good luck @cheenamalhotra... Keep in mind that there's the second alternative: rather than disposing the timer (and waiting for callbacks to complete), you could allow the timer to fire, but its callback would check whether it should still perform its job (that's the counter solution I mentioned above). That would allow you to continue reusing the same timer without refactoring too much for ownership, while still stopping the callback from doing any damage if it runs too late.
It's hard to track down, but I'm fairly certain that the packet timer is associated with the TdsParserStateObject and that there is an individual instance per virtual handle, so an SNIMarsHandle will have an associated state and parser object of its own. That's why there are two constructors (with a ridiculously complicated call hierarchy) for the state object: one is a standalone object and the other has a reference to a physical state object, which is the one that holds the SNIMarsConnection multiplexor. So there is one timer per queryable connection's state object. The problem is that there's so much state in the state object, and it interacts in so many complicated ways, that I'm very wary of trying to change how any of it is used. What we really want is an async query object that we can use to track all the info about the running async query, rather than keeping it all directly on the state object.

If we keep the timer, then since we know the time that has to elapse, we can do an inline time check after an async read completes, and if the timeout has elapsed, try to invoke the OnTimeout method synchronously; as roji said, we then have OnTimeout check whether it's still needed (this might already work: if we set _attentionSent it should drop out safely). The problem I see is that OnTimeout uses a delay timer with more async stuff going on.
I thought so too, but 'attentionSent' actually resets to 'false' after receiving the Attention acknowledgement from the server (again, supporting MARS here), so it's not reliable, and therefore we would still end up sending attention. We also cannot set it to true on query completion, so we have to think of something else in the state object and/or the MARS session handle to be able to reliably continue.
Then let's use a new variable. One more won't cause huge problems; we just need to know where to clear it (probably in begin execute) and then interlock access when setting it.
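A sketch of that new flag, with invented names: cleared per operation and claimed with an interlocked exchange, so exactly one of the timeout callback and the completion path wins.

```cs
using System.Threading;

class OperationState
{
    private int _finished; // 0 = in flight, 1 = finished or timed out

    // Cleared at the start of each execution, per the suggestion above.
    public void BeginExecute() => Volatile.Write(ref _finished, 0);

    // Returns true for exactly one caller; the loser must do nothing.
    private bool TryFinish() =>
        Interlocked.CompareExchange(ref _finished, 1, 0) == 0;

    public void OnTimeout()
    {
        if (TryFinish())
        {
            // Safe to send attention: the command had not completed yet.
        }
        // Otherwise the command already finished; sending attention now would
        // poison the connection's state, as seen in this issue.
    }

    public void OnCompleted()
    {
        if (TryFinish())
        {
            // Normal completion; a late timeout callback becomes a no-op.
        }
    }
}
```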
This change fixes the wrong data issue in #659
Root Cause
Ref: Case_Study.zip (RCA explained - #659 (comment))
Attempt 3 [The Fix - Updated 04-02-2021]
The checks added here were missing from these SqlCommand methods, which caused the wrong data flow; adding them fixes the issue by capturing the exception on time and throwing it immediately.
These checks were always done in the old async APIs, e.g. InternalEndExecuteReader and InternalEndExecuteNonQuery.
The fix applies to all platforms on all supported .NET SDKs. APIs fixed:
Old Attempts
Attempt 2 [Updated 02-02-2021]
[Discarded - Breaks Timeout Flows]
The first attempted solution was discarded after further investigation, as the counter adjustment only caused the connection to be cleaned out of the connection pool; the original issue remained.
After more investigation into the trigger point of Attention packets, I looked into _networkPacketTimeout as used in SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs, lines 2388 to 2393 at cde615e.
When this timer times out (when the duration is updated again), it triggers an Attention signal, which is likely causing some of the issues here. Updating the duration while there is time left kicks off this timer to time out on another thread, independently, while the read could still be going on. If we only time out when we hit 0 (the next few lines), the timeout and the read are never in a position to occur simultaneously.
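An illustrative sketch of that idea with an assumed shape (the real logic lives in TdsParserStateObject; PacketTimeoutSketch and its members are hypothetical): while time remains, only push the single-shot timer out; raise the timeout solely when the remaining time has hit zero, so the attention path cannot run on another thread while a read is still in flight.

```cs
using System;
using System.Threading;

class PacketTimeoutSketch
{
    private readonly Timer _networkPacketTimeout;
    private DateTime _deadlineUtc;

    public PacketTimeoutSketch(int timeoutMs)
    {
        _deadlineUtc = DateTime.UtcNow.AddMilliseconds(timeoutMs);
        _networkPacketTimeout = new Timer(OnTimeout, null, timeoutMs, Timeout.Infinite);
    }

    // Called whenever the read makes progress and the duration is updated.
    public void OnReadProgress()
    {
        int remainingMs = (int)Math.Max(0,
            (_deadlineUtc - DateTime.UtcNow).TotalMilliseconds);
        if (remainingMs > 0)
        {
            // Time left: re-arm the timer instead of timing out concurrently.
            _networkPacketTimeout.Change(remainingMs, Timeout.Infinite);
        }
        else
        {
            // Out of time: handle the timeout synchronously on this thread.
            OnTimeout(null);
        }
    }

    private void OnTimeout(object state)
    {
        // ... send an attention signal / mark the operation as timed out ...
    }
}
```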
Attempt 1 [Discarded]
The repro apps created by the community in #659 were very helpful in reproducing this issue; thanks to everyone for helping us with that! We really appreciate your support and patience!
Repro Test Results
Below is a screenshot of the repro from #659 (comment) running in multiple parallel sessions all at once with the fix, which has never been possible before! Yes, there are network errors, but those are due to the thread-pool starvation issue #422, which we'll continue to investigate!
[Updated 04-02-2021]
[Attempt 3 - no wrong data observed in multiple sessions; results similar to the screenshot]
Without this fix, invalid data is easily captured in a single process.