-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change Recieve methods so that failures return null packets #350
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Hi @Wraith2 Apparently one of the pipelines (CI-Ub16-SQL17L-Core30 — #19346.01 failed) did fail with Timeout error when performing Bulk Copy, is there an additional code path that needs a fix? |
Also it seems the original issue was discovered in EF Core tests, I'll try to setup a build that runs EF Core tests in CI to capture potential problems that we could introduce accidentally. They were always manually run till now, but it would be good to run in CI with proposed PRs. |
I'm not aware of any other changes that were required. bulk copy is still going to use the same Receive methods on the underlying handles. I ran the test suite with forced managed mode to ensure that everything was still working, it was. Any chance you could just re-run that leg in case it's a real timeout? |
It was a random one, as expected which confirms there are possibilities of getting random timeouts still, yet hard to reproduce. The change doesn't look very complex to me, any would bring improvements, but just wanted to pen down, it doesn't solve all random timeout problems, or maybe the reason for timeout errors is something else. |
The timeout situation that this fixes isn't currently evident (as far as I know) but that's down to timing and delicate ordering of operations. If you start moving around things in the managed implementation it turns up with varying frequency. I made drastic changes to the managed implementation which made it a deterministic error and then spent a couple of weeks tracing individual packet lifetimes to isolate it. The faster you're able to run the managed implementation the more likely it seems to hit the problem, I'd like to be able to make the managed version faster than native if possible. We're currently in a stable position on a slightly unstable foundation and this change firms up the foundation a little. I'm certain there's still more to be done. |
When working on the managed networking implementation in corefx a problem with a seemingly random timeout kept occurring and caused problems with tests. Separately in a clean branch I did a lot of work to add packet caching and lifetime tracking, this work found the bug that was causing the initial timeout. The fix was not added to the original PR (which was merged) because the bug was being investigated by someone on the sqlclient ms team, I did leave a description of my findings though. This PR is a follow up to that description fixing the bad behaviour.
When using managed networking the real (not mars wrapper) handles have their receive methods called with ref or out parameters. While tracing individual packet lifetimes if found that some packets were being disposed of twice and after investigation is became clear that it was possible for the packet contents to be processed twice if there happened to be one in some situations. This PR changes the implementation of the
Receive
andReceiveAsync
so that any path other than success nulls out the ref/out packet parameter so that the caller can no longer reach the released packet.I can't add a test for this because it's extremely difficult to find the packet double release without extensive changes to the managed implementation. Logically a packet should not be accessed after it has been released so I thought it safe to try this small change as an independent fix because it contributes to reliability in difficult to reproduce situations.