Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Recieve methods so that failures return null packets #350

Merged
merged 1 commit into from
Dec 12, 2019

Conversation

Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented Dec 12, 2019

When working on the managed networking implementation in corefx a problem with a seemingly random timeout kept occurring and caused problems with tests. Separately in a clean branch I did a lot of work to add packet caching and lifetime tracking, this work found the bug that was causing the initial timeout. The fix was not added to the original PR (which was merged) because the bug was being investigated by someone on the sqlclient ms team, I did leave a description of my findings though. This PR is a follow up to that description fixing the bad behaviour.

When using managed networking the real (not mars wrapper) handles have their receive methods called with ref or out parameters. While tracing individual packet lifetimes if found that some packets were being disposed of twice and after investigation is became clear that it was possible for the packet contents to be processed twice if there happened to be one in some situations. This PR changes the implementation of the Receive and ReceiveAsync so that any path other than success nulls out the ref/out packet parameter so that the caller can no longer reach the released packet.

I can't add a test for this because it's extremely difficult to find the packet double release without extensive changes to the managed implementation. Logically a packet should not be accessed after it has been released so I thought it safe to try this small change as an independent fix because it contributes to reliability in difficult to reproduce situations.

Copy link
Contributor

@David-Engel David-Engel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cheenamalhotra
Copy link
Member

Hi @Wraith2

Apparently one of the pipelines (CI-Ub16-SQL17L-Core30 — #19346.01 failed) did fail with Timeout error when performing Bulk Copy, is there an additional code path that needs a fix?

@cheenamalhotra
Copy link
Member

Also it seems the original issue was discovered in EF Core tests, I'll try to setup a build that runs EF Core tests in CI to capture potential problems that we could introduce accidentally. They were always manually run till now, but it would be good to run in CI with proposed PRs.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Dec 12, 2019

I'm not aware of any other changes that were required. bulk copy is still going to use the same Receive methods on the underlying handles. I ran the test suite with forced managed mode to ensure that everything was still working, it was.

Any chance you could just re-run that leg in case it's a real timeout?

@cheenamalhotra
Copy link
Member

It was a random one, as expected which confirms there are possibilities of getting random timeouts still, yet hard to reproduce.

The change doesn't look very complex to me, any would bring improvements, but just wanted to pen down, it doesn't solve all random timeout problems, or maybe the reason for timeout errors is something else.

@cheenamalhotra cheenamalhotra added this to the 2.0.0-preview1 milestone Dec 12, 2019
@Wraith2
Copy link
Contributor Author

Wraith2 commented Dec 12, 2019

It was a random one, as expected which confirms there are possibilities of getting random timeouts still, yet hard to reproduce.

The timeout situation that this fixes isn't currently evident (as far as I know) but that's down to timing and delicate ordering of operations. If you start moving around things in the managed implementation it turns up with varying frequency. I made drastic changes to the managed implementation which made it a deterministic error and then spent a couple of weeks tracing individual packet lifetimes to isolate it. The faster you're able to run the managed implementation the more likely it seems to hit the problem, I'd like to be able to make the managed version faster than native if possible.

We're currently in a stable position on a slightly unstable foundation and this change firms up the foundation a little. I'm certain there's still more to be done.

@cheenamalhotra cheenamalhotra merged commit c95b0b3 into dotnet:master Dec 12, 2019
@Wraith2 Wraith2 deleted the bug-packetreuse branch December 13, 2019 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants