
fix: Making open operation async first #2384

Closed
wants to merge 3 commits

Conversation

Farenheith

Attempt to fix #979
The general idea is to make the open operation async-first, which simplifies the OpenAsync flow and allows the sync Open to reuse the same logic through GetAwaiter.

I'm still figuring out how to work with and test this project, so I'm opening this as a draft, hoping someone can validate whether this is a valid approach or whether the idea is not feasible because it could introduce breaking changes I haven't noticed.

@Farenheith
Author

@dotnet-policy-service agree

Making the open operation async first simplifies the OpenAsync logic
and allows the sync Open to reuse the same logic through GetAwaiter

codecov bot commented Mar 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.85%. Comparing base (ced726a) to head (588ee4a).

❗ Current head 588ee4a differs from pull request most recent head 34e7372. Consider uploading reports for the commit 34e7372 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2384      +/-   ##
==========================================
+ Coverage   72.58%   76.85%   +4.27%     
==========================================
  Files         310      247      -63     
  Lines       61875    38876   -22999     
==========================================
- Hits        44911    29880   -15031     
+ Misses      16964     8996    -7968     
Flag      Coverage     Δ
addons    92.88% <ø>   (ø)
netcore   76.73% <ø>   (-0.12%) ⬇️
netfx     ?

Flags with carried forward coverage won't be shown.


This change is a little risky and I wanted to make it
in a separate commit, to see if the previous one could
pass at least most of the pipeline and to make this one
easier to revert
@edwardneal
Contributor

I've not got much to technically say about your changes, but I'd had a few ideas kicking around and this is as good a place as any to drop them.

I noticed that you've removed a few references to Thread.Sleep. This is definitely the right thing to do. I'm pretty much certain that these shouldn't be in most Task-based code, given the impact it has on the wider thread pool. I've not run any meaningful tests on this, but I'd speculate off-the-cuff that use of the thread-based operations such as Sleep within a Task-based model could feed into the performance issues which have been reported on the async code paths - anything async-based would be slower under load if a large chunk of the threads in the thread pool are asleep or blocked...

Taking this further though: there are a few code paths which also block the thread, rather than the Task. Monitor.Enter is one of them, and thus so are SyncAsyncLock and _parserLock. Any Wait on _parserLock has the potential to block the thread, as does any lock() statement (although I don't think there are any of those in SqlInternalConnectionTds.)

On three more specific points:

  • I can see that you've made SqlConnection.Open() just wait for the asynchronous method to complete. I'd actually suggest having InternalOpenAsync take a "sync" parameter and return a ValueTask. You can then thread that sync parameter all the way through. This avoids unnecessary allocations of Task instances, but more importantly it means that you're not introducing sync-over-async
  • You've made your changes to the SqlConnection file in netfx. This is the .NET Framework project; I'd suggest also making the changes to the file in netcore - the .NET Core project
  • Once the changes are made to the .NET Core project, it's definitely worth testing with the managed SNI parser. This is always used in Linux, but can be enabled in Windows with the Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows AppContext switch.

The last point is the most important one. The managed SNI parser has the largest async performance issues, and if there are any issues/incompatibilities, I think that's probably where you'll run into them.
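For reference, a minimal sketch of forcing the managed SNI layer on Windows in a test harness. The switch name is the one quoted in the comment above; the connection string is a placeholder, and the switch must be set before the first connection is created:

```csharp
using System;
using Microsoft.Data.SqlClient;

class ManagedSniSmokeTest
{
    static void Main()
    {
        // Must run before any SqlClient connection is created, otherwise
        // the native SNI layer may already be loaded.
        AppContext.SetSwitch(
            "Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows",
            true);

        // Placeholder connection string; now exercises the managed SNI path.
        using var connection = new SqlConnection("<connection string>");
        connection.Open();
    }
}
```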

@Farenheith
Author

Farenheith commented Mar 4, 2024

@edwardneal thanks for the tips. I'll look into it as soon as I get some time. There are other places where I used GetAwaiter that I should change, too. Maybe using ConfigureAwait(false) would mitigate the risk of deadlocks? I'll make another branch to try out your suggestion, but I can actually think of three ways to achieve this:

  • Duplicate every method into an async and a sync version. The drawback is obvious: I'd have two near-identical code paths to maintain;
  • Split the current methods into many parts and compose the calls in a sync and an async manner, which would mean a bigger refactoring;
  • Compose the tasks manually (similar to what the code does before this PR), which I think ends up generating more lines of code than the first approach, and also makes the code harder to maintain.

Of these three options, I think the second approach would bring the best benefits in the long run, although it would make this PR harder to review. I really like the idea of making the code 100% async, though, even if it means falling into a sync-over-async scenario, as async code removes the need for locks in many situations; but I'm not sure whether the risk of deadlocks can really be mitigated.

The same changes from netfx were replicated in netcore.
Also, ConfigureAwait(false) was added to lower the risk of
deadlock when using GetAwaiter. This approach still needs
to be checked
@edwardneal
Contributor

Sounds good! The pattern I've seen has been similar to:

async ValueTask<int> ActCore(int param1, bool async)
{
    if (async)
        return await NestedFuncAsync(param1).ConfigureAwait(false);

    return NestedFunc(param1);
}

ValueTask<int> ActAsync(int param1)
    => ActCore(param1, true);

int Act(int param1)
    => ActCore(param1, false).Result;

This provides a common codebase, while also bypassing the issues which come from sync-over-async.

@Farenheith
Author

Farenheith commented Mar 9, 2024

@edwardneal I have an idea:

What if I create a static helper object that uses AsyncLocal to determine whether the whole Open call stack must run synchronously?
With that approach, I can set the AsyncLocal instance to true in the OpenAsync operation, change all the methods on the stack to use ValueTask and, finally, write the "bifurcation" code only at the points where it really makes a difference, since everywhere I await a ValueTask that already holds a synchronous result, it still runs synchronously.

I think this way the code will be much cleaner and easier to maintain
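A minimal sketch of that idea (a hypothetical helper, not existing SqlClient code; `OpenMode` and its members are invented names). The flag flows with the async execution context, so every method in the Open call stack can read it without an extra parameter:

```csharp
using System.Threading;

// Hypothetical helper: AsyncLocal values flow across awaits with the
// execution context, so callees see the value set by their entry point.
internal static class OpenMode
{
    private static readonly AsyncLocal<bool> s_isAsync = new AsyncLocal<bool>();

    public static bool IsAsync
    {
        get => s_isAsync.Value;
        set => s_isAsync.Value = value;
    }
}

// OpenAsync would set OpenMode.IsAsync = true before entering the shared
// ValueTask-based pipeline; the sync Open would leave it false, so every
// ValueTask in the stack completes synchronously.
```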

@roji
Member

roji commented Mar 9, 2024

Keep in mind that AsyncLocal has a certain perf cost, which I'm not sure makes sense here. A usual thing to do in order to unify sync and async paths is simply to pass an "async" flag parameter down the stack instead.

@edwardneal
Contributor

I'd agree with roji here - I've not used AsyncLocal before, but it seems like overkill in most places when an async parameter can perform the bifurcation that you're talking about without a heap allocation, etc.

I agree with the rest of your point. Making the core code return a Task or a ValueTask is fine, and in the case where it runs synchronously you can simply return its Result property or invoke the code with async = false. The only reason I've picked ValueTask is that it's a value type, so doesn't necessarily incur a heap allocation (which would be guaranteed if it returned a Task instance) in the synchronous case.

To elaborate on thread-level synchronisation primitives though, please consider a situation where the thread pool is heavily loaded.

  1. In SqlInternalConnectionTds.Create, you call _parserLock.Wait, which in turn calls Monitor.Enter. This enters the thread-level sync primitive
  2. In OpenLoginEnlist, you drop through LoginNoFailover, then AttemptOneLogin, then Login, then into TdsParser and eventually down to SNITcpHandle.SendAsync where it writes to the network and awaits a response. The higher the network latency, the greater the likelihood that it has to await a response
  3. The task yields until it receives a response, then resumes on a different thread
  4. The parser lock is held by another thread (which has its own problems), so our task's new thread doesn't hold it. If the code tries to exit the monitor, it'll probably fail. If it tries to call Monitor.Enter on the same object again while it's not running on its original thread, it'll deadlock that thread.

We can't guarantee that a continuation of a Task/ValueTask will run on the same thread as its predecessor. It's possible that a ContinueWith could do the same thing, although I've picked a network write because it seems more likely to force the task to yield.

Incidentally, this is also part of the reason why we can't mix await and the lock statement. It's resolved by using task-level synchronisation primitives.
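To make that concrete, a sketch of the rejected pattern next to a task-level alternative (invented names; the compiler itself blocks the simplest form of the mix):

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class LockAndAwaitSketch
{
    private readonly object _gate = new object();
    private readonly SemaphoreSlim _asyncGate = new SemaphoreSlim(1, 1);

    public async Task SendGuardedAsync(Stream stream, byte[] buffer)
    {
        // The thread-level primitive cannot span an await:
        //
        // lock (_gate)
        // {
        //     await stream.WriteAsync(buffer); // error CS1996: cannot await
        // }                                    // in the body of a lock statement

        // A task-level primitive works from whichever thread the
        // continuation resumes on.
        await _asyncGate.WaitAsync().ConfigureAwait(false);
        try
        {
            await stream.WriteAsync(buffer, 0, buffer.Length).ConfigureAwait(false);
        }
        finally
        {
            _asyncGate.Release();
        }
    }
}
```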

It's not an immediate stoppage because it's "only" reducing thread pool capacity, but it'll develop into such if it's given time. This sort of issue might also be having an impact on the existing issues around MARS and general async performance, although I'm just speculating.

It's possible that other issues are constraining the library's connection speed enough for that not to happen in real life. If it can happen, it'll probably happen in a situation where something is opening SQL Server connections rapidly to a database server with higher network latency. One such scenario could be a web service connecting to a database which is going through a disaster recovery scenario.

I think the largest lock to try to unpick for this is the SyncAsyncLock which sits in SqlInternalConnectionTds, but it's used in quite a few places and I'm struggling to understand it.

@Wraith2 - I'd appreciate some help with this, if possible. My first impression is that SqlClient's likely to encounter this sort of trouble with thread blocking, am I missing something which mitigates this in SyncAsyncLock or more widely? I can see that SyncAsyncLock.Wait has a canReleaseFromAnyThread parameter, but this doesn't seem to be widely used and I'm not sure it'd fully resolve the problem (since Monitor.Enter is called whatever this parameter's value.) I can also see a straight-up lock in SNIMarsConnection.

I get the impression from reading the code that SyncAsyncLock was written to use thread-level locking, then had some ad-hoc changes to make sure that the existing locking methodology would carry on working with Tasks. I think its essential purpose is to ensure that TdsParser has control over the underlying connection's TDS-level data stream, with the limited exception of client-to-server Attention messages and client-driven connection closures. Am I understanding that correctly?

If so, do you think it'd be viable to have something like a simple SemaphoreSlim to handle all protection of the data stream and an associated CancellationTokenSource to handle the cases of client-driven connection closure and command cancellation?
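A sketch of what that could look like (an assumed design, not existing SqlClient code; `DataStreamGuard` and its members are invented names). A SemaphoreSlim serializes access to the data stream from any thread, and a CancellationTokenSource lets a client-driven closure abandon waiters:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical guard for the TDS data stream, per the suggestion above.
internal sealed class DataStreamGuard : IDisposable
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly CancellationTokenSource _closed = new CancellationTokenSource();

    // Acquire exclusive access to the stream; throws OperationCanceledException
    // if the connection is closed (or the caller's token fires) while waiting.
    public async Task<IDisposable> AcquireAsync(CancellationToken token = default)
    {
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(_closed.Token, token);
        await _gate.WaitAsync(linked.Token).ConfigureAwait(false);
        return new Releaser(_gate);
    }

    // Client-driven connection closure: cancels current and future waiters.
    public void Close() => _closed.Cancel();

    public void Dispose()
    {
        _closed.Dispose();
        _gate.Dispose();
    }

    private sealed class Releaser : IDisposable
    {
        private readonly SemaphoreSlim _gate;
        public Releaser(SemaphoreSlim gate) => _gate = gate;
        public void Dispose() => _gate.Release();
    }
}
```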

@JRahnama
Contributor

JRahnama commented Mar 9, 2024

This sort of issue might also be having an impact on the existing issues around MARS and general async performance, although I'm just speculating.

This seems to be the exact root cause of issue #422

@Farenheith
Author

Keep in mind that AsyncLocal has a certain perf cost, which I'm not sure makes sense here. A usual thing to do in order to unify sync and async paths is simply to pass an "async" flag parameter down the stack instead.

I understand. I've started updating this draft following the suggestion of the async flag, then.

@edwardneal all your concerns are pretty valid. With the open operation becoming async-first, we have a whole new world of scenarios to deal with. I was hoping this change could be simpler, but this is really a bigger issue and I have a lot to learn about this code to make it safely.

I'll take my time to look at the code, but I'm not sure this draft is a good idea anymore. It's very risky.

@Wraith2
Contributor

Wraith2 commented Mar 9, 2024

I agree with the rest of your point. Making the core code return a Task or a ValueTask is fine, and in the case where it runs synchronously you can simply return its Result property or invoke the code with async = false. The only reason I've picked ValueTask is that it's a value type, so doesn't necessarily incur a heap allocation (which would be guaranteed if it returned a Task instance) in the synchronous case.

For modern .NET you should use ValueTask where there is a reasonable assumption that an operation will complete synchronously the majority of the time; as a heuristic, consider 90% a decent starting point. If the code being changed fetches an existing connection from the in-memory pool, I'd say it should use ValueTask. If it establishes a new socket connection, I would expect Task. This information is taken from long conversations by the runtime team in API reviews on the topic.

I wouldn't expect to use AsyncLocal for opening a connection. My intention was to use a bool async parameter which will probably be quite viral like async itself is.
While performance is usually, correctly, a concern after something has been made to work there are considerations in async code that you should make while making it correct. Every await implies a compiler generated closure and the allocation for it. For opening and closing that might not be too bad but for field fetches it would be a performance killer. Also be aware that most task machinery like completion sources, cancellation registrations etc require allocations which add up quickly.

I wasn't going to start any of this sort of rewrite until the codebases are entirely merged. The difficulty is too high to implement and to review on multiple codebases. I've got a lot of things I could change if we could just get to a single codebase.

@JRahnama
Contributor

Closing this as it will be addressed in SqlClientx

@JRahnama JRahnama closed this Jun 26, 2024
Development

Successfully merging this pull request may close these issues.

SqlConnection.OpenAsync() may block on network i/o during connection creation
5 participants