
fix: Making open operation async first #2384

Closed
wants to merge 3 commits

Conversation

Farenheith

Attempt to fix #979
The general idea is to make the open operation async-first, which simplifies the OpenAsync flow and allows the sync Open to reuse the same logic through GetAwaiter.

I'm still figuring out how to work with and test this project, so I'm opening this as a draft, hoping someone can validate whether this is a valid approach or whether the idea is not feasible because it could introduce breaking changes I haven't noticed.

@Farenheith
Author

@dotnet-policy-service agree

Making the open operation async first simplifies the OpenAsync logic
and allows the sync Open to reuse the same logic through GetAwaiter

codecov bot commented Mar 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.85%. Comparing base (ced726a) to head (588ee4a).

❗ Current head 588ee4a differs from pull request most recent head 34e7372. Consider uploading reports for the commit 34e7372 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2384      +/-   ##
==========================================
+ Coverage   72.58%   76.85%   +4.27%     
==========================================
  Files         310      247      -63     
  Lines       61875    38876   -22999     
==========================================
- Hits        44911    29880   -15031     
+ Misses      16964     8996    -7968     
Flag      Coverage     Δ
addons    92.88% <ø>   (ø)
netcore   76.73% <ø>   (-0.12%) ⬇️
netfx     ?

Flags with carried forward coverage won't be shown.


This change is a little risky and I wanted to make it
in a separate commit, to see if the previous one could
pass at least most of the pipeline and to make this one
easier to revert
@edwardneal
Contributor

I've not got much to technically say about your changes, but I'd had a few ideas kicking around and this is as good a place as any to drop them.

I noticed that you've removed a few references to Thread.Sleep. This is definitely the right thing to do. I'm pretty much certain that these shouldn't be in most Task-based code, given the impact it has on the wider thread pool. I've not run any meaningful tests on this, but I'd speculate off-the-cuff that use of the thread-based operations such as Sleep within a Task-based model could feed into the performance issues which have been reported on the async code paths - anything async-based would be slower under load if a large chunk of the threads in the thread pool are asleep or blocked...

Taking this further though: there are a few code paths which also block the thread, rather than the Task. Monitor.Enter is one of them, and thus so are SyncAsyncLock and _parserLock. Any Wait on _parserLock has the potential to block the thread, as does any lock() statement (although I don't think there are any of those in SqlInternalConnectionTds.)

On three more specific points:

  • I can see that you've made SqlConnection.Open() just wait for the asynchronous method to complete. I'd actually suggest having InternalOpenAsync take a "sync" parameter and return a ValueTask. You can then thread that sync parameter all the way through. This avoids unnecessary allocations of Task instances, but more importantly it means that you're not introducing sync-over-async
  • You've made your changes to the SqlConnection file in netfx. This is the .NET Framework project; I'd suggest also making the changes to the file in netcore - the .NET Core project
  • Once the changes are made to the .NET Core project, it's definitely worth testing with the managed SNI parser. This is always used in Linux, but can be enabled in Windows with the Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows AppContext switch.

The last point is the most important one. The managed SNI parser has the largest async performance issues, and if there are any issues/incompatibilities, I think that's probably where you'll run into them.
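For reference, a minimal sketch of forcing the managed SNI layer on Windows in a test harness. The switch name is the one quoted in the comment above; the connection string is a placeholder, and the switch must be set before the first connection is created:

```csharp
using System;
using Microsoft.Data.SqlClient;

class ManagedSniSmokeTest
{
    static void Main()
    {
        // Must run before any SqlClient connection is created, otherwise
        // the native SNI layer may already be loaded.
        AppContext.SetSwitch(
            "Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows",
            true);

        // Placeholder connection string; now exercises the managed SNI path.
        using var connection = new SqlConnection("<connection string>");
        connection.Open();
    }
}
```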

@Farenheith
Author

Farenheith commented Mar 4, 2024

@edwardneal thanks for the tips. I'll look into it as soon as I get some time. There are other places where I used GetAwaiter that I should change, too. Maybe using ConfigureAwait(false) would mitigate the risk of deadlocks? I'll make another branch to try out your suggestion, but I can actually think of three ways to achieve this:

  • Duplicate every method into an async and a sync version. The drawback is obvious: I'd have two near-identical code paths to maintain;
  • Split the current methods into many parts and compose the calls in a sync and an async manner, which would mean a bigger refactoring;
  • Compose the tasks manually (similar to what the code does before this PR), which I think ends up generating more lines of code than the first approach, and also makes the code harder to maintain.

Of these three options, I think the second approach would bring the best benefits in the long run, although it would make this PR harder to review. I really like the idea of making the code 100% async, though, even if it means falling into a sync-over-async scenario, as async code removes the need for locks in many situations; but I'm not sure whether the risk of deadlocks can really be mitigated.

The same changes from netfx were replicated in netcore.
Also, ConfigureAwait(false) was added to lower the risk of
deadlock when using GetAwaiter. This approach still needs
to be checked
@edwardneal
Contributor

Sounds good! The pattern I've seen has been similar to:

async ValueTask<int> ActCore(int param1, bool async)
{
    if (async)
        return await NestedFuncAsync(param1).ConfigureAwait(false);

    return NestedFunc(param1);
}

ValueTask<int> ActAsync(int param1)
    => ActCore(param1, true);

int Act(int param1)
    => ActCore(param1, false).Result;

This provides a common codebase, while also bypassing the issues which come from sync-over-async.

@Farenheith
Author

Farenheith commented Mar 9, 2024

@edwardneal I have an idea:

What if I create a static helper object that uses AsyncLocal to determine whether the whole Open call stack must run synchronously?
With that approach, I can set the AsyncLocal instance to true in the OpenAsync operation, change all the methods on the stack to use ValueTask and, finally, write the "bifurcation" code only at the points where it really makes a difference, since everywhere I await a ValueTask that already holds a synchronous result, it still runs synchronously.

I think this way the code will be much cleaner and easier to maintain
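A minimal sketch of that idea (a hypothetical helper, not existing SqlClient code; `OpenMode` and its members are invented names). The flag flows with the async execution context, so every method in the Open call stack can read it without an extra parameter:

```csharp
using System.Threading;

// Hypothetical helper: AsyncLocal values flow across awaits with the
// execution context, so callees see the value set by their entry point.
internal static class OpenMode
{
    private static readonly AsyncLocal<bool> s_isAsync = new AsyncLocal<bool>();

    public static bool IsAsync
    {
        get => s_isAsync.Value;
        set => s_isAsync.Value = value;
    }
}

// OpenAsync would set OpenMode.IsAsync = true before entering the shared
// ValueTask-based pipeline; the sync Open would leave it false, so every
// ValueTask in the stack completes synchronously.
```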

@roji
Member

roji commented Mar 9, 2024

Keep in mind that AsyncLocal has a certain perf cost, which I'm not sure makes sense here. A usual thing to do in order to unify sync and async paths is simply to pass an "async" flag parameter down the stack instead.

@edwardneal
Contributor

I'd agree with roji here - I've not used AsyncLocal before, but it seems like overkill in most places when an async parameter can perform the bifurcation that you're talking about without a heap allocation, etc.

I agree with the rest of your point. Making the core code return a Task or a ValueTask is fine, and in the case where it runs synchronously you can simply return its Result property or invoke the code with async = false. The only reason I've picked ValueTask is that it's a value type, so doesn't necessarily incur a heap allocation (which would be guaranteed if it returned a Task instance) in the synchronous case.

To elaborate on thread-level synchronisation primitives though, please consider a situation where the thread pool is heavily loaded.

  1. In SqlInternalConnectionTds.Create, you call _parserLock.Wait, which in turn calls Monitor.Enter. This enters the thread-level sync primitive
  2. In OpenLoginEnlist, you drop through LoginNoFailover, then AttemptOneLogin, then Login, then into TdsParser and eventually down to SNITcpHandle.SendAsync where it writes to the network and awaits a response. The higher the network latency, the greater the likelihood that it has to await a response
  3. The task yields until it receives a response, then resumes on a different thread
  4. The parser lock is held by another thread (which has its own problems), so our task's new thread doesn't hold it. If the code tries to exit the monitor, it'll probably fail. If it tries to call Monitor.Enter on the same object again while it's not running on its original thread, it'll deadlock that thread.

We can't guarantee that a continuation of a Task/ValueTask will run on the same thread as its predecessor. It's possible that a ContinueWith could do the same thing, although I've picked a network write because it seems more likely to force the task to yield.

Incidentally, this is also part of the reason why we can't mix await and the lock statement. It's resolved by using task-level synchronisation primitives.
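To make that concrete, a sketch of the rejected pattern next to a task-level alternative (invented names; the compiler itself blocks the simplest form of the mix):

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class LockAndAwaitSketch
{
    private readonly object _gate = new object();
    private readonly SemaphoreSlim _asyncGate = new SemaphoreSlim(1, 1);

    public async Task SendGuardedAsync(Stream stream, byte[] buffer)
    {
        // The thread-level primitive cannot span an await:
        //
        // lock (_gate)
        // {
        //     await stream.WriteAsync(buffer); // error CS1996: cannot await
        // }                                    // in the body of a lock statement

        // A task-level primitive works from whichever thread the
        // continuation resumes on.
        await _asyncGate.WaitAsync().ConfigureAwait(false);
        try
        {
            await stream.WriteAsync(buffer, 0, buffer.Length).ConfigureAwait(false);
        }
        finally
        {
            _asyncGate.Release();
        }
    }
}
```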

It's not an immediate stoppage because it's "only" reducing thread pool capacity, but it'll develop into such if it's given time. This sort of issue might also be having an impact on the existing issues around MARS and general async performance, although I'm just speculating.

It's possible that other issues are constraining the library's connection speed enough for that not to happen in real life. If it can happen, it'll probably happen in a situation where something is opening SQL Server connections rapidly to a database server with higher network latency. One such scenario could be a web service connecting to a database which is going through a disaster recovery scenario.

I think the largest lock to try to unpick for this is the SyncAsyncLock which sits in SqlInternalConnectionTds, but it's used in quite a few places and I'm struggling to understand it.

@Wraith2 - I'd appreciate some help with this, if possible. My first impression is that SqlClient's likely to encounter this sort of trouble with thread blocking, am I missing something which mitigates this in SyncAsyncLock or more widely? I can see that SyncAsyncLock.Wait has a canReleaseFromAnyThread parameter, but this doesn't seem to be widely used and I'm not sure it'd fully resolve the problem (since Monitor.Enter is called whatever this parameter's value.) I can also see a straight-up lock in SNIMarsConnection.

I get the impression from reading the code that SyncAsyncLock was written to use thread-level locking, then had some ad-hoc changes to make sure that the existing locking methodology would carry on working with Tasks. I think its essential purpose is to ensure that TdsParser has control over the underlying connection's TDS-level data stream, with the limited exception of client-to-server Attention messages and client-driven connection closures. Am I understanding that correctly?

If so, do you think it'd be viable to have something like a simple SemaphoreSlim to handle all protection of the data stream and an associated CancellationTokenSource to handle the cases of client-driven connection closure and command cancellation?
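A sketch of what that could look like (an assumed design, not existing SqlClient code; `DataStreamGuard` and its members are invented names). A SemaphoreSlim serializes access to the data stream from any thread, and a CancellationTokenSource lets a client-driven closure abandon waiters:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical guard for the TDS data stream, per the suggestion above.
internal sealed class DataStreamGuard : IDisposable
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly CancellationTokenSource _closed = new CancellationTokenSource();

    // Acquire exclusive access to the stream; throws OperationCanceledException
    // if the connection is closed (or the caller's token fires) while waiting.
    public async Task<IDisposable> AcquireAsync(CancellationToken token = default)
    {
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(_closed.Token, token);
        await _gate.WaitAsync(linked.Token).ConfigureAwait(false);
        return new Releaser(_gate);
    }

    // Client-driven connection closure: cancels current and future waiters.
    public void Close() => _closed.Cancel();

    public void Dispose()
    {
        _closed.Dispose();
        _gate.Dispose();
    }

    private sealed class Releaser : IDisposable
    {
        private readonly SemaphoreSlim _gate;
        public Releaser(SemaphoreSlim gate) => _gate = gate;
        public void Dispose() => _gate.Release();
    }
}
```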

@JRahnama
Contributor

JRahnama commented Mar 9, 2024

This sort of issue might also be having an impact on the existing issues around MARS and general async performance, although I'm just speculating.

This seems to be the exact root cause of issue #422

@Farenheith
Author

Keep in mind that AsyncLocal has a certain perf cost, which I'm not sure makes sense here. A usual thing to do in order to unify sync and async paths is simply to pass an "async" flag parameter down the stack instead.

I understand. I've started updating this draft following the suggestion of the async flag, then.

@edwardneal all your concerns are pretty valid. With the open operation becoming async-first, we have a whole new world of scenarios to deal with. I was hoping this change could be simpler, but this is really a bigger issue and I have a lot to learn about this code to make it safely.

I'll take my time to look at the code, but I'm not sure this draft is a good idea anymore. It's very risky.

@Wraith2
Contributor

Wraith2 commented Mar 9, 2024

I agree with the rest of your point. Making the core code return a Task or a ValueTask is fine, and in the case where it runs synchronously you can simply return its Result property or invoke the code with async = false. The only reason I've picked ValueTask is that it's a value type, so doesn't necessarily incur a heap allocation (which would be guaranteed if it returned a Task instance) in the synchronous case.

For modern .NET you should use ValueTask where there is a reasonable assumption that an operation will complete synchronously the majority of the time; as a heuristic, consider 90% a decent starting point. If the code being changed fetches an existing connection from the in-memory pool, I'd say it should use ValueTask. If it establishes a new socket connection, I would expect Task. This information is taken from long conversations by the runtime team in API reviews on the topic.

I wouldn't expect to use AsyncLocal for opening a connection. My intention was to use a bool async parameter which will probably be quite viral like async itself is.
While performance is usually, correctly, a concern after something has been made to work there are considerations in async code that you should make while making it correct. Every await implies a compiler generated closure and the allocation for it. For opening and closing that might not be too bad but for field fetches it would be a performance killer. Also be aware that most task machinery like completion sources, cancellation registrations etc require allocations which add up quickly.

I wasn't going to start any of this sort of rewrite until the codebases are entirely merged. The difficulty is too high to implement and to review on multiple codebases. I've got a lot of things I could change if we could just get to a single codebase.

@JRahnama
Contributor

Closing this as it will be addressed in SqlClientx

@JRahnama JRahnama closed this Jun 26, 2024
Development

Successfully merging this pull request may close these issues.

SqlConnection.OpenAsync() may block on network i/o during connection creation
5 participants