Skip to content
This repository has been archived by the owner on Jul 7, 2024. It is now read-only.

SqlServer.Core: Performance-oriented SQL Server .NET driver #6

Closed
Tracked by #48253
roji opened this issue Feb 10, 2021 · 91 comments
Closed
Tracked by #48253

SqlServer.Core: Performance-oriented SQL Server .NET driver #6

roji opened this issue Feb 10, 2021 · 91 comments
Assignees
Labels
project-woodstar User Story A single user-facing feature. Can be grouped under an epic.

Comments

@roji
Copy link
Member

roji commented Feb 10, 2021

An experiment in collaboration with the community to determine what potential there is modern .NET features in a highly performant SQL Server driver.

@roji roji changed the title Experiment SqlServer.Core Feb 10, 2021
@roji roji changed the title SqlServer.Core SqlClient.Core Feb 12, 2021
@roji roji transferred this issue from dotnet/efcore Feb 12, 2021
@roji roji added the User Story A single user-facing feature. Can be grouped under an epic. label Feb 12, 2021
@ajcvickers ajcvickers changed the title SqlClient.Core SqlServer.Core Apr 22, 2021
@ajcvickers
Copy link
Member

ajcvickers commented Apr 23, 2021

@ErikEJ @Wraith2 @NickCraver @mgravell @Drawaes: Let's get this party started!

@smitpatel, @roji, and I discussed how to get started with Woodstar, and we have created some initial issues for the ideas we came up with. We want this to be collaborative, so please add your ideas, either by commenting on issues or in new issues as appropriate.

Initial ideas:

/cc @davidfowl

@NickCraver
Copy link
Member

Would it be useful to list current pain points, especially w.r.t. performance in the current driver structure? Discussing connection pooling and what we are/aren't able to see or get metrics on today due to what's exposed and what's internal is something that doesn't hurt so much per connection but is a pain at scale in various ways. Today, we're wrapping connections to make best-guess workarounds for how pooling is behaving, which commands are in flight when a stall happens, etc.

I know out goal is performance, but I think it's worth considering the driver/implementation surface area along these lines as we go, and maybe current pain points is the good thing to have as a starting point on that front?

@Wraith2
Copy link

Wraith2 commented May 2, 2021

  • Async strings and blobs. The current internal structure makes their performance non-linear. It is a long project to fix it but it's complicated and there is no mitigation other than using SequentialAccess mode and constructing the string yourself.
  • Task based internal architecture uses almost entirely imperative task classes rather than langauge supported and this generates a lot of garbage in Tasks, TaskCompletionSources, ContinueWith closures etc. This could all be langauge and valuetask based and perform faster and cleaner
  • The codebases mixes sync and async paths liberally and confusingly. Along with lots of locking this makes making logical deductions about safety hard. Atomic state changes would be a lot easier to work with
  • The code is complex and hard to follow with very little explanation for internal choices. e.g., why when connecting to a named address does it choose the last ip4 address returned from the dns to try first? If we change it do we break someone?
  • Parameters being non-generic forces the use of boxing object containers and leads to reflection based runtime type discovery and conversion.
  • Connection pool keys being canonicalized and stored as strings incurs overhead because every connection because the string has to be generated each time. A dedicated immutable type would be better.
  • The SNI (Sql Network Interface) layer) and Tds Layer are disconnected loops and every packet received is copied at least twice, at least half of the byte arrays used are dropped to GC because the lifetime isn't tracked so they aren't known to be safe to re-use.

Plus a whole load of other random bits and pieces that when/if we encounter the same problem in a new implementation I will hopefully spot and warn about so we can avoid the same implementation choice. In general it's about what you expect from a 20 year old codebase at this point. Decisions which were correct at the time are now outdated but the weight of the codebase is so high that changing direction carries unacceptably high risk. Writing from a high level async-first perspective will get us an easier codebase to work with.

@NickCraver
Copy link
Member

Should async-only be a design goal here, like HttpClient in .NET Core before...the happening?

@Wraith2
Copy link

Wraith2 commented May 2, 2021

I'd prefer to support as much of the existing external surface area as possible because it will enable widest adoption if we get to production stages. I'd say async first but not async only if that's possible, I think it is possible because postgres manages it afaik.

@davidfowl
Copy link
Member

I'd hope we'd build the core engine with a different API shape than what ADO.NET wants. We can write a slow adapter for that layer but the new APIs shouldn't be bound by it. Also take a look at this issue dotnet/runtime#28882 as a way to avoid boxing for DbParameters.

@Wraith2
Copy link

Wraith2 commented May 2, 2021

The problem with using another shape is knowing what it is, there can be no consumers until we write one so we don't know what shape is required without a specific and well defined set of consumer requirements, who's the first consumer?

@Drawaes
Copy link

Drawaes commented May 2, 2021

Well you have SO people commenting right here. The point I thought is to push the boundaries of performance and not support the widest number of use cases. As David has said slower compatibility could be built on top. In reality though in the case of sync you are just hiding what is really going on as all db operations are async in nature if there is network or any disk access involved.

@Wraith2
Copy link

Wraith2 commented May 2, 2021

Ok, pure perf it is and then we'll see. That'll allow ignoring all the existing public surface area defined in terms of Task and using ValueTask throughout which will be lovely.

You say that all db operations are async. All IO operations are async and if you meant that then I agree. However because TDS is packetized there can be multiple columns and even rows in a single packet so a single data fetch may not be async it could just be fetched from memory. At the moment in SqlClient you either have to allocate a task every time or you have to use sync and live with the thread blocking sometimes. ValueTask<T> GetFieldValueAsync(int index) will make that problem go away.

We may also want to think about being clever with pre-dispatching packet requests. Currently you run through the packet until you run out of data and then request more and wait for the response. If you can guess based on reading, a string for example, that you won't have enough data you could pre dispatch the next packet request and then it might be ready when you need it. A fundamental limitation of any library like this is network latency and we might be able to hide some of that.

@roji
Copy link
Member Author

roji commented May 3, 2021

My vote here would be to only do a non-ADO.NET provider if figures clearly show ADO.NET is a serious blocker for performance.
My Npgsql experience is telling me that ADO.NET in itself shouldn't be a (significant) blocker for high-perf scenarios, and doing a non-ADO.NET driver basically means it's unusable by all existing .NET layers (Dapper, EF, LLBLGen Pro...). Of course we can always build an ADO.NET adapter on top of a non-ADO.NET adapter, but I wouldn't necessarily go in the direction of "break the ecosystem" until we see a good reason to.

Note that while doing optimization work on EF Core, I've recently seen that saving an allocation here and there has less of an impact on end RPS perf than expected (by me at least), and in at least some cases ADO.NET can be improved upon too (e.g. dotnet/runtime#17446 for parameter boxing). In other words, if some aspects of ADO.NET do turn out to block perf, I'd rather this effort helped us identify and fix them, rather than doing something completely different.

However because TDS is packetized there can be multiple columns and even rows in a single packet so a single data fetch may not be async it could just be fetched from memory. At the moment in SqlClient you either have to allocate a task every time or you have to use sync and live with the thread blocking sometimes. ValueTask GetFieldValueAsync(int index) will make that problem go away.

This is a good example. If the Task allocation from DbDataReader.GetFieldValue turns out to be significant (and let's see that it's the case, rather than assuming it is), we could simply add a new ValueTask-returning API alongside the existing one. We could then promote that into ADO.NET.

Regardless of the above, I think an async-only driver definitely makes sense, at least as at a first phase (but maybe in general). Needing to support sync already creates various problems in today's world (e.g. no Pipelines support, at least last time I checked).

@Wraith2
Copy link

Wraith2 commented May 3, 2021

If the Task allocation from DbDataReader.GetFieldValue turns out to be significant (and let's see that it's the case, rather than assuming it is)

Well, I think so. This is from the TE Fortunes benchmark test project that I've been using for perf improvements over the last couple of years.

image

And since you're a bunch of people who'll know what you're looking at here's the dotTrace file for that so you can explore.
BenchmarkDb.zip

The thing we actually want is the Fortune object and the strings it contains, look how far down the list it is. I'm pretty sure that the entire query result is inside a single packet response so it's all in memory at the point when fields are being fetched. Allocating a Task for every int32 read from an in memory packet isn't the worst things happening but because of the public surface area even if I managed to make the internals ValueTask based I'd still have the allocation at the user surface. I've considered proposing ValueTask<T> ValueGetFieldValueAsync(int i) a couple of times but the name makes me have second thoughts let alone the process or approval and years it would take to get it adopted.

Anyway. You raise another good point @roji. What do we mean by high performance? The reason I started on this was that I wrote a really nicely performing server app and then the perf got destroyed by simply trying to save data to the database. I'd like to be able to plug in the database layer without it being the worst performing thing in the application.

In my experience so far query speed is dominated by network latency and we can't really do a lot about that. This is why memory makes a very small difference quite a lot of the time. What we miss in this scenario is that there will be higher layers doing other work like layout etc.

I think we're looking to get a driver which is as cpu and memory clean in as many aspects as possible so that many machine resources can be used simultaneously, so don't block threads don't waste memory causing time to be lost to GC. Get work done and get out of the way so higher layers can get higher throughput per unit hardware per time.

@roji
Copy link
Member Author

roji commented May 3, 2021

This is from the TE Fortunes benchmark test project that I've been using for perf improvements over the last couple of years.

We're getting down into the little details, but... I can't see any reason for GetFieldValueAsync to be used in TE Fortunes (or generally a big majority of ADO.NET usage). Unless one is reading very big rows and using CommandBehavior.SequentialAccess, the row is pretty much guaranteed to be buffered in memory; after all, without SequentialAccess one can read columns in any order. In that scenario, using GetFieldValueAsync is pretty much useless, and GetFieldValue should be used instead.

But if we want to add another API just to remove the Task allocation for when huge rows and SequentialAccess are being used, then it really isn't that complicated - mostly a matter of finding the right name. I think it's a pretty low-priority scenario (if you're reading huge rows, that extra allocation isn't likely to matter), but it's trivial to push through.

Re the rest... I do hope Npgsql shows that very good perf is possible while sticking to ADO.NET.

This is why memory makes a very small difference quite a lot of the time.

I'd want to see some backing to that assumption. Of course, I don't mean egregious useless allocations inside the driver (which is what SqlClient does in some cases) - that should simply be avoided or cleaned up. But again, my recent experience with EF Core shows that hunting down every single reasonable allocation has quickly diminishing returns, and therefore probably doesn't justify dumping ADO.NET.

To summarize, .NET does have a standard database API. While it's not the best it could be, there's an entire ecosystem around it, and very good performance has been achieved without dumping it. Could a bit more perf be squeezed by saving a couple allocations that ADO.NET forces us to do? Maybe. Does that justify doing something completely non-standard, where maybe ADO.NET itself could be fixed? I think that remains to be shown.

@Drawaes
Copy link

Drawaes commented May 3, 2021

I don't understand the point of this repo if you just want to improve the sql client under ado.net? You might as well work on fixing that under the hood. SslStream was a similar age, very risky, a mix of sync and async and promise based code and we fixed it from the inside. There is no point just making another ado.net sql lib with all those restrictions.

@davidfowl
Copy link
Member

I'm on @Wraith2 's side, that profile shows so much waste, like nuke all of those allocations😄

@bgrainger
Copy link

In MySqlConnector, I've considered providing a "direct" API that exposes the raw MySQL protocol through a .NET API to see how much overhead ADO.NET contributes. However, it always seemed extremely unlikely that it would be used anywhere except the TE benchmarks due to the established ADO.NET ecosystem.

I have no objections to this project pursuing a focused SQL Server solution to see what the maximum possible performance is, but would personally appreciate the effort going into directions that could benefit all ADO.NET providers/users. But please just take this as my 2c, and not a strong attempt to change the project direction.

Some ADO.NET pain points I'd like to see fixed:

  • No first-class "connection pool" concept. Every ADO.NET user has to create (and throw away) a DbConnection object that behind-the-scenes borrows an actual connection from a pool (and returns it when DbConnection.Dispose is called). This results in a lot of temporary garbage objects when it would be nicer to write pool.BorrowConnection (or similar) instead. Obviously that would be a big change in how ADO.NET is consumed.
    • Having a common, well-tested, low-overhead, lock-free connection pool available to use as an ADO.NET driver implementer would be very nice.
  • Connection Strings. Related to above, the pool is usually implicitly selected by setting DbConnection.ConnectionString. This is frequently parsed with DbConnectionStringBuilder, which adds some unavoidable overhead: Parsing connection string is expensive mysql-net/MySqlConnector#238. If a "connection pool" object were created, connection string parsing would only have to happen once, instead of per-DbConnection. (Drivers can do some optimisations right now based on connection string equality, of course.)
  • DbCommand, DbParameterCollection, etc. are a lot of ephemeral objects created for simple SQL statements. Dapper provides an ExecuteScalarAsync<T> extension method on DbConnection; it might be nice to have something similar as part of ADO.NET that could do some optimised parameter binding, execute the SQL, and return the single scalar result.
  • DbCommand.Prepare doesn't make a lot of sense on an ephemeral object if it's intended to create a reusable server-side object. Usually this ends up storing some state on the underlying object that represents an open server connection.

OTOH, if a lot of these (and similar) suggestions get implemented, maybe it's "not ADO.NET" anymore, and trying to preserve ADO.NET was a false goal? 😀

@Drawaes
Copy link

Drawaes commented May 3, 2021

This is my 2c as well. But I see it as an opportunity to write a new fresh DB surface area that while focused on Sql Server initially there is no reason there couldn't be a MySql etc version. Connection pooling is a very good example that I think shoe horning into existing ado.net would extra pain.

I would love to see a break from tradition of the example of pipelines vs streams. Initially built on their own and eventually bridging code to help you use pipelines with stream apis. But for pure perf you want pipelines all the way through.

@roji
Copy link
Member Author

roji commented May 3, 2021

To be clear, I'm not the boss here or anything - am just stating my opinions!

I don't understand the point of this repo if you just want to improve the sql client under ado.net? You might as well work on fixing that under the hood. SslStream was a similar age, very risky, a mix of sync and async and promise based code and we fixed it from the inside. There is no point just making another ado.net sql lib with all those restrictions.

There are various reasons why this isn't feasible - lots of backwards-compatibility, legacy and stability constraints simply don't make this possible. As a very simple example, SqlClient needs to continue supporting .NET Framework.

But I see it as an opportunity to write a new fresh DB surface area that while focused on Sql Server initially there is no reason there couldn't be a MySql etc version.

So there really are two efforts being discussed here:

  1. Writing a new, modern and efficient driver for SQL Server; this indeed resembles rewriting SslStream while retaining the same public API. There are indications that this would be of immense value in the ecosystem - and this is the original/main thing discussed here.
  2. Writing a new .NET database API, as an alternative to ADO.NET.

I see these as two very separate and orthogonal efforts, each being quite huge. For example, discussing a first-class, cross-database pooling component as @bgrainger mentioned (see dotnet/runtime#24856) is just on its own a potentially huge debate. I believe having focus on one of the above two is important (trying to tackle too frequently backfires), but again, that's just my opinion.

I'm guess I'm also a bit confused on why we're proposing to do a new database API. Are we convinced ADO.NET is a significant blocker to database driver perf, and if so, that we can't fix those problems incrementally? Do we have any data to support that? Or do we just hate the API and want a modern, new API (also legitimate)? Or do we just want to play around without any constraints (also legitimate)? The important point is that any new DB API breaks the ecosystem in both directions - DB drivers would need to be rewritten to support it (Npgsql, MySqlConnector, Oracle), and upper layers would have to do the work to support it as well (EF Core, Dapper, LLBLGen Pro...). The effort would be immense - and I'm not clear on exactly why we'd be proposing that.

@bgrainger FWIW I agree with a lot of your points above, though some things can be handled without API changes. For example, Npgsql does avoid parsing the connection string more than once; DbCommand instances can be recycled by their owning DbConnection, and similarly DbDataReader can be recycled by their owning DbCommand (Npgsql does these things); we can possibly do something similar

@roji
Copy link
Member Author

roji commented May 3, 2021

But to reiterate - I'm just a guy expressing my opinions. If we end up with a non-ADO.NET driver, and an ADO.NET shim over that, maybe that's fine too.

@roji
Copy link
Member Author

roji commented May 3, 2021

I'm on @Wraith2 's side, that profile shows so much waste, like nuke all of those allocationssmile

FWIW I'm pretty sure 90+% of these allocations are the SqlClient implementation, rather than anything necessarily having to do with ADO.NET.

@mgravell
Copy link
Member

mgravell commented May 3, 2021

and doing a non-ADO.NET driver basically means it's unusable by all existing .NET layers (Dapper, EF, LLBLGen Pro...)

FWIW, the resurrection of this thread a few days ago is part of what prompted me to get started on DapperAOT. Pretty sure I'll have code working in the flavor of Dapper (although using the new DapperAOT) in lock-step with any-and-all progress here.

@Wraith2
Copy link

Wraith2 commented May 3, 2021

The current SqlClient codebase is too dangerous to use a "move fast and break things" approach. The number of users and the fact that we want people to move from System to Microsoft versions requires a slow and measured approach. The unfortunate side effect of that is that the speed of change is slow. I've been tinkering for two years and that trace is still abysmal compared to modern code. It is also important to note that the codebase is difficult to work with.

The existing SqlClient library will not go away. It will not change with sufficient speed to improve performance for the majority of users. I'm fairly sure that the performance of SqlClient lags behind that of other database providers, even if it doesn't I'm sure there can be a better performing implementation.

In writing an SqlCore library we can:

  1. identify if there are new patterns that would be better than existing ADO approaches for high throughput, low overhead, high concurrency, etc scenarios
  2. provide a good non-surprising (no async over sync) async way to access Sql Server data
  3. provide a viable mitigation for existing problems with high string and binary use scenarios

I think all of these things have value.

Things that aren't current goals as I see it:

  1. replace or deprecate ADO, just not practical nor demonstrably of value
  2. create an ADO capable SqlCore driver
  3. create a driver which implements the entire breadth of the existing SqlClient driver (SqlNotification? not now).

We do not have to choose to ignore the existing ADO api surface because we can probably provide most of it built on a new implementation. This doesn't have to be an explicit goal but it may be something to be aware of as a long term nice-to-have so we can flag up changes which would prevent it being possible.

The existing ADO surface has problems such as Task<T> GetFieldValue<T>(int i) which can be worked around trivially with new implementation build using ValueTasks where it is appropriate. So perhaps we should work on a new implementation regardless of the api shape and then when we have something that works for trivial cases review what we've got and discuss further what direction we should take?

I'm happy with working towards specific use cases from StackOverflow if they can be provided. It's a point to work towards and learn about the problem space. Once we have something that is of worth or we collectively decide that the goal is impossible or improbable we can re-assess.

I don't see any problem with the general pattern of access which ado provides. As a sketch

public IAsyncEnumerable<Fortune> GetFortunes()
{
	await using (var connection = GetSqlCoreConnection())
	{
		using (var command = GetSqlCoreCommand(connection, command properties ?))
		{
			using (var reader = await command.GetReader(reader properties))
			{
				while (await reader.GetNextRow())
				{
					yield return new Fortune { Id = await reader.GetValue<int>(0), Message = await reader.GetValue<string>(1) };
				}
			}
		}
	}
}

but that doesn't mean we have to start off with implementing all the interfaces that are in the System.Data namespace to do it.

@GSPP
Copy link

GSPP commented Oct 6, 2021

(1)

I wonder how it would be possible to quantify the potential throughput gains. Internal driver overheads in SqlClient might stem from design choices, and those overheads might be smeared out across the code base. The profiler call tree might not highlight any particular hot point, yet an entirely different library design could potentially deliver big gains.

An example of this is JSON serialization. If you were to profile Newtonsoft.Json, you'd probably find it rather optimized. Yet, the new BCL JSON serializer achieves a 2x gain on Newtonsoft with a fresh design.

The IDataReader interface feels like it would architecturally add quite a bit of overhead because it forces you to operate row-by-row and column-by-column. The library has no way to optimize the whole operation. A while ago I made a proposal for how expression trees could be used to create highly efficient deserialization code. The key point is that the library is put in a position where it can optimize reading an entire row or even multiple rows in one call, placing the result directly into an instance of a user class. I'm linking to it here: dotnet/runtime#25739


(2)

Even a highly optimized async codepath has considerable CPU overhead. The async machinery can be quite expensive. In my estimation, it will be necessary to provide a truly synchronous code path. This code path will be the one providing the lowest CPU usage for API users that want to optimize for that.

With database access, the degree of parallelism usually is low because the database cannot take more than a handful of parallel queries or else it will become overloaded. For that reason, the thread saving benefits of async are less relevant in common scenarios. It seems that the main point of this new library would be to lower CPU usage which makes a synchronous codepath quite important.

To make an example: If your database server has 8 cores (and disregarding disks for simplicity) it cannot help throughput to make more than 8 parallel queries. Having a thread pool of 8 threads with synchronous IO is a perfectly adequate solution in this scenario, and it will be the fastest solution. (I'm simplifying a lot here to make the general point.)

The BCL team recently found the need to add synchronous APIs to HttpClient for two reasons: Satisfying synchronous interfaces, and it plainly uses less CPU.


(3)

There's probably value in exposing a zero-overhead TDS connection class that has no additional layers. No pooling, no retry, no threading, not even a connection string. It's usage would basically be:

new TdsConnection(ip, port, new TdsConnectionOptions() { ... })

This API would provide raw access to some abstraction of the protocol. I'm sure there are some attractive features in there that are not widely known. For example, the recently exposed SqlCommandSet is a TDS feature. Also, the bulk insert API has some features that are not exposed.

@roji
Copy link
Member Author

roji commented Oct 6, 2021

I wonder how it would be possible to quantify the potential throughput gains. Internal driver overheads in SqlClient might stem from design choices, and those overheads might be smeared out across the code base. The profiler call tree might not highlight any particular hot point, yet an entirely different library design could potentially deliver big gains.

One approach we've been thinking about is to get an upper bound for performance by doing an absolute-minimum implementation of a minimal "driver" (#12) - send a minimal query and provide some sort of bare API for getting the results. See here for an Npgsql experiment; we've got a similar prototype in the works for SQL Server but nothing public yet.

The IDataReader interface feels like it would architecturally add quite a bit of overhead because it forces you to operate row-by-row and column-by-column. The library has no way to optimize the whole operation.

This isn't really true... when the database sends back a resultset, it typically first sends some sort of metadata on the resultsets, which allows the driver to parse them (e.g. which column has which type); the driver can use that information to pregenerate any sort of code that will make subsequent row/column accesses faster. In fact, when a statement is prepared, this can even happen once for multiple query executions. This seems quite orthogonal to deserializing user POCOs vs. providing a DbDataReader-like interface to rows and columns.

A while ago I made a proposal for how expression trees could be used to create highly efficient deserialization code. The key point is that the library is put in a position where it can optimize reading an entire row or even multiple rows in one call, placing the result directly into an instance of a user class. I'm linking to it here: dotnet/runtime#25739

Here are some problems I see with this:

  • There are various dynamic scenarios where there's no user POCO to map to, because the query result shape isn't known in advance. Since these need to be supported, some sort of row/column API will be necessary, and so the POCO mapping you're proposing would be an extra thing.
  • See above for a discussion on whether such a high-perf driver should implement ADO.NET or not. Even if the driver doesn't implement ADO.NET (a bit along the lines of your proposal in (3) above), I feel it's important to at least have an optional ADO.NET adapter on top of it; anything else would basically make it unusable in the .NET ecosystem, as all ORMs would need to support support the custom API (EF Core, Dapper, LLBLGen Pro...). A driver mapping to a user POCO doesn't seem like it would be adaptable to ADO.NET via a layer on top, so once again a separate row/column API would be needed.
  • In general, mapping to user POCO always opens up lots of questions around customizations - do the database column names correspond exactly to the POCO property names? If not, how does one specify a mapping? All that stuff is typically handled by upper layers (e.g. EF Core, Dapper), and this proposal would introduce them to the lowest level of database interaction - this doesn't seem like a good idea.
  • Runtime code generation is generally something that's a bit "on the way out", with the recent focus on trimming and platforms where it isn't supported (e.g. iOS). To support those you'd have to move the code generation to build-time, e.g. with source generators, which is quite a different kind of project; see DapperAOT for a possibly similar approach.
  • Finally and most importantly, as I wrote in Add a fast and convenient deserialization API to ADO.NET runtime#25739, I'm not convinced simply introducing expression trees and deserializing to user POCOs necessary provides a meaningful perf advantage here. I won't repeat the arguments from the other issue, but in general most of the work of decoding wire results from the database would need to be done either way.
    • However, if you're convinced of this direction, the best way forward is to prove the perf advantage via a prototype. The Npgsql.Core prototype can serve as a basis for such an experiment, or if you'd prefer to work against SqlServer, we'll have a similar thing up at some point. If you can show an expression-tree POCO-deserializing driver working considerably faster than an equivalent rows/columns one, that would provide good motivation to go in that direction.

@Wraith2
Copy link

Wraith2 commented Oct 6, 2021

I wonder how it would be possible to quantify the potential throughput gains. Internal driver overheads in SqlClient might stem from design choices, and those overheads might be smeared out across the code base. The profiler call tree might not highlight any particular hot point, yet an entirely different library design could potentially deliver big gains.

The SqlClient cpu call graphs are a bit deeper than I'd like but in general cpu isn't a major problem (outside some known async problems). We spend most of the time waiting for network io . If I see ways to conserve cpu I implement them but in general avoiding allocations to reduce time spent outside user code yields far better returns at the moment. Fundementally, reading a forward only stream of self describing data isn't a hard cpu problem.

Anecdotally; calculating connection pool keys is one of the most cpu expensive things we do because of some pinvokes.

Providing a forward only reader implementation that requires the user to operate as if it were in sequential mode would significantly reduce internal complexity as would generics in some places.

Even a highly optimized async codepath has considerable CPU overhead.

It's also a tradeoff of cpu against time where cpu is a resource you can get more of relatively easily but time is linear and constrained, We usually judge that it is better to do as much work in a given allocation of time as is possible.

@bgrainger
Copy link

One approach we've been thinking about is to get an upper bound for performance by doing an absolute-minimum implementation of a minimal "driver" (#12) - send a minimal query and provide some sort of bare API for getting the results. See here for an Npgsql experiment; we've got a similar prototype in the works for SQL Server but nothing public yet.
...

  • However, if you're convinced of this direction, the best way forward is to prove the perf advantage via a prototype. The Npgsql.Core prototype can serve as a basis for such an experiment, or if you'd prefer to work against SqlServer, we'll have a similar thing up at some point. If you can show an expression-tree POCO-deserializing driver working considerably faster than an equivalent rows/columns one, that would provide good motivation to go in that direction.

If you happen to be more familiar with MySQL, I have a small prototype of what a "direct" .NET connection to MySQL would look like here: https://github.com/bgrainger/FrameworkBenchmarks/tree/mysql-direct/frameworks/CSharp/aspnetcore/Benchmarks/MySqlDirect; it could possibly be used as a base for experiments.

There may also be useful code in https://github.com/neuecc/MySqlSharp.

@roji
Copy link
Member Author

roji commented Oct 6, 2021

@Wraith2

Anecdotally; calculating connection pool keys is one of the most cpu expensive things we do because of some pinvokes.

Interesting...! Shameless hijacking: what are these pinvokes about? I have an ADO.NET thing I'd like to push in .NET 7.0 which may help a lot here... Will keep you posted.

@Wraith2
Copy link

Wraith2 commented Oct 6, 2021

They're the pinvokes behind Environment.UserName and Environment.DomainName on windows. It is possible to get both bits of information at once from one call but the BCL doesn't expose a way to do it so we call the same api twice and wastes buffers and cycles. It also always causes 3 string allocations to get a single string back. It's on my private trello list of "things to investigate" but trying to argue that we should do direct pinvokes into windows security related apis is going to get some pushback and it isn't worth taking my time trying to fight it at the moment.

@roji
Copy link
Member Author

roji commented Oct 6, 2021

@Wraith2 ok, thanks - do these at least only get called when the Username isn't present in the connection string?

@Wraith2
Copy link

Wraith2 commented Oct 6, 2021

It's not the database username it's the calling user. It depends on the SNI strategy. In native we use the windows identity SID in managed we use current username and password but we can't cache because of impersonation so we must always create and compare so even if we can find we've got the same user as last time we have to invoke and allocate the strings.
The pool implementation is probably the most complicated or confusing code in SqlClient, while I've wrestled with some of the other parts and has some success the pool beats me every time I go near it.

@roji
Copy link
Member Author

roji commented Oct 6, 2021

OK, thanks for the details @Wraith2. I plan to work on a proposal to basically allow users to directly reference a "DbDataSource" (naming temporary!), which typically would represent a pool; this would allow applications to resolve the pool from the connection string exactly once at startup, and then request connections from that. However, if you really want to check the system username on every open (because of impersonation), this may not help so much.

@KalleOlaviNiemitalo
Copy link

Can you first check whether the thread is impersonating, and if it is not, then use the previously allocated strings (or no strings at all)? Something with GetCurrentThreadToken and GetTokenInformation, perhaps. Whether that would help depends on how common impersonation is, I guess. And this would still involve "direct pinvokes into windows security related apis".

@Wraith2
Copy link

Wraith2 commented Oct 6, 2021

Impersonation and Mars are things that could be left out of the Core implementation quite easily. They'll still have to be supported in SqlClient but if you're looking for a fast minimal sql implementation I don't think you want them.

@IanRingrose
Copy link

public IAsyncEnumerable<Fortune> GetFortunes()
{
	await using (var connection = GetSqlCoreConnection())
	{
		using (var command = GetSqlCoreCommand(connection, command properties ?))
		{
			using (var reader = await command.GetReader(reader properties))
			{
				while (await reader.GetNextRow())
				{
					yield return new Fortune { Id = await reader.GetValue<int>(0), Message = await reader.GetValue<string>(1) };
				}
			}
		}
	}
}

We know that a single row is nearly always small enough to hold in memory and that the user often knows their rows are small. I would normally rather not have await reader.GetNextRow() complete until all fields are in memory ready to read. This can then totally advoid the overhead of having GetValue() returning a task that then needs checking to see of it is completed. (Think of what these unnecessary conditions checks do to the CPU instruction pipeline)

@Wraith2
Copy link

Wraith2 commented Dec 31, 2022

We know that a single row is nearly always small enough to hold in memory

How do we know this? I know that mine often are but working with the sqlclient driver has shown me bug replications where people do things I would never consider.

the user often knows their rows are small.

Yes. However through maintenance this has a habit of changing. If a small row is 3 items and the driver optimizes for that what happens when 2 years later someone adds 3 more fields. Do we see a slowdown? Is that consequence going to be obvious from the code?

This can then totally avoid the overhead of having GetValue() returning a task that then needs checking to see of it is completed

I'm almost certain we would use ValueTask as the return type. If for some reaon we could not then a custom poolable awaitable would likely be used. Any non-null valued data type including a single byte in the tds stream can span multiple packets (because there is always a header so all records are multibyte). That means that every read can cause a network delay to be observable, the faster you can consume the data the more likely you are to hit an io wait. This needs to be a task like return type.

@roji
Copy link
Member Author

roji commented Jan 1, 2023

I agree that the majority use case is small rows which can be easily buffered in memory, and that this scenario should be optimized for (of course while keeping good support for large rows). This is also the reasoning for ADO.NET's CommandBehavior.SequentialAccess, which is an opt-in behavior for large rows.

I would normally rather not have await reader.GetNextRow() complete until all fields are in memory ready to read. This can then totally advoid the overhead of having GetValue() returning a task that then needs checking to see of it is completed.

I'm a bit confused... If GetNextRow (ADO.NET's DbDataReader.Read) doesn't wait to buffer all columns, then the subsequent GetValue calls (ADO.NET's DbDataReader.GetFieldValue) do potentially have to do I/O (since the row hasn't been completely read and buffered yet). So either you (possibly) wait longer up-front in GetNextRow - and then yuo're guaranteed that subsequent GetValue calls just accessed buffered results - or GetNextRow returns early without buffering, and then subsequent GetValue calls may need to wait.

Stepping back... I think the low-level approach here would be to not do (implicit) row buffering. Crucially, this means it's the user's responsibility to read column values in the order in which they get delivered from the database; you can't read value 3 and then read value 1 (since we're just streaming columns). You also can't read the same value twice. This corresponds pretty much exactly to how CommandBehavior.SequentialAccess works today in ADO.NET.

Row buffering can be considered a "convenience" feature layered on top of this, allowing users to seek back in the row and read the same value multiple times. I agree it's still not necessary to pre-buffer the entire row in GetNextRow: I/O can simply be performed later - as needed - from GetValue, which would return ValueTask (so calls to values which happen to be already buffered are very cheap).

the user often knows their rows are small.

Yes. However through maintenance this has a habit of changing. If a small row is 3 items and the driver optimizes for that what happens when 2 years later someone adds 3 more fields. Do we see a slowdown? Is that consequence going to be obvious from the code?

I do think that programmers can be expected to know the general shape of their data (does a certain table contain huge rows or not), and that it's reasonable to expect stability here. If you don't want to make any assumptions (e.g. a dynamic data scenario), you can always use "sequential access" and stream columns - I don't think there's any downside to doing so, aside from a more difficult API to work with. Otherwise, if you use an API which does buffering (the ADO.NET default), and you encounter a huge row, your memory requirements will increase accordingly.

@IanRingrose
Copy link

We know that a single row is nearly always small enough to hold in memory

How do we know this? I know that mine often are but working with the sqlclient driver has shown me bug replications where people do things I would never consider.

the user often knows their rows are small.

Yes. However through maintenance this has a habit of changing. If a small row is 3 items and the driver optimizes for that what happens when 2 years later someone adds 3 more fields. Do we see a slowdown? Is that consequence going to be obvious from the code?

I am thinking of a mode of operation when the user states upfront the rows are small and an exception is throw if the rows are large.

"Small row" being defined as under the size of a network packet so it is known that a row will never need more then two networks pockets being buffered in a "zero copy" implementation.

=======
Another option would be for ReadRowAndAllValue() generic method that returns a triplet of all the values. I don't know if this is practical to implement.

@IanRingrose
Copy link

Should an optional callback based interface be provided so the user implementation OnNextRow() and OnNextStringValue(id, span) etc?

This would make it cheaper to read into an array(s) of structs for code that needs fast bulk loading. At present it can be faster to get SQLServer to write to a file and then read the file from .net, this should never be the case with a database access API.

@roji
Copy link
Member Author

roji commented Jan 4, 2023

I am thinking of a mode of operation when the user states upfront the rows are small and an exception is throw if the rows are large.

"Small row" being defined as under the size of a network packet so it is known that a row will never need more then two networks pockets being buffered in a "zero copy" implementation.

I think it's a good idea to stop and think exactly what kind of special optimizations we're thinking about for small rows vs. large rows. I think that the existing ADO.NET model isn't too bad here: an opt-in mode (SequentialAccess) for streaming the row and a default mode for, basically, buffering it. I don't know what "zero copy" means in this context: the driver needs to read raw bytes in any case, and decode them to a user-readable value (int, string); there shouldn't be any extra copying beyond that going on regardless of whether the row is small or not. The only impact that should have is whether we keep earlier columns in memory so the user can access them after later ones.

Another option would be for ReadRowAndAllValue() generic method that returns a triplet of all the values. I don't know if this is practical to implement.

That sounds like a very allocate-y API to get the row.

Should an optional callback based interface be provided so the user implementation OnNextRow() and OnNextStringValue(id, span) etc?
This would make it cheaper to read into an array(s) of structs for code that needs fast bulk loading. [...]

Not sure exactly why this is inherently faster than an API where application code calls ReadRow (it's basically just push vs. pull).

@IanRingrose
Copy link

Should an optional callback based interface be provided so the user implementation OnNextRow() and OnNextStringValue(id, span) etc?
This would make it cheaper to read into an array(s) of structs for code that needs fast bulk loading. [...]

Not sure exactly why this is inherently faster than an API where application code calls ReadRow (it's basically just push vs. pull).

It advoids C# having to feed as many returned tasks into the generated state machine.

@Wraith2
Copy link

Wraith2 commented Jan 4, 2023

"Small row" being defined as under the size of a network packet so it is known that a row will never need more then two networks packets being buffered in a "zero copy" implementation.

The number of row values in a single packet depends on the data values, not even the types. The ability to identify if you had a specific set of data which did not span a packet is prohibitively difficult to calculate. You can also change the packet size in the connection string.

I agree with @roji that the current high level way of accessing the data is reasonable and covers most scenarios that we need. All other methods considered have clear downsides like requiring sequential access at all times of large upfront investment of developer time to ensure they avoid packet boundaries which should be handled by the driver transparently.

@roji
Copy link
Member Author

roji commented Jan 4, 2023

"Small row" being defined as under the size of a network packet so it is known that a row will never need more then two networks packets being buffered in a "zero copy" implementation.

The number of row values in a single packet depends on the data values, not even the types. The ability to identify if you had a specific set of data which did not span a packet is prohibitively difficult to calculate. You can also change the packet size in the connection string.

@Wraith2 at least from my perspective, the idea here wouldn't be to automatically calculate anything, but rather for the developer to use one API or another based on whether they want to stream the row or not, based on their knowledge of their data domain. In other words, pretty much what CommandBehavior.SequentialAccess already means in ADO.NET.

I definitely do think we need to provide a row sequential/streaming mode, which avoids having to buffer entire rows in memory; otherwise huge rows must require huge amounts of memory, which isn't feasible. This is already supposed to be done in all major ADO.NET drivers (via CommandBehavior.SequentialAccess).

Should an optional callback based interface be provided so the user implementation OnNextRow() and OnNextStringValue(id, span) etc?
This would make it cheaper to read into an array(s) of structs for code that needs fast bulk loading. [...]

Not sure exactly why this is inherently faster than an API where application code calls ReadRow (it's basically just push vs. pull).

It advoids C# having to feed as many returned tasks into the generated state machine.

I'm not sure we should shape the entire row-reading API around saving on the async state machine box for when ReadRow needs to complete asynchronously (note that at least in reasonable scenarios, many/most invocations of ReadRow return synchronously since (small) rows are already buffered in memory).

@roji
Copy link
Member Author

roji commented Jul 1, 2024

See #22 for a summary of the Woodstar experiment.

@roji roji closed this as not planned Won't fix, can't repro, duplicate, stale Jul 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
project-woodstar User Story A single user-facing feature. Can be grouped under an epic.
Projects
None yet
Development

No branches or pull requests