Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QueryUnbufferedAsync with FirstAsync is slow if the query returns lots of records #2115

Open
cph0 opened this issue Sep 24, 2024 · 11 comments
Open

Comments

@cph0
Copy link

cph0 commented Sep 24, 2024

Using the System.Linq.Async NuGet Package.

var query = "SELECT * FROM table";
var cnn = ...
var firstRecord = await cnn.QueryUnbufferedAsync<type>(query).FirstAsync();
cnn.close();

If "table" has millions of rows this is slow. IAsyncEnumerable is designed to be more efficient, at least from a memory point of view. But when calling FirstAsync you would expect it to run quickly as each row is returned to the calling code when it is read.

The reader is being disposed without cancelling the command first. I did look at using ExecuteReaderAsync but I can't get access to the command.

@mgravell
Copy link
Member

That's curious, because it should "just work"; while I try to investigate, however, a pragmatic workaround may be to use QueryFirstAsync<type> - does this behave as expected?

@cph0
Copy link
Author

cph0 commented Sep 25, 2024

Using QueryFirstAsync has the same effect.

@mgravell
Copy link
Member

mgravell commented Oct 7, 2024

I believe this may simply be the server latency, i.e. the server is taking a long time to fetch the first row; does the same query in, for example, SSMS (or the relevant tool for your database) behave similarly?

@cph0
Copy link
Author

cph0 commented Oct 7, 2024

Yes it's slow in SSMS because it's loading millions of rows. It will be loading the first row quickly but the query is not finished. Using FirstAsync is having to wait for all rows to be loaded.

If I write it using DbCommand it's much faster when the command is cancelled before the reader is disposed.

var query = "SELECT * FROM table";
var cmd = cnn.CreateCommand();
cmd.CommandText = query;
var manualReader = await cmd.ExecuteReaderAsync();
await manualReader.ReadAsync();
cmd.Cancel(); //20 seconds without this
await manualReader.DisposeAsync();

@mgravell
Copy link
Member

mgravell commented Oct 7, 2024

hm; OK, that's interesting; I'll see what I can do with that

@mgravell
Copy link
Member

mgravell commented Oct 7, 2024

So; I can't quite repro as stated, but: I can see a pain point that is specific to async - specifically it is async and single-row specified and we read the primary grid to completion. This might also be specific to the ADO.NET provider - which exact provider are you using? is it Microsoft.Data.SqlClient? or System.Data.SqlClient?

Here's my test rig with what I see:

Found 100000 rows...
Reading one row, badly, via async Dapper...
Got row 1039451802 in 2534ms
Reading one row, badly, via async Dapper...
Got row 1039451802 in 23ms
Reading one row, badly, via sync Dapper + LINQ...
Got row 1039451802 in 21ms
Reading one row, badly, via sync Dapper + LINQ...
Got row 1039451802 in 2ms
Reading one row, badly, via async DapperAOT...
Got row 1039451802 in 2373ms
Reading one row, badly, via async DapperAOT...
Got row 1039451802 in 22ms
Reading one row, badly, via sync DapperAOT + LINQ...
Got row 1039451802 in 20ms
Reading one row, badly, via sync DapperAOT + LINQ...
Got row 1039451802 in 21ms
Reading one row, badly, via async ADO.NET...
Got row 1039451802 in 20ms
Reading one row, badly, via sync ADO.NET...
Got row 1039451802 in 20ms
Reading one row, badly, via async ADO.NET; read-grid...
Got row 1039451802 in 41ms
Reading one row, badly, via sync ADO.NET; read-grid...
Got row 1039451802 in 20ms
Reading one row, badly, via async ADO.NET; single-row...
Got row 1039451802 in 20ms
Reading one row, badly, via sync ADO.NET; single-row...
Got row 1039451802 in 21ms
Reading one row, badly, via async ADO.NET; single-row; read-grid...
Got row 1039451802 in 2363ms
Reading one row, badly, via sync ADO.NET; single-row; read-grid...
Got row 1039451802 in 20ms

The main rows of interest are:

Reading one row, badly, via async Dapper...
Got row 1039451802 in 2534ms
Reading one row, badly, via async DapperAOT...
Got row 1039451802 in 2373ms
Reading one row, badly, via async ADO.NET; single-row; read-grid...
Got row 1039451802 in 2363ms

Do you see the same? I'm using Microsoft.Data.SqlClient; if you're using System.Data.SqlClient, does changing to Microsoft.Data.SqlClient make what you see more consistent with what I see?

code:

using Dapper;
using Microsoft.Data.SqlClient;
using System.Data;
using System.Diagnostics;

static class Program
{
    static async Task<int> Main()
    {

        var builder = new SqlConnectionStringBuilder()
        {
            TrustServerCertificate = true,
            DataSource = ".",
            InitialCatalog = "master",
            IntegratedSecurity = true,
        };
        using var conn = new SqlConnection(builder.ConnectionString);
        await conn.OpenAsync();

        const int Expected = 100_000;

        int count;
        try
        {
            count = conn.QuerySingle<int>("""
    select count(1)
    from LoadTest;
    """);

            Console.WriteLine($"Found {count} rows...");
        }
        catch (SqlException ex) when (ex.Number == 208)
        {
            Console.WriteLine("Creating table...");
            await conn.ExecuteAsync("""
        create table LoadTest(id int not null, name nvarchar(100) not null);
        """);
            count = 0;
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(ex.Message);
            return -1;
        }
        if (count < Expected)
        {
            int need = Expected - count;
            Console.WriteLine($"Inventing {need} rows...");

            var rand = Random.Shared;
            var data = from id in Enumerable.Range(1, need)
                       select new MyRow { Id = rand.Next(), Name = CreateName(rand) };

            using var bcp = new SqlBulkCopy(conn);
            bcp.DestinationTableName = "LoadTest";
            bcp.EnableStreaming = true;

            await bcp.WriteToServerAsync(TypeAccessor.CreateDataReader(data));
            Console.WriteLine($"Copied: {bcp.RowsCopied}");
        }

        await RunDapper(conn);
        await RunDapperAot(conn);
        await RunAdoNet(conn, false, false);
        await RunAdoNet(conn, false, true);
        await RunAdoNet(conn, true, false);
        await RunAdoNet(conn, true, true);

        return 0;
    }

    const string Sql = """
    select id, name from LoadTest;
    """;


    public static async ValueTask<T> FirstAsync<T>(this IAsyncEnumerable<T> source)
    {
        await using var iter = source.GetAsyncEnumerator();
        if (!await iter.MoveNextAsync()) throw new EndOfStreamException("oops");
        return iter.Current;
    }


    [DapperAot(false)]
    async static Task RunDapper(SqlConnection conn)
    {
        Console.WriteLine("Reading one row, badly, via async Dapper...");
        var timer = Stopwatch.StartNew();
        var row = await conn.QueryFirstAsync<MyRow>(Sql);
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");

        Console.WriteLine("Reading one row, badly, via async Dapper...");
        timer = Stopwatch.StartNew();
        row = await conn.QueryUnbufferedAsync<MyRow>(Sql).FirstAsync();
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");


        Console.WriteLine("Reading one row, badly, via sync Dapper + LINQ...");
        timer = Stopwatch.StartNew();
        row = conn.QueryFirst<MyRow>(Sql);
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");

        Console.WriteLine("Reading one row, badly, via sync Dapper + LINQ...");
        timer = Stopwatch.StartNew();
        row = conn.Query<MyRow>(Sql, buffered: false).First();
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");
    }

    [DapperAot(true)]
    async static Task RunDapperAot(SqlConnection conn)
    {
        Console.WriteLine("Reading one row, badly, via async DapperAOT...");
        var timer = Stopwatch.StartNew();
        var row = await conn.QueryFirstAsync<MyRow>(Sql);
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");

        Console.WriteLine("Reading one row, badly, via async DapperAOT...");
        timer = Stopwatch.StartNew();
        row = await conn.QueryUnbufferedAsync<MyRow>(Sql).FirstAsync();
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");


        Console.WriteLine("Reading one row, badly, via sync DapperAOT + LINQ...");
        timer = Stopwatch.StartNew();
        row = conn.QueryFirst<MyRow>(Sql);
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");

        Console.WriteLine("Reading one row, badly, via sync DapperAOT + LINQ...");
        timer = Stopwatch.StartNew();
        row = conn.Query<MyRow>(Sql, buffered: false).First();
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");
    }

    async static Task RunAdoNet(SqlConnection conn, bool singleRow, bool readGrid)
    {
        var suffix = (singleRow ? "; single-row" : "") + (readGrid ? "; read-grid" : "");
        Console.WriteLine($"Reading one row, badly, via async ADO.NET{suffix}...");
        var timer = Stopwatch.StartNew();
        using var cmd = conn.CreateCommand();
        cmd.CommandType = CommandType.Text;
        cmd.CommandText = Sql;
        MyRow row;

        var Flags = (CommandBehavior.SequentialAccess | CommandBehavior.SingleResult) | (singleRow ? CommandBehavior.SingleRow : 0);

        await using (var reader = await cmd.ExecuteReaderAsync(Flags))
        {
            if (!await reader.ReadAsync())
            {
                throw new EndOfStreamException("no row!");
            }
            row = new()
            {
                Id = reader.GetInt32(0),
                Name = reader.GetString(1)
            };

            if (readGrid)
            {
                while (await reader.ReadAsync()) { }
            }
            while (await reader.NextResultAsync())
            {

            }
        }
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");

        Console.WriteLine($"Reading one row, badly, via sync ADO.NET{suffix}...");
        timer = Stopwatch.StartNew();
        using (var reader = cmd.ExecuteReader(Flags))
        {
            if (!reader.Read())
            {
                throw new EndOfStreamException("no row!");
            }
            row = new()
            {
                Id = reader.GetInt32(0),
                Name = reader.GetString(1)
            };

            if (readGrid)
            {
                while (reader.Read()) { }
            }
            while (reader.NextResult())
            {

            }
        }
        timer.Stop();
        Console.WriteLine($"Got row {row.Id} in {timer.ElapsedMilliseconds}ms");
    }

    static string CreateName(Random rand)
    {
        const string Alphabet = "abcdefghijklmnopqrstuvwxyz 0123456789,;-";
        var len = rand.Next(5, 101);
        Span<char> name = stackalloc char[len];
        foreach (ref char c in name)
        {
            c = Alphabet[rand.Next(Alphabet.Length)];
        }
        return new string(name);
    }

    public class MyRow
    {
        public int Id { get; set; }
        public string Name { get; set; } = "";
    }
}

@cph0
Copy link
Author

cph0 commented Oct 8, 2024

I am using System.Data.SqlClient. Switching to Microsoft.Data.SqlClient has no improvement.

I've ran your code and the speeds match. I tried my table and it takes 20 seconds for QueryUnbufferedAsync.

My table has an int primary key, another int column, a smallint column, a decimal column, three datetime columns and 3 other int columns.

My table has over 10 million records. If I just select the Id column for example, it works in under a second. Selecting all columns in SSMS takes 20 seconds.

If you create a table with enough columns and rows it will be slow in SSMS and hence slow with QueryUnbufferedAsync/FirstAsync.

Calling reader.DisposeAsync will wait for the command to finish (all rows).

https://stackoverflow.com/questions/66771460/sqldatareader-command-timing-out-on-close-dispose

@mgravell
Copy link
Member

mgravell commented Oct 8, 2024

See, the tricky thing here is: TDS. In TDS, errors can follow the data, and people usually want to see errors even if they occur after the data - we've had multiple error reports about when this hasn't been the case. While I'm sympathetic to making this as fast as it can be, ultimately select top 1 (or similar, pick a number) would seem to be a much more appropriate solution here. In fact, if you install the optional Dapper.Advisor package, you'll get DAP231:

SELECT for single row without WHERE or (TOP and ORDER BY)

linking to https://aot.dapperlib.dev/rules/DAP231

If we cancel the command, we don't correctly capture error metadata

@mgravell
Copy link
Member

mgravell commented Oct 8, 2024

To be explicit: my plan is to see whether we can get what we need with the change, in QueryRowAsync:

  {
      ...
-    while (await reader.ReadAsync(cancel).ConfigureAwait(false)) { /* ignore rows after the first */ }
      ...
  }
  while (await reader.NextResultAsync(cancel).ConfigureAwait(false)) { /* ignore result sets after the first */ }

mgravell added a commit that referenced this issue Oct 8, 2024
- correctness: do no use SingleRow by default; affects trailing errors
- performance: QueryUnbufferedAsync can mirror cmd?.Cancel() in finally (this is consistent with all other scenarios)
@mgravell
Copy link
Member

mgravell commented Oct 8, 2024

OK, so: good news first - we can trivially fix QueryUnbufferedAsync - you are correct, we should have been cancelling here, in the finally

Random context: I haven't changed QueryFirst* etc, because in that scenario, correctness should be prioritized; my testing here actually showed that some of our defaults downplay correctness - I've improved that in the PR.

@mgravell
Copy link
Member

mgravell commented Oct 8, 2024

typical test output now:

    bulk-insert complete; 500000 rows in 18ms
    bad things
    bad things
    sync top 1 read first complete; row 2119983792 in 6ms
    async top 1 read first complete; row 2119983792 in 1ms
    sync read first complete; row 2119983792 in 407ms
    async read first complete; row 2119983792 in 320ms
    sync read with error complete in 268ms; bad things
    async read with error complete in 314ms; bad things
    sync unbuffered LINQ read first complete; row 2119983792 in 1ms
    async unbuffered LINQ read first complete; row 2119983792 in 3ms

The bottom row is where we started here, so: that's fixed. The others are larger than I'd like, but as shown in the first 2 rows: the answer there is "add top or similar".

mgravell added a commit that referenced this issue Oct 9, 2024
…APIs (#2121)

* see #2115

- correctness: do no use SingleRow by default; affects trailing errors
- performance: QueryUnbufferedAsync can mirror cmd?.Cancel() in finally (this is consistent with all other scenarios)

* remove settings tweak
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants