Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chaining two ThenInclude clauses to eager load many-to-many entities results in very slow performance/SQL #19418

Closed
Webreaper opened this issue Dec 28, 2019 · 2 comments

Comments

@Webreaper
Copy link

I have the following code:

var entries = db.BasketEntries
   .Include(x => x.Image)
   .ThenInclude(x => x.ImageTags)
   .ThenInclude(x => x.Tag)
   .ToArray();

The data model is:

  • Image - contains a list of ImageTags
  • ImageTag - foreign key join between Image and Tag
  • Tag - has a primary key and a keyword
  • BasketEntries is a single foreign-key relation to Image.

This gives me a many-to-many relationship between image and tag, via ImageTag. Since BasketEntry and Image are one-to-one, starting with db.BasketEntries acts as a filter (there is only one record in BasketEntries).

The performance characteristics are the issue - I noticed that when I get to 500k images and 1.2m ImageTag records, the above code takes more than 3 seconds to pull back the single record from BasketEntries, and hence a single image (and around 10 related imageTags and tags).

Looking at the SQL generated, I can see that this creates this SQL:

SELECT "b"."ImageId", "b"."DateAdded", "i"."ImageId", "i"."BasketId", "i"."CameraId", "i"."Caption", "i"."DateIndexed", "i"."DateTaken", "i"."Description", "i"."FileCreationDate", "i"."FileLastModDate", "i"."FileName", "i"."FileSize", "i"."FolderId", "f"."FolderId", "f"."ParentId", "f"."Path", "t0"."ImageId", "t0"."TagId", "t0"."TagId0", "t0"."Keyword", "t0"."Type"
FROM "BasketEntries" AS "b"
INNER JOIN "Images" AS "i" ON "b"."ImageId" = "i"."ImageId"
INNER JOIN "Folders" AS "f" ON "i"."FolderId" = "f"."FolderId"
LEFT JOIN (
   SELECT "i0"."ImageId", "i0"."TagId", "t"."TagId" AS "TagId0", "t"."Keyword", "t"."Type"
   FROM "ImageTags" AS "i0"
   INNER JOIN "Tags" AS "t" ON "i0"."TagId" = "t"."TagId"
) AS "t0" ON "i"."ImageId" = "t0"."ImageId"
ORDER BY "b"."ImageId", "i"."ImageId", "f"."FolderId", "t0"."ImageId", "t0"."TagId", "t0"."TagId0"

I think the slow performance is caused by the fact that there's a sub-select being executed within the left-join, but because there's no where-clause in that sub-select, it'll pull back all 1.2 million rows from the ImageTags table, before the left-join filters the ones that match via imageID. I think it's that which is making the query so slow.

Next I changed it to the following, to remove the ImageTags eager-loading from the first query, and then iterating on the resultant entities to enrich them via ImageTags/Tags using explicit loading:

var entries = db.BasketEntries
    .Include(x => x.Image.Folder)
    .Include(x => x.Image)
    .ToArray();

foreach (var img in images)
{
   db.Entry(img)
   .Collection(e => e.ImageTags)
   .Query()
   .Include(e => e.Tag)
   .Load();
}

Then the whole thing takes around 200ms - so 95% faster. This is because the SQL is now two very fast queries:

SELECT "b"."ImageId", "b"."DateAdded", "i"."ImageId", "i"."BasketId", "i"."CameraId", "i"."Caption", "i"."DateIndexed", "i"."DateTaken", "i"."Description", "i"."FileCreationDate", "i"."FileLastModDate", "i"."FileName", "i"."FileSize", "i"."FolderId", "f"."FolderId", "f"."ParentId", "f"."Path"
FROM "BasketEntries" AS "b"
INNER JOIN "Images" AS "i" ON "b"."ImageId" = "i"."ImageId"
INNER JOIN "Folders" AS "f" ON "i"."FolderId" = "f"."FolderId"

SELECT "i"."ImageId", "i"."TagId", "t"."TagId", "t"."Keyword", "t"."Type"
FROM "ImageTags" AS "i"
INNER JOIN "Tags" AS "t" ON "i"."TagId" = "t"."TagId"
WHERE "i"."ImageId" = u/__p_0

This is super-fast, as the inner joins reduce the resultset quickly. Lastly, I tried this - just pulling in the imageTags entities, but not their child Tag entities:

      var entries = db.BasketEntries
                                .Include(x => x.Image)
                                .ThenInclude(x => x.ImageTags)
                                .Include(x => x.Image.Folder)
                                .ToArray();

just to see what SQL would be generated. This is where it gets interesting:

SELECT "b"."ImageId", "b"."DateAdded", "i"."ImageId", "i"."BasketId", "i"."CameraId", "i"."Caption", "i"."DateIndexed", "i"."DateTaken", "i"."Description", "i"."FileCreationDate", "i"."FileLastModDate", "i"."FileName", "i"."FileSize", "i"."FolderId", "f"."FolderId", "f"."ParentId", "f"."Path", "i0"."ImageId", "i0"."TagId"
FROM "BasketEntries" AS "b"
INNER JOIN "Images" AS "i" ON "b"."ImageId" = "i"."ImageId"
INNER JOIN "Folders" AS "f" ON "i"."FolderId" = "f"."FolderId"
LEFT JOIN "ImageTags" AS "i0" ON "i"."ImageId" = "i0"."ImageId"
ORDER BY "b"."ImageId", "i"."ImageId", "f"."FolderId", "i0"."ImageId", "i0"."TagId"

It's still doing the left join - but crucially not on a subselect, so the resultset from that left join will be filtered efficiently. This query takes around 300ms - so an order of magnitude faster than when the left join is done on the subselect.

I think what needs to happen is that instead of creating a sub-select and left-joining, having the two ThenInclude clauses should just keep adding left-joins to the query, but without the sub-select pulling back the entire table first.

I've tried finding examples of how to eager-load many-to-many related data, to see if there's some other convention, pattern or syntax I should be using to optimise this query, but there don't appear to be any good examples. Most of the ones out there are with datasets of a few hundred test records, so nobody probably notices the massive slow-down that two subsequent ThenInclude methods cause When they're chained together.

Further technical details

EF Core version: 3.1.0
Database provider: SQLite
Target framework: .NET Core 3.1
Operating system: OSX (and same behaviour seen on Ubuntu)
IDE: Visual Studio for Mac v8.4 build 2653

@roji
Copy link
Member

roji commented Dec 29, 2019

Possible duplicate of #17622

@roji roji added the area-perf label Dec 29, 2019
@ajcvickers
Copy link
Member

@smitpatel to take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants