
Copy/upsert data across tables in bulk (INSERT ... SELECT) #27320

Open
roji opened this issue Jan 31, 2022 · 11 comments

Comments

@roji
Member

roji commented Jan 31, 2022

This issue tracks introducing an API to copy data inside the database, from one table (or several) into a destination table. Note this is different from bulk importing data from the client into the database (#27333).

This was split off from #795 (bulk update/delete).

Basic API

All SQL databases support a variant of the INSERT statement which accepts a query instead of a list of values:

INSERT INTO x SELECT * FROM y

The column list can be specified (INSERT INTO x (a, b) SELECT ...) or omitted (INSERT INTO x SELECT ...). If it's omitted, the subquery must return exactly the columns of the destination table, in order and with the correct types. Since relying on table column ordering is problematic (e.g. the order can't be changed after the table is created), we should probably force the user to always provide the column list explicitly.
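For illustration, the two SQL forms look like this (table and column names are made up for the example):

```sql
-- Explicit column list: only Name and Url are populated; other columns get defaults.
INSERT INTO Blogs2 (Name, Url)
SELECT Name, Url FROM Blogs1 WHERE Rating > 3;

-- Omitted column list: the subquery must return every column of Blogs2,
-- in table-definition order and with compatible types.
INSERT INTO Blogs2
SELECT * FROM Blogs1;
```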

Basic proposals:

// Variant 1: begin from the source table, flow data through the LINQ operators to the destination.
// The naming corresponds to the SQL (`INSERT INTO`), no ambiguity with change-tracking operations on DbSet (like with Update).
ctx.Blogs1.Where(...).InsertInto(ctx.Blogs2, b => b.Name)

// Variant 1.1: Naming-wise, we could make it extra-explicit that it's a bulk operation.
// Also more consistent with BulkDelete/BulkUpdate (with or without `Into` suffix).
// May be slightly ambiguous with bulk import (from client).
ctx.Blogs1.Where(...).BulkInsert(ctx.Blogs2, b => new { b.Name, b.Url })

// Variant 2: we can flip the order, but this adds nesting which seems unnecessary:
ctx.Blogs2.InsertFrom(ctx.Blogs1.Where(...), b => new { b.Name, b.Url })

Static column compatibility checking

It would be great to statically enforce that the column list matches the incoming columns from the source table, e.g. with the following signature:

public static void InsertInto<TSource, TDestination>(
    this IQueryable<TSource> source,
    DbSet<TDestination> destination,
    Expression<Func<TDestination, TSource>> columnSelector)
    where TDestination : class
{
}

This works great for a single column:

ctx.Blogs.Select(b => b.Name).InsertInto(ctx.Customers, c => c.Foo1);

With multiple columns, this fails to compile if the anonymous types' member names differ (anonymous types with different member names are different types):

ctx.Blogs.Select(b => new { b.Name, b.Url }).InsertInto(ctx.Customers, c => new { c.Foo1, c.Foo2 });

Requiring the source's and column's anonymous types to have the same field names seems... problematic (we really do want to project across different columns).

If we had value tuples in expression trees (yet again), this would work quite well:

ctx.Blogs.Select(b => (b.Name, b.Url)).InsertInto(ctx.Customers, c => (c.Foo1, c.Foo2));

In any case, if we don't want this to depend on value tuple syntax, we could give up compile-time enforcement with the following signature:

public static void InsertInto<TSource, TDestination, TColumns>(
    this IQueryable<TSource> source,
    DbSet<TDestination> destination,
    Expression<Func<TDestination, TColumns>> columnSelector)
    where TDestination : class
{
}

... and the query would fail at runtime if things are mismatched.

Finally, note that if we want to, we could have a specific overload for copying between tables mapped to shared type entity types - in this case no column list is necessary:

ctx.Blogs1.InsertInto(ctx.Blogs2);

Fancier examples

// With navigation:
ctx.Blogs1
    .Where(b => b.Posts.Any())
    .Select(b => new { Blog = b, FirstPost = b.Posts.OrderBy(p => p.Popularity).First() })
    .Select(x => new { x.Blog.Url, x.Blog.Name, x.FirstPost.Title, x.FirstPost.Author })
    .InsertInto(ctx.Foo, f => new { f.Url, f.Name, f.Title, f.Author });

// Fetching generated columns back (RETURNING/OUTPUT):
var ids = ctx.Customers
    .Select(...)
    .InsertInto(
        ctx.Blogs,
        columnSelector: b => new { b.Name, b.Url },
        returningSelector: b => b.Id)
    .ToList();

// Insert or ignore:
ctx.Customers.Select(...).InsertIntoOrIgnore(
    ctx.Blogs2,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name);

// Insert or update (UPSERT). The update action can access both the existing row and the excluded (which we attempted to insert):
ctx.Customers.Select(...).InsertIntoOrUpdate(
    ctx.Blogs,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name,
    updateAction: (existing, excluded) => new Blog
    {
        Name = existing.Name + "_updated",
        Url = excluded.Url
    });

// The same UPSERT with generated column fetch:
ctx.Customers.Select(...).InsertIntoOrUpdate(
    ctx.Blogs,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name,
    updateAction: (existing, excluded) => new Blog
    {
        Name = existing.Name + "_updated",
        Url = excluded.Url
    },
    returningSelector: b => b.Id);
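As a sketch of what the two upsert flavors above could translate to on PostgreSQL (hypothetical translation; the actual SQL would be provider-specific, e.g. MERGE on SQL Server):

```sql
-- InsertIntoOrIgnore:
INSERT INTO "Blogs2" ("Name", "Url")
SELECT ... FROM "Customers"
ON CONFLICT ("Name") DO NOTHING;

-- InsertIntoOrUpdate, where EXCLUDED refers to the row we attempted to insert,
-- with RETURNING fetching back the generated key:
INSERT INTO "Blogs" ("Name", "Url")
SELECT ... FROM "Customers"
ON CONFLICT ("Name") DO UPDATE
    SET "Name" = "Blogs"."Name" || '_updated',
        "Url" = EXCLUDED."Url"
RETURNING "Id";
```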


// With sproc as input on SQL Server (generates `INSERT ... EXECUTE`):
ctx.SprocSet.InsertInto(ctx.DbSetMappedToSproc, b => new { ... });

// With common table expression (depends on #26486):
EF.Functions.With(...).InsertInto(ctx.Blogs, b => new { ... });

Additional notes

  • SQL Server supports INSERT ... EXECUTE for copying the results of a sproc.
  • All providers support WITH with INSERT (WITH ... AS x INSERT INTO Y SELECT * FROM x) - this is important for recursive WITH queries. PostgreSQL even allows capturing the results of an UPDATE ... RETURNING with WITH, and inserting them into a table.
  • Just like with bulk update/delete, INSERT supports RETURNING/OUTPUT which returns columns; this is useful to get back auto-generated columns which aren't updated by the query (e.g. the ID). So like for update/delete, we'd have an overload which accepts an expression to determine the columns to return, and returns an IQueryable with those columns.
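For the last point, the RETURNING flavor of INSERT ... SELECT could look like this on PostgreSQL (OUTPUT on SQL Server); a sketch with assumed column names:

```sql
INSERT INTO "Blogs" ("Name", "Url")
SELECT "Name", "Url" FROM "Customers"
RETURNING "Id";
```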

Documentation

Community implementations

@roji
Member Author

roji commented Feb 2, 2022

Updated the above with both fetching back generated columns (RETURNING/OUTPUT), and how UPSERT would fit into this API ("bulk upsert").

UPSERT implementations across databases generally accept the following information:

  1. Conflict action - either ignore or update. We should probably have two methods to represent this (InsertIntoOrIgnore, InsertIntoOrUpdate)
  2. Columns that make up the uniqueness check, or match condition (usually expected to correspond to a unique constraint). In some cases (PG, SQLite) the constraint name can be provided directly.
    • MySQL doesn't require or support this - violation of any unique constraint always triggers the update action (it's discouraged to use UPSERT in MySQL on tables with more than one unique constraint). MySQL can simply ignore this parameter (and even introduce an overload without it).
  3. If doing insert or update, the SET part - this is very similar to how bulk update works (UPDATE ... SET ...). Note that in most databases it's possible for SET to refer to both the existing row and the excluded incoming one.
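To make point 2 concrete, the conflict-target clause differs per database (simplified sketch with made-up table t and columns a, b):

```sql
-- PostgreSQL / SQLite: explicit conflict target (PG also accepts a constraint name):
INSERT INTO t (a, b) VALUES (1, 2)
ON CONFLICT (a) DO UPDATE SET b = excluded.b;

-- MySQL: no conflict target; violating any unique constraint triggers the update:
INSERT INTO t (a, b) VALUES (1, 2)
ON DUPLICATE KEY UPDATE b = VALUES(b);

-- SQL Server: MERGE with an explicit match condition:
MERGE t USING (VALUES (1, 2)) AS s (a, b) ON t.a = s.a
WHEN MATCHED THEN UPDATE SET b = s.b
WHEN NOT MATCHED THEN INSERT (a, b) VALUES (s.a, s.b);
```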

@AndriySvyryd
Member

Looks good. We'll probably need overloads for types without a DbSet:

ctx.Blogs1.Where(...).InsertInto<Blogs2>(b => b.Name);
ctx.Blogs1.Where(...).InsertInto<Blogs2>("STETBlog", b => b.Name);

@borisdj

borisdj commented Feb 3, 2022

Just to add to the list: EFCore.BulkExtensions now has a SourceTable feature with the config option CustomSourceTableName, and also optionally supports a CustomSourceDestinationMappingColumns dictionary.
Usage example:
context.BulkInsert(new List<Item>(), b => b.CustomSourceTableName = "ItemBase");

@roji
Member Author

roji commented Oct 1, 2022

Another use case that may be covered here is having client-provided data as the source, e.g. provide an array of .NET instances and upsert them into the target table in the database. With MERGE this looks like this:

MERGE TheTable t
USING @data d
ON t.Id = d.Id
WHEN MATCHED THEN 
    UPDATE SET [Text] = d.[Text]
WHEN NOT MATCHED THEN 
    INSERT (Id, [Text]) VALUES (d.[Id], d.[Text]); 

The API shape above may not be suitable for this, since it starts with a DbSet as the source, but we want to send client data. An interesting idea is to add an API to create a (temporary) DbSet out of a collection of client-side instances, which we can then e.g. upsert into a table in the database.

Note that for SQL Server, the above requires a custom type, which is a problem (but PG should support this without one, as tables have types created for them implicitly).
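A sketch of what such a client-data API could look like (every name here is hypothetical — nothing like this exists in EF Core today):

```csharp
// Client-side instances we want to upsert into the database.
var data = new[]
{
    new Blog { Id = 1, Name = "one" },
    new Blog { Id = 2, Name = "two" }
};

// Hypothetical TemporarySet API wrapping client data as a queryable source;
// on SQL Server this would be backed by a table-valued parameter (which needs
// a custom type), on PostgreSQL by e.g. an unnest/VALUES source.
ctx.TemporarySet(data).InsertIntoOrUpdate(
    ctx.Blogs,
    insertColumnSelector: b => new { b.Id, b.Name },
    uniquenessColumnSelector: b => b.Id,
    updateAction: (existing, excluded) => new Blog { Name = excluded.Name });
```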

@roji roji changed the title Copy data across tables in bulk (INSERT ... SELECT) Copy/upsert data across tables in bulk (INSERT ... SELECT) Oct 1, 2022
@AraHaan
Member

AraHaan commented Dec 30, 2022

Question: would this proposed API allow inserting items (from any LINQ query) in bulk without tracking them? I have run into such a use case when running my Discord bot on a 2 GB RAM Ubuntu VPS; it ends up OOMing fast (I think the change tracker contributes to the problem, on top of Remora.Discord's caching).

@roji
Member Author

roji commented Dec 30, 2022

@AraHaan no - this is about copying data from one table to another. Bulk import from the client is covered by #27333.

In the meantime, you can drop down to your driver's low-level import mechanism (e.g. SqlBulkCopy for SQL Server).
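For reference, a minimal SqlBulkCopy sketch (Microsoft.Data.SqlClient; the destination table, its columns, and connectionString are assumptions for the example):

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

// Build an in-memory table whose columns match the destination schema.
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Url", typeof(string));
table.Rows.Add("my blog", "http://example.com");

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

// Stream the rows into the destination table, bypassing EF's change tracker.
using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.Blogs"
};
await bulkCopy.WriteToServerAsync(table);
```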


@stevenpua

stevenpua commented Jan 10, 2024

Greetings all! How do we upvote this? This would make a lot of our app database-provider-agnostic; we would no longer need to use ExecuteSql once this is ready. How will this work with the non-SQL providers (or will there be no support for those)?

Leaving this comment to upvote this.

@roji
Member Author

roji commented Jan 10, 2024

@stevenpua you upvote issues by upvoting the top-most comment above (the issue description) - posting comments is not something that helps us prioritize, and creates needless notifications and churn.

@roji
Member Author

roji commented Jan 10, 2024

Re non-SQL providers: the API here will likely be abstract and implementable by any provider type, so at least in principle, any database that supports a copy-across-tables mechanism would be covered. However, that's something non-SQL databases don't typically implement.
