
Copy/upsert data across tables in bulk (INSERT ... SELECT) #27320

Open
roji opened this issue Jan 31, 2022 · 11 comments

Comments

@roji
Member

roji commented Jan 31, 2022

This issue tracks introducing an API to copy data inside the database, from one table (or several) into a destination table. Note this is different from bulk importing data from the client into the database (#27333).

This was split off from #795 (bulk update/delete).

Basic API

All SQL databases support a variant of the INSERT statement which accepts a query instead of a list of values:

INSERT INTO x SELECT * FROM y

The column list can be specified (INSERT INTO x (a, b) SELECT ...) or omitted (INSERT INTO x SELECT ...). If it's omitted, the subquery must return exactly the columns of the destination table, in order and with the correct types. Since relying on table column ordering is problematic (e.g. the order can't be changed after the table is created), we should probably force the user to always provide the column list explicitly.
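For illustration, the two SQL forms look like this (table and column names are made up for the example):

```sql
-- Explicit column list: only Name and Url are populated; other columns get defaults.
INSERT INTO Blogs2 (Name, Url)
SELECT Name, Url FROM Blogs1 WHERE Rating > 3;

-- Omitted column list: the subquery must return every column of Blogs2,
-- in table-definition order and with compatible types.
INSERT INTO Blogs2
SELECT * FROM Blogs1;
```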

Basic proposals:

// Variant 1: begin from the source table, flow data through the LINQ operators to the destination.
// The naming corresponds to the SQL (`INSERT INTO`), no ambiguity with change-tracking operations on DbSet (like with Update).
ctx.Blogs1.Where(...).InsertInto(ctx.Blogs2, b => b.Name)

// Variant 1.1: Naming-wise, we could make it extra-explicit that it's a bulk operation.
// Also more consistent with BulkDelete/BulkUpdate (with or without `Into` suffix).
// May be slightly ambiguous with bulk import (from client).
ctx.Blogs1.Where(...).BulkInsert(ctx.Blogs2, b => new { b.Name, b.Url })

// Variant 2: we can flip the order, but this adds nesting which seems unnecessary:
ctx.Blogs2.InsertFrom(ctx.Blogs1.Where(...), b => new { b.Name, b.Url })

Static column compatibility checking

It would be great to statically enforce that the column list matches the incoming columns from the source table, e.g. with the following signature:

public static void InsertInto<TSource, TDestination>(
    this IQueryable<TSource> source,
    DbSet<TDestination> destination,
    Expression<Func<TDestination, TSource>> columnSelector)
    where TDestination : class
{
}

This works great for a single column:

ctx.Blogs.Select(b => b.Name).InsertInto(ctx.Customers, c => c.Foo1);

With multiple columns, this fails to compile if the anonymous types' member names differ (anonymous types with different member names are different types):

ctx.Blogs.Select(b => new { b.Name, b.Url }).InsertInto(ctx.Customers, c => new { c.Foo1, c.Foo2 });

Requiring the source's and column's anonymous types to have the same field names seems... problematic (we really do want to project across different columns).

If we had value tuples in expression trees (yet again), this would work quite well:

ctx.Blogs.Select(b => (b.Name, b.Url)).InsertInto(ctx.Customers, c => (c.Foo1, c.Foo2));

In any case, if we don't want this to depend on value tuple syntax, we could give up compile-time enforcement with the following signature:

public static void InsertInto<TSource, TDestination, TColumns>(
    this IQueryable<TSource> source,
    DbSet<TDestination> destination,
    Expression<Func<TDestination, TColumns>> columnSelector)
    where TDestination : class
{
}

... and the query would fail at runtime if things are mismatched.

Finally, note that if we want to, we could have a specific overload for copying between tables mapped to shared type entity types - in this case no column list is necessary:

ctx.Blogs1.InsertInto(ctx.Blogs2);

Fancier examples

// With navigation:
ctx.Blogs1
    .Where(b => b.Posts.Any())
    .Select(b => new { Blog = b, FirstPost = b.Posts.OrderBy(p => p.Popularity).First() })
    .Select(x => new { x.Blog.Url, x.Blog.Name, x.FirstPost.Title, x.FirstPost.Author })
    .InsertInto(ctx.Foo, f => new { f.Url, f.Name, f.Title, f.Author });

// Fetching generated columns back (RETURNING/OUTPUT):
var ids = ctx.Customers
    .Select(...)
    .InsertInto(
        ctx.Blogs,
        columnSelector: b => new { b.Name, b.Url },
        returningSelector: b => b.Id)
    .ToList();

// Insert or ignore:
ctx.Customers.Select(...).InsertIntoOrIgnore(
    ctx.Blogs2,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name);

// Insert or update (UPSERT). The update action can access both the existing row and the excluded (which we attempted to insert):
ctx.Customers.Select(...).InsertIntoOrUpdate(
    ctx.Blogs,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name,
    updateAction: (existing, excluded) => new Blog
    {
        Name = existing.Name + "_updated",
        Url = excluded.Url
    });

// The same UPSERT with generated column fetch:
ctx.Customers.Select(...).InsertIntoOrUpdate(
    ctx.Blogs,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name,
    updateAction: (existing, excluded) => new Blog
    {
        Name = existing.Name + "_updated",
        Url = excluded.Url
    },
    returningSelector: b => b.Id);
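As a sketch of what the two upsert flavors above could translate to on PostgreSQL (hypothetical translation; the actual SQL would be provider-specific, e.g. MERGE on SQL Server):

```sql
-- InsertIntoOrIgnore:
INSERT INTO "Blogs2" ("Name", "Url")
SELECT ... FROM "Customers"
ON CONFLICT ("Name") DO NOTHING;

-- InsertIntoOrUpdate, where EXCLUDED refers to the row we attempted to insert,
-- with RETURNING fetching back the generated key:
INSERT INTO "Blogs" ("Name", "Url")
SELECT ... FROM "Customers"
ON CONFLICT ("Name") DO UPDATE
    SET "Name" = "Blogs"."Name" || '_updated',
        "Url" = EXCLUDED."Url"
RETURNING "Id";
```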


// With sproc as input on SQL Server (generates `INSERT ... EXECUTE`):
ctx.SprocSet.InsertInto(ctx.DbSetMappedToSproc, b => new { ... });

// With common table expression (depends on #26486):
EF.Functions.With(...).InsertInto(ctx.Blogs, b => new { ... });

Additional notes

  • SQL Server supports INSERT ... EXECUTE for copying the results of a sproc.
  • All providers support WITH with INSERT (WITH ... AS x INSERT INTO Y SELECT * FROM x) - this is important for recursive WITH queries. PostgreSQL even allows capturing the results of an UPDATE ... RETURNING with WITH, and inserting them into a table.
  • Just like with bulk update/delete, INSERT supports RETURNING/OUTPUT which returns columns; this is useful to get back auto-generated columns which aren't updated by the query (e.g. the ID). So like for update/delete, we'd have an overload which accepts an expression to determine the columns to return, and returns an IQueryable with those columns.
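For the last point, the RETURNING flavor of INSERT ... SELECT could look like this on PostgreSQL (OUTPUT on SQL Server); a sketch with assumed column names:

```sql
INSERT INTO "Blogs" ("Name", "Url")
SELECT "Name", "Url" FROM "Customers"
RETURNING "Id";
```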

Documentation

Community implementations

@roji
Member Author

roji commented Feb 2, 2022

Updated the above with both fetching back generated columns (RETURNING/OUTPUT), and how UPSERT would fit into this API ("bulk upsert").

UPSERT implementations across databases generally accept the following information:

  1. Conflict action - either ignore or update. We should probably have two methods to represent this (InsertIntoOrIgnore, InsertIntoOrUpdate)
  2. Columns that make up the uniqueness check, or match condition (usually expected to correspond to a unique constraint). In some cases (PG, SQLite) the constraint name can be provided directly.
    • MySQL doesn't require or support this - violation of any unique constraint always triggers the update action (it's discouraged to use UPSERT in MySQL on tables with more than one unique constraint). MySQL can simply ignore this parameter (and even introduce an overload without it).
  3. If doing insert or update, the SET part - this is very similar to how bulk update works (UPDATE ... SET ...). Note that in most databases it's possible for SET to refer to both the existing row and the excluded incoming one.
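To make point 2 concrete, the conflict-target clause differs per database (simplified sketch with made-up table t and columns a, b):

```sql
-- PostgreSQL / SQLite: explicit conflict target (PG also accepts a constraint name):
INSERT INTO t (a, b) VALUES (1, 2)
ON CONFLICT (a) DO UPDATE SET b = excluded.b;

-- MySQL: no conflict target; violating any unique constraint triggers the update:
INSERT INTO t (a, b) VALUES (1, 2)
ON DUPLICATE KEY UPDATE b = VALUES(b);

-- SQL Server: MERGE with an explicit match condition:
MERGE t USING (VALUES (1, 2)) AS s (a, b) ON t.a = s.a
WHEN MATCHED THEN UPDATE SET b = s.b
WHEN NOT MATCHED THEN INSERT (a, b) VALUES (s.a, s.b);
```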

@AndriySvyryd
Member

Looks good. We'll probably need overloads for types without a DbSet:

ctx.Blogs1.Where(...).InsertInto<Blogs2>(b => b.Name);
ctx.Blogs1.Where(...).InsertInto<Blogs2>("STETBlog", b => b.Name);

@borisdj

borisdj commented Feb 3, 2022

Just to add to the list: EFCore.BulkExtensions now has a SourceTable feature with the config option CustomSourceTableName, and also optionally supports a CustomSourceDestinationMappingColumns dictionary.
Usage example:
context.BulkInsert(new List<Item>(), b => b.CustomSourceTableName = "ItemBase");

@roji
Member Author

roji commented Oct 1, 2022

Another use case that may be covered here is having client-provided data as the source, e.g. provide an array of .NET instances and upsert them into the target table in the database. With MERGE this looks like this:

MERGE TheTable t
USING @data d
ON t.Id = d.Id
WHEN MATCHED THEN 
    UPDATE SET [Text] = d.[Text]
WHEN NOT MATCHED THEN 
    INSERT (Id, [Text]) VALUES (d.[Id], d.[Text]); 

The API shape above may not be suitable for this, since it starts with a DbSet as the source, but we want to send client data. An interesting idea is to add an API to create a (temporary) DbSet out of a collection of client-side instances, which we can then e.g. upsert into a table in the database.

Note that for SQL Server, the above requires a custom type, which is a problem (but PG should support this without one, as tables have types created for them implicitly).
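A sketch of what such a client-data API could look like (every name here is hypothetical — nothing like this exists in EF Core today):

```csharp
// Client-side instances we want to upsert into the database.
var data = new[]
{
    new Blog { Id = 1, Name = "one" },
    new Blog { Id = 2, Name = "two" }
};

// Hypothetical TemporarySet API wrapping client data as a queryable source;
// on SQL Server this would be backed by a table-valued parameter (which needs
// a custom type), on PostgreSQL by e.g. an unnest/VALUES source.
ctx.TemporarySet(data).InsertIntoOrUpdate(
    ctx.Blogs,
    insertColumnSelector: b => new { b.Id, b.Name },
    uniquenessColumnSelector: b => b.Id,
    updateAction: (existing, excluded) => new Blog { Name = excluded.Name });
```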

@roji roji changed the title Copy data across tables in bulk (INSERT ... SELECT) Copy/upsert data across tables in bulk (INSERT ... SELECT) Oct 1, 2022
@AraHaan
Member

AraHaan commented Dec 30, 2022

Question: would this proposed API allow inserting items (from any LINQ query) in bulk without tracking them? I have run into such a use case when running my Discord bot on a 2 GB RAM Ubuntu VPS; it ends up OOMing fast (I think the change tracker contributes to the problem, on top of Remora.Discord's caching).

@roji
Member Author

roji commented Dec 30, 2022

@AraHaan no - this is about copying data from one table to another. Bulk import from the client is covered by #27333.

In the meantime, you can drop down to your driver's low-level import mechanism (e.g. SqlBulkCopy for SQL Server).
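For reference, a minimal SqlBulkCopy sketch (Microsoft.Data.SqlClient; the destination table, its columns, and connectionString are assumptions for the example):

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

// Build an in-memory table whose columns match the destination schema.
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Url", typeof(string));
table.Rows.Add("my blog", "http://example.com");

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

// Stream the rows into the destination table, bypassing EF's change tracker.
using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.Blogs"
};
await bulkCopy.WriteToServerAsync(table);
```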


@stevenpua

stevenpua commented Jan 10, 2024

Greetings all! How do we upvote this? This would make a lot of our app database-provider-agnostic; we would no longer need to use ExecuteSql once this is ready. How will this work with the non-SQL providers (or will there be no support for those)?

Leaving this comment to upvote this.

@roji
Member Author

roji commented Jan 10, 2024

@stevenpua you upvote issues by upvoting the top-most comment above (the issue description) - posting comments is not something that helps us prioritize, and creates needless notifications and churn.

@roji
Member Author

roji commented Jan 10, 2024

Re non-SQL providers: the API here will likely be abstract and implementable by any provider type, so at least in principle, any database that supports a copy-across-tables mechanism would be covered. However, that's something non-SQL databases don't typically implement.
