
Add support for appending data to external tables - CSV #6526

Merged
merged 44 commits into apache:main from feature/listing_table_insert_into_support2 on Jun 6, 2023

Conversation

@mustafasrepo (Contributor)

Which issue does this PR close?

Closes #

Rationale for this change

This PR adds support for the following SQL queries:

CREATE EXTERNAL TABLE source_table (
    a1  VARCHAR NOT NULL,
    a2  INT NOT NULL
)
STORED AS CSV
WITH HEADER ROW
OPTIONS ('UNBOUNDED' 'TRUE')
LOCATION '{source}';

CREATE EXTERNAL TABLE sink_table (
    a1  VARCHAR NOT NULL,
    a2  INT NOT NULL
)
STORED AS CSV
WITH HEADER ROW
OPTIONS ('UNBOUNDED' 'TRUE')
LOCATION '{sink}';

INSERT INTO sink_table
SELECT a1, a2 FROM source_table;

This PR adds support for appending data to external tables; previously, appending was only supported for memory tables. It introduces new structs and modifications to existing structs, enabling users to efficiently work with file-based storage systems when appending data.

What changes are included in this PR?

  • Added the FileSinkConfig struct for the base configuration used when creating a physical plan for any given file format.
  • Added FileWriterExt to handle writing record batches to a file-like output.
  • Added the CsvSink struct, which implements DataSink to write results to a CSV file (a rough sketch of the write loop follows this list).
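
As a rough sketch, not the PR's exact code (the real DataSink trait and CsvSink differ in details), the core job of the sink is to drain a stream of record batches, serialize each one to CSV, and push the resulting bytes toward storage:

use arrow::record_batch::RecordBatch;
use futures::{stream::BoxStream, StreamExt};

// Hypothetical standalone function mirroring what a CSV sink must do.
async fn write_all_csv(
    mut batches: BoxStream<'static, RecordBatch>,
) -> std::io::Result<u64> {
    let mut row_count = 0u64;
    let mut buffer = Vec::new();
    {
        // The arrow CSV writer serializes each batch into `buffer`.
        let mut writer = arrow::csv::Writer::new(&mut buffer);
        while let Some(batch) = batches.next().await {
            row_count += batch.num_rows() as u64;
            writer
                .write(&batch)
                .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
        }
    } // writer dropped; `buffer` now holds the CSV bytes
    // Hand `buffer` to the object store (put / put_multipart / append) here.
    Ok(row_count)
}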

Are these changes tested?

Yes

Are there any user-facing changes?

This change allows users to append data to external tables, which was not possible before. Users can now work with file-based storage systems more efficiently, especially when appending data.

metesynnada and others added 30 commits April 14, 2023 10:51
@ozankabak (Contributor)

I have been looking forward to this for a while. @metesynnada did a great job starting this work and @mustafasrepo took it to the finish line!

@alamb added the api change label (Changes the API exposed to users of the crate) on Jun 4, 2023
@alamb (Contributor) left a comment

Thank you @mustafasrepo and @metesynnada and @ozankabak -- this is really nice

I went through it carefully, and while I had some small suggestions, I also think this could be merged as is and iterated on from there.

Prior to review, I did not appreciate that this PR lays the foundation for streaming (multi-part) writes to object store ❤️

I think the FileSinkConfig is an excellent idea and follows the existing FileScanConfig pattern 👍 It also seems able to extend nicely to writing multiple file partitions (eventually), which is excellent. A sketch of what such a config might carry follows.
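
For illustration only, a config of this kind might carry fields along these lines (the names here are guesses for exposition, not the PR's exact definition):

use arrow::datatypes::SchemaRef;

// Illustrative write modes; see the discussion later in this thread.
enum FileWriterMode {
    Put,
    PutMultipart,
    Append,
}

// Illustrative only; see the PR for the real FileSinkConfig.
struct FileSinkConfigSketch {
    /// URL of the object store to write to, e.g. "file://" or "s3://bucket"
    object_store_url: String,
    /// Output file path(s) within the store
    file_paths: Vec<String>,
    /// Schema of the data being written
    output_schema: SchemaRef,
    /// How bytes reach storage: a single put, a multipart upload, or an append
    writer_mode: FileWriterMode,
}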

Here is a list of potential follow-on work (some/all of which I would like to help with). If you agree, I can file tickets and help with this too, as we would love to have streaming write support in IOx as well.

  • End to end sql level tests in sqllogictests (I can do this)
  • Tests for streaming writes
  • Tests for abort behavior (making sure all writers are canceled and the correct error is returned)
  • Implement something similar for JSON and parquet.
  • Documentation / note that we support streaming multi-part writes

&mut self,
cx: &mut Context<'_>,
) -> Poll<std::result::Result<(), Error>> {
loop {
Contributor

My reading of https://docs.rs/tokio/1.28.2/tokio/io/trait.AsyncWrite.html#tymethod.poll_shutdown implies that, by writing on shutdown, this writer as written will buffer everything before starting a write.

This is probably fine for the initial version, but it is something to improve in the future.
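
To make the observation concrete, here is a minimal sketch (assuming tokio's AsyncWrite; this is not the PR's code) of a writer whose poll_write only fills a memory buffer, so nothing reaches storage until poll_shutdown:

use std::io::Error;
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::AsyncWrite;

struct BufferedPutWriter {
    buffer: Vec<u8>, // all writes accumulate here
    uploaded: bool,
}

impl AsyncWrite for BufferedPutWriter {
    fn poll_write(
        mut self: Pin<&mut Self>,
        _cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<Result<usize, Error>> {
        // Purely in-memory, so a write always completes immediately.
        self.buffer.extend_from_slice(buf);
        Poll::Ready(Ok(buf.len()))
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Result<(), Error>> {
        Poll::Ready(Ok(())) // nothing is sent before shutdown
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Result<(), Error>> {
        // Only here would the whole buffer be handed to the store in one
        // `put`-style call; a real impl would poll the in-flight upload.
        if !self.uploaded {
            self.uploaded = true;
            // e.g. start and poll: object_store.put(&path, buffer_bytes)
        }
        Poll::Ready(Ok(()))
    }
}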

/// `AsyncPutWriter` is an object that facilitates asynchronous writing to object stores.
/// It is specifically designed for the `object_store` crate's `put` method and sends
/// whole bytes at once when the buffer is flushed.
pub struct AsyncPutWriter {
Contributor

@tustvold and/or @crepererum (my go to people for rust async / stream expertise) I wonder if you have some time to review this code that adapts DataFusion to use the object_store put multi part features to do streaming writes

Pretty exciting!

Contributor

I'm confused to see this here; we already provide AsyncWrite from ObjectStore, so I'm not sure why we are re-implementing the buffering here? Am I missing something?

@metesynnada (Contributor) commented on Jun 6, 2023

We hypothesize that the consistent use of put_multipart for every put operation might adversely impact the cloud side, as it anticipates files exceeding a specific size (for example, 5MB for AWS). To mitigate this, we've developed a wrapper for the put operation that standardizes the write operation on AsyncWrite. Do you have a suggestion to simplify here?

@tustvold (Contributor) commented on Jun 6, 2023

Oh yes, put_multipart is 100% overkill for most use-cases. My suggestion was to type-erase at the level of the BatchSerializer and then have different impls for the different write modes. The async Read + abort interface feels a tad over-complicated, at least imo.
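
A rough illustration of that suggestion (names hypothetical, not DataFusion's actual API at the time): a serializer trait turns each RecordBatch into bytes, and each write mode then decides independently how those bytes reach storage.

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use bytes::Bytes;

trait BatchSerializer: Send {
    /// Serialize one batch into raw bytes (e.g. CSV lines).
    fn serialize(&mut self, batch: &RecordBatch) -> Result<Bytes, ArrowError>;
}

struct CsvSerializerSketch;

impl BatchSerializer for CsvSerializerSketch {
    fn serialize(&mut self, batch: &RecordBatch) -> Result<Bytes, ArrowError> {
        // Note: a real impl would write the CSV header only once, not per batch.
        let mut buf = Vec::new();
        arrow::csv::Writer::new(&mut buf).write(batch)?;
        Ok(buf.into())
    }
}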

@ozankabak (Contributor)

> Here is a list of potential follow-on work (some/all of which I would like to help with). If you agree, I can file tickets and help with this too, as we would love to have streaming write support in IOx as well.

This sounds great! I checked in the comment improvements per your suggestions, @metesynnada and/or @mustafasrepo will shortly go through your other points. Thanks for the review.

@github-actions bot added the sqllogictest label (SQL Logic Tests (.slt)) on Jun 5, 2023
@tustvold (Contributor) left a comment

Had a quick review. I think this is probably fine, but type-erasing the writer mode seems a little peculiar to me.

This is because each of the methods has rather different characteristics and, imo, warrants writing in a different manner.

Put

The write is completely synchronous (it is writing to memory) and is then atomically flushed, with no need for abort behaviour or async write. All file formats can support this mode.

Put Multipart

The write is async with a final atomic close. Requires custom abort logic. All file formats can support this mode.

Append

Abort is fatal (not even entirely sure how to surface this), and it is only supported by row-oriented file formats. Even then it requires custom handling for things like CSV headers, etc.

Proposal

I guess my proposal would be to simply add a match block within DataSink for each of the various FileWriterModes. Over time I expect we will be able to extract common logic for each of the variants, e.g. a generic Put version using a RecordBatchWriter, etc., but I'm not sure that trying to unify all the writer modes is a good abstraction, and at this stage, where we only have one impl, it seems a touch premature perhaps?
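
A sketch of that dispatch (the enum variants mirror the three modes above; the bodies are placeholders, not the PR's implementation):

// Illustrative; the actual FileWriterMode lives in the PR.
enum FileWriterMode {
    /// Buffer in memory, then one atomic `put`; no abort logic needed.
    Put,
    /// Streaming upload with a final atomic close; needs custom abort logic.
    PutMultipart,
    /// Append to an existing file; abort is effectively fatal, and only
    /// row-oriented formats (e.g. CSV) can support it.
    Append,
}

fn create_writer(mode: &FileWriterMode) {
    // Inside the sink, dispatch once on the mode instead of type-erasing it:
    match mode {
        FileWriterMode::Put => { /* serialize to memory, flush via put */ }
        FileWriterMode::PutMultipart => { /* object_store put_multipart + abort */ }
        FileWriterMode::Append => { /* open for append, write incrementally */ }
    }
}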


impl<W: AsyncWrite + Unpin + Send> FileWriterExt for AsyncPut<W> {}

/// An extension trait for `AsyncWrite` types that adds an `abort_writer` method.
Contributor

What do you think about, instead of using a trait to add abort, just having a struct like:

struct AbortableWrite<W> {
    write: W,
    abort: Option<Box<dyn FnOnce() -> BoxFuture<'static, Result<()>>>>
}

@mustafasrepo (Contributor, Author)

> Had a quick review. I think this is probably fine, but type-erasing the writer mode seems a little peculiar to me.
>
> This is because each of the methods has rather different characteristics and, imo, warrants writing in a different manner.
>
> Put
>
> The write is completely synchronous (it is writing to memory) and is then atomically flushed, with no need for abort behaviour or async write. All file formats can support this mode.
>
> Put Multipart
>
> The write is async with a final atomic close. Requires custom abort logic. All file formats can support this mode.
>
> Append
>
> Abort is fatal (not even entirely sure how to surface this), and it is only supported by row-oriented file formats. Even then it requires custom handling for things like CSV headers, etc.
>
> Proposal
>
> I guess my proposal would be to simply add a match block within DataSink for each of the various FileWriterModes. Over time I expect we will be able to extract common logic for each of the variants, e.g. a generic Put version using a RecordBatchWriter, etc., but I'm not sure that trying to unify all the writer modes is a good abstraction, and at this stage, where we only have one impl, it seems a touch premature perhaps?

Your proposal makes sense to me. I have removed FileWriterFactory; now the writer is created in the CsvSink. If in the future the need for a trait arises, we can do so.

@tustvold (Contributor) left a comment

Thank you. I would have gone further and removed FileWriterExt in favour of just a type-erased BatchSerializer, but we can always continue to iterate on this design.

@mustafasrepo (Contributor, Author)

> Thank you. I would have gone further and removed FileWriterExt in favour of just a type-erased BatchSerializer, but we can always continue to iterate on this design.

I have removed FileWriterExt. We now use AbortableWrite<W>, which increases code reuse. Thanks for the suggestion.

/// A wrapper around an `AsyncWrite` type that provides append functionality.
pub struct AsyncAppend<W: AsyncWrite + Unpin + Send> {
/// A wrapper struct with abort method and writer
struct AbortableWrite<W: AsyncWrite + Unpin + Send> {
Contributor

👍 I think this construction makes the intent of the code much clearer. Thank you
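
For reference, a minimal sketch of how such a wrapper can expose abort (simplified; the PR's actual AbortableWrite differs in detail):

use futures::future::BoxFuture;
use tokio::io::AsyncWrite;

struct AbortableWriteSketch<W: AsyncWrite + Unpin + Send> {
    writer: W,
    // Cleanup to run if the write must be rolled back (e.g. aborting a
    // multipart upload); `None` means the mode has nothing to roll back.
    abort: Option<Box<dyn FnOnce() -> BoxFuture<'static, std::io::Result<()>> + Send>>,
}

impl<W: AsyncWrite + Unpin + Send> AbortableWriteSketch<W> {
    /// Run the abort action, if any; called when execution fails mid-write.
    async fn abort_writer(&mut self) -> std::io::Result<()> {
        match self.abort.take() {
            Some(abort) => abort().await,
            None => Ok(()), // e.g. append-style writes cannot be undone
        }
    }
}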

@alamb (Contributor) commented on Jun 6, 2023

I am going to merge this PR so we can continue work on main. I am really excited to see this progress. Thank you @mustafasrepo @metesynnada @ozankabak and @tustvold

@alamb merged commit 36292f6 into apache:main on Jun 6, 2023
@alamb mentioned this pull request on Jun 6, 2023
@alamb (Contributor) commented on Jun 6, 2023

I filed #6569 to start tracking next steps for the streaming write API support

///
/// Prints in the format:
/// ```text
/// [file1, file2,...]
Contributor

@mustafasrepo could you comment on why this doesn't match the implementation? I don't see the trailing "..." there. FileGroupsDisplay uses 5 as the largest number of groups shown; probably we should also limit this list to 5? Maybe having an explicit constant could be useful?
I found this conflicting with my attempt to implement #6383.

@mustafasrepo (Contributor, Author)

You are right; the functionality should match the documentation and display ... after a limit.
I think I lost this functionality during a refactor, but I am not sure. We can add a test for this use case so the functionality isn't lost again in the future.
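
As a sketch of the fix being discussed (the constant and helper names here are hypothetical; #6637 carries the actual change):

const MAX_FILES_SHOWN: usize = 5;

fn fmt_file_list(files: &[&str]) -> String {
    // Show at most MAX_FILES_SHOWN entries, then a trailing "..." marker.
    let head = files
        .iter()
        .take(MAX_FILES_SHOWN)
        .copied()
        .collect::<Vec<_>>()
        .join(", ");
    let tail = if files.len() > MAX_FILES_SHOWN { ", ..." } else { "" };
    format!("[{head}{tail}]") // e.g. "[f1, f2, f3, f4, f5, ...]"
}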

Contributor

Will you have time to do this, or should I try?

@mustafasrepo (Contributor, Author)

I won't be able to look into this for 2-3 days. In the meantime, if you want to do this, please go ahead.

Contributor

Thanks, I'll try to find a couple of hours over the weekend and will ping you if there is something to share.

Contributor

PR created - #6637

@mustafasrepo deleted the feature/listing_table_insert_into_support2 branch on June 13, 2023 at 05:41
Labels: api change (Changes the API exposed to users of the crate), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt))