
Extend insert into support to include Json backed tables #7212

Merged
merged 6 commits into apache:main on Aug 8, 2023

Conversation

devinjdangelo
Contributor

Which issue does this PR close?

None, but it progresses towards the goals of #5076 and #7079. Follow-on to #7141.

Rationale for this change

Adds support for insert into <table> for tables backed by JSON files.

What changes are included in this PR?

  • Implements JsonSink in a similar fashion to CsvSink (see the sketch after this list)
  • Minor refactor of CsvSink to support code reuse with JsonSink
  • Generalized the existing insert into tests so they are easily extensible to additional FileFormats and options
  • Added test coverage for appending to existing JSON files and for appending new JSON files to a ListingTable
  • Added checks that raise an error when attempting to insert into a sorted or compressed table, since that is not yet implemented
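
For illustration, here is a minimal sketch of the kind of per-batch serialization a JsonSink performs, using arrow's line-delimited JSON writer. This is not the PR's actual code, and the helper name serialize_batch_to_json is made up for this example.

use arrow::error::ArrowError;
use arrow::json::LineDelimitedWriter;
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: serialize a single RecordBatch into newline-delimited
/// JSON bytes. Each batch is handled independently of the others, which is
/// what lets a shared, stateless write path serve both CSV and JSON sinks.
fn serialize_batch_to_json(batch: &RecordBatch) -> Result<Vec<u8>, ArrowError> {
    let mut buf = Vec::new();
    {
        // LineDelimitedWriter emits one JSON object per row, one row per line.
        let mut writer = LineDelimitedWriter::new(&mut buf);
        writer.write(batch)?;
        writer.finish()?;
    }
    Ok(buf)
}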

Are these changes tested?

Yes

Are there any user-facing changes?

Insert into a JSON-backed table will now work.
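
As a rough usage illustration (the table name and file path below are hypothetical, and this is not taken from the PR's tests), appending to a JSON-backed listing table through the SQL interface might look like this:

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register an existing newline-delimited JSON file as a table
    // ("/tmp/my_table/data.json" is a placeholder path).
    ctx.register_json("my_table", "/tmp/my_table/data.json", NdJsonReadOptions::default())
        .await?;

    // With this change, INSERT INTO appends the new rows to the JSON-backed table.
    let df = ctx.sql("INSERT INTO my_table VALUES (1, 2), (3, 4)").await?;
    df.collect().await?;

    Ok(())
}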

github-actions bot added the core (Core DataFusion crate) label on Aug 6, 2023
@metesynnada (Contributor) left a comment

Overall looks good. I appreciate your hard work. However, I have made some comments regarding the changes.

/// Serialization is assumed to be stateless, i.e.
/// each RecordBatch can be serialized without any
/// dependency on the RecordBatches before or after.
async fn stateless_serialize_and_write_files(
Contributor

It makes sense to consolidate these into a unified approach.

Contributor

I agree -- and I think that will mean when we parallelize the logic more all the writers will benefit
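
As a loose sketch of what such a unified approach could look like (the trait name and signature here are assumptions for illustration, not necessarily what DataFusion defines), a shared per-batch serializer trait would let CsvSink, JsonSink, and future sinks reuse one write loop and later serialize batches in parallel:

use arrow::record_batch::RecordBatch;
use async_trait::async_trait;
use datafusion::error::Result;

/// Hypothetical shared abstraction: serialize one RecordBatch at a time,
/// with no state carried between batches, so the surrounding write loop
/// (and any future parallelization) stays format-agnostic.
#[async_trait]
trait BatchSerializer: Send {
    async fn serialize(&mut self, batch: RecordBatch) -> Result<Vec<u8>>;
}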

devinjdangelo and others added 2 commits August 7, 2023 06:45
Co-authored-by: Metehan Yıldırım <100111937+metesynnada@users.noreply.github.com>
@@ -608,17 +592,17 @@ impl DataSink for CsvSink {
))
}
FileWriterMode::PutMultipart => {
//currently assuming only 1 partition path (i.e. not hive style partitioning on a column)
// Currently assuming only 1 partition path (i.e. not hive-style partitioning on a column)
Contributor

@alamb If this is OK for you, overall LGTM.

@alamb (Contributor) commented on Aug 7, 2023

Thank you -- I quickly skimmed this PR and it looks great @devinjdangelo -- thank you for the review @metesynnada. I will take a closer look tomorrow morning.

@alamb (Contributor) left a comment

The code looks great to me -- thank you @devinjdangelo and @metesynnada for the review. I tried it out locally and it was 👌 very nice.

It is somewhat awkward at the moment to use this feature, as you can't create new tables, only append to existing ones:

$ mkdir /tmp/my_table
❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.

❯ insert into my_table values (1,2), (3, 4);
Error during planning: Cannot append 1 partitions to 0 files!

I filed #7228 to track improving this


.map_err(|e| DataFusionError::Internal(e.to_string()))?;

// Read the records in the table
let batches = session_ctx.sql("select * from t").await?.collect().await?;
Contributor

👍

@alamb (Contributor) commented on Aug 8, 2023

Thanks again!

alamb merged commit 3d917a0 into apache:main on Aug 8, 2023
@alamb (Contributor) commented on Aug 8, 2023

Here is a small follow-on to reduce some duplication: #7229
