Extend insert into support to include Json backed tables #7212
Overall looks good. I appreciate your hard work. However, I have made some comments regarding the changes.
/// Serialization is assumed to be stateless, i.e.
/// each RecordBatch can be serialized without any
/// dependency on the RecordBatches before or after.
async fn stateless_serialize_and_write_files(
It makes sense to consolidate these into a unified approach.
I agree -- and I think it means that when we parallelize the logic further, all the writers will benefit.
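To illustrate why the stateless assumption matters for parallelism, here is a minimal std-only sketch. It is not the code from this PR: `Batch`, `serialize_batch`, `serialize_sequential`, and `serialize_parallel` are hypothetical stand-ins (a real implementation would work on Arrow `RecordBatch`es and async writers). Because each batch is serialized from its own contents only, the work can be fanned out to threads and the results concatenated in order without changing the output.

```rust
use std::thread;

/// Hypothetical stand-in for a RecordBatch: a list of (x, y) rows.
#[derive(Clone, Debug)]
struct Batch {
    rows: Vec<(i32, i32)>,
}

/// Serialize one batch to newline-delimited JSON.
/// Only the batch itself is read, so the function is stateless:
/// no header, footer, or running state is shared between batches.
fn serialize_batch(batch: &Batch) -> String {
    batch
        .rows
        .iter()
        .map(|(x, y)| format!("{{\"x\":{},\"y\":{}}}\n", x, y))
        .collect()
}

/// Serialize batches one after another on the current thread.
fn serialize_sequential(batches: &[Batch]) -> String {
    batches.iter().map(serialize_batch).collect()
}

/// Serialize every batch on its own thread, then join the results
/// in the original batch order.
fn serialize_parallel(batches: Vec<Batch>) -> String {
    let handles: Vec<_> = batches
        .into_iter()
        .map(|b| thread::spawn(move || serialize_batch(&b)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("serializer thread panicked"))
        .collect()
}
```

A format that needs state across batches (for example, a file-level header or footer) would not fit this pattern unchanged, which is presumably why the doc comment states the assumption explicitly.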
Co-authored-by: Metehan Yıldırım <100111937+metesynnada@users.noreply.github.com>
@@ -608,17 +592,17 @@ impl DataSink for CsvSink {
 ))
 }
 FileWriterMode::PutMultipart => {
- //currently assuming only 1 partition path (i.e. not hive style partitioning on a column)
+ // Currently assuming only 1 partition path (i.e. not hive-style partitioning on a column)
@alamb If this is OK for you, overall LGTM.
Thank you -- I quickly skimmed this PR and it looks great @devinjdangelo -- thank you for the review @metesynnada. I will take a closer look tomorrow morning.
The code looks great to me -- thank you @devinjdangelo and @metesynnada for the review. I tried it out locally and it was 👌 very nice.
Using this feature is somewhat awkward at the moment because you can't create new tables, only append to existing ones:
$ mkdir /tmp/my_table
❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.
❯ insert into my_table values (1,2), (3, 4);
Error during planning: Cannot append 1 partitions to 0 files!
I filed #7228 to track improving this
.map_err(|e| DataFusionError::Internal(e.to_string()))?;

// Read the records in the table
let batches = session_ctx.sql("select * from t").await?.collect().await?;
👍
Thanks again!
Here is a small follow-on to reduce some duplication: #7229
Which issue does this PR close?
None, but progresses towards the goals of #5076 and #7079. Follow on to #7141.
Rationale for this change
Adds support for insert into <table> for tables which are backed by Json files.
What changes are included in this PR?
JsonSink in similar fashion to CsvSink
CsvSink to support code reuse with JsonSink
insert into to be easily extensible for additional FileFormats and options
ListingTable
Are these changes tested?
Yes
Are there any user-facing changes?
Insert into Json table will work now.