Implement Support for Copy To Logical and Physical plans #7283
Conversation
Thank you @devinjdangelo -- this looks epic -- I plan to review it tomorrow
```
----
2

#Explain copy queries not currently working
```
I noticed that `EXPLAIN <copy statement>` currently does not work. When prefixed by `EXPLAIN`, the subsequent `COPY` token is parsed into the `COPY` statement defined in the `sqlparser` crate. We haven't implemented a logical plan for that, and our expected syntax for `COPY` differs from `sqlparser`'s, so this leads to various errors.

When the `COPY` token starts the statement, it is correctly parsed as a `DFStatement` defined within DataFusion.
Here is where the parsing diverges. We might need to add a special case for parsing `explain copy`, but I'm not sure if there is a better way.
I think a special case for explain copy is probably the right thing
BTW I coded up support for EXPLAIN COPY in #7291
🏆 work @devinjdangelo -- it is amazing to see COPY finally working in DataFusion.
I tried this out locally and it worked great 👍 -- thank you so much!
In terms of what to do with this PR, I have a few comments, but nothing that I think is required prior to merge, so I am approving it.
Here are some of the tickets:
- Add documentation for copy statement
- Implement additional parameters
- Explain support
I am happy to file these if you would like. Just let me know.
Example

```
DataFusion CLI v29.0.0
❯ copy (values (1), (2)) to '/tmp/foo.parquet';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.050 seconds.
❯ \q
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2/datafusion-cli$ file /tmp/foo.parquet
/tmp/foo.parquet: Apache Parquet
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2/datafusion-cli$ datafusion-cli -c "select * from '/tmp/foo.parquet'";
DataFusion CLI v28.0.0
+---------+
| column1 |
+---------+
| 1       |
| 2       |
+---------+
2 rows in set. Query took 0.041 seconds.
```
This is so cool
```
❯ copy (select * from '/Users/alamb/Software/clickbench_hits_compatible/hits.parquet' limit 100) to '/tmp/hits.10.parquet';
+-------+
| count |
+-------+
| 100   |
+-------+
1 row in set. Query took 2.809 seconds.
```
.gitignore (Outdated)

```
@@ -104,3 +104,6 @@ datafusion/CHANGELOG.md.bak

 # Generated tpch data
 datafusion/core/tests/sqllogictests/test_files/tpch/data/*

 # Scratch temp dir for sqllogictests
 datafusion/core/tests/sqllogictests/test_files/scratch*
```
this will likely have a logical conflict with #7284, FYI
Yes, I rebased locally and then updated these relative paths throughout the PR to fix the pipeline.
```rust
};

// TODO: implement statement level overrides for each file type
// E.g. CsvFormat::from_options(options)
```
👍
```rust
@@ -58,10 +59,26 @@ pub async fn main() -> Result<()> {
    run_tests().await
}

/// Sets up an empty directory at tests/sqllogictests/test_files/scratch/
```
❤️ -- we should also probably add a note about `scratch` to https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/README.md eventually too.
In #7312
```
statement error DataFusion error: This feature is not implemented: `COPY \.\. TO \.\.` statement is not yet supported
COPY source_table to '/tmp/table.parquet' (row_group_size 55);

query IT
COPY source_table to 'tests/sqllogictests/test_files/scratch/table.json' (row_group_size 55);
```
row_group_size for json is somewhat surprising to me as I expect it is a parquet thing
Ah, setting that option on a JSON output with `row_group_size` is a mistake. It doesn't cause any issue because options are ignored right now.

This raises an interesting question about the desired behavior in this scenario. If the options specify an irrelevant setting (`row_group_size` for a JSON file), should DataFusion:
- Ignore the irrelevant setting (current behavior)
- Ignore the irrelevant setting but emit a warning
- Raise an error and refuse to execute the query entirely
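For illustration, up-front validation along the lines of the third option might look like the sketch below. The function name and the per-format allow-lists are assumptions for the example, not DataFusion's actual API or its real set of supported options.

```rust
// Illustrative sketch: reject a COPY query whose options include a key
// that is not applicable to the target file type, instead of silently
// ignoring it.
fn validate_options(file_type: &str, options: &[(String, String)]) -> Result<(), String> {
    // Hypothetical allow-lists per format, purely for demonstration.
    let allowed: &[&str] = match file_type {
        "parquet" => &["row_group_size", "compression"],
        "json" => &["compression"],
        "csv" => &["delimiter", "header", "compression"],
        _ => &[],
    };
    for (key, _) in options {
        if !allowed.contains(&key.as_str()) {
            return Err(format!(
                "option '{key}' is not supported for {file_type} output"
            ));
        }
    }
    Ok(())
}
```

With this approach, `row_group_size` on a JSON target would fail planning immediately rather than being dropped on the floor.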
I think raising an error and refusing to execute the query entirely is the most sensible thing -- I left a comment on #7298 (comment)
```rust
input: LogicalPlan,
output_url: String,
file_format: OutputFileFormat,
per_thread_output: bool,
```
Can we add some more documentation about what each of the parameters means (specifically `per_thread_output`)?

Also, do you envision other options here (like `overwrite` vs `append`)? If so, maybe it makes sense to make a struct like:

```rust
CopyOptions {
    output_url: String,
    file_format: OutputFileFormat,
    per_thread_output: bool,
    other: Vec<(String, String)>,
}
```

Making a config struct like that would not only allow additional options to be added without an API change, it would also provide a natural location to document the options and what they mean.
There is documentation for these parameters here:
https://github.com/devinjdangelo/arrow-datafusion/blob/27e062a880d8960a9aef4657bc7416d98c9a744c/datafusion/expr/src/logical_plan/dml.rs#L30-L43

I envision most additional options being passed in optionally via the options `Vec<(String, String)>`. I think the list supported by DuckDB would be a good starting point.

It might make sense to implement a `CopyOptions` struct as you show, with a `CopyOptions::from(Vec<(String, String)>)` method. Then, for the `DataFrame::write_*` API, the user could construct `CopyOptions` directly rather than passing a `Vec<(String, String)>`, which is an awkward interface to use directly from Rust code.
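A minimal sketch of that conversion idea is below; the struct fields and the recognized option key are illustrative assumptions, not the API that was ultimately merged.

```rust
// Hypothetical CopyOptions with a From impl that lifts known keys into
// typed fields and keeps everything else as raw key/value pairs for the
// format-specific sinks to interpret.
#[derive(Debug, Default)]
struct CopyOptions {
    per_thread_output: bool,
    other: Vec<(String, String)>,
}

impl From<Vec<(String, String)>> for CopyOptions {
    fn from(raw: Vec<(String, String)>) -> Self {
        let mut opts = CopyOptions::default();
        for (key, val) in raw {
            match key.as_str() {
                // Known option: parsed into a typed, documented field.
                "per_thread_output" => opts.per_thread_output = val == "true",
                // Unknown options: passed through untyped.
                _ => opts.other.push((key, val)),
            }
        }
        opts
    }
}
```

SQL planning can then convert the parsed option list via `CopyOptions::from(...)`, while Rust users of `DataFrame::write_*` construct the struct directly.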
Filed #7322 to track this idea
```rust
pub options: Vec<(String, String)>,
}

/// The file formats that CopyTo can output
```
This looks very similar to the existing `FileType` enum: https://docs.rs/datafusion/latest/datafusion/datasource/file_format/file_type/enum.FileType.html

Perhaps we could move `FileType` into `datafusion_common` so it could be used by both the logical plan and the datasource?
```rust
let mut op_str = String::new();
op_str.push('(');
for (key, val) in options {
    // op_str always contains the leading '(', so check its length rather
    // than is_empty() to decide whether a separating comma is needed
    if op_str.len() > 1 {
        op_str.push(',');
    }
    op_str.push_str(&format!("{key} {val}"));
}
op_str.push(')');
```
I think you could use `join()` here to be more concise: https://doc.rust-lang.org/std/primitive.slice.html#method.join

But then you would probably have to collect the (key, val) pairs into a separate Vec, resulting in another copy.
Maybe a `.map` + `.join` would do this more elegantly? I'll take another look at it.
Took a stab at making this more concise in #7294
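For illustration, a `.map` + `.join` version of the option-string formatting might look like the sketch below (not necessarily the exact code in #7294; the free-function wrapper is added here just to make the snippet self-contained).

```rust
// Build "(key1 val1,key2 val2)" from the option pairs using map + join,
// avoiding the manual separator bookkeeping of the loop version.
fn format_options(options: &[(String, String)]) -> String {
    let body = options
        .iter()
        .map(|(key, val)| format!("{key} {val}"))
        .collect::<Vec<_>>()
        .join(",");
    format!("({body})")
}
```

The intermediate `Vec<String>` is the extra copy mentioned above; for display-only code on small option lists that cost is negligible.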
I took the liberty of merging up from main to resolve conflicts on this PR.

Thanks @devinjdangelo -- I am merging this PR in and will file follow-on PRs / tickets for the remaining items we have identified. Thanks again 🙏

Thanks @devinjdangelo !
Which issue does this PR close?
closes #5076
closes #6539
Part of #5654
Rationale for this change
In many cases, we want to be able to export data to file(s) in an ObjectStore without first registering an external table. This is possible with `COPY ... TO ...` statements. We can leverage the `FileSink`s created to support inserting into `ListingTable`s for part of this implementation.

What changes are included in this PR?
- Rename `InsertExec` to `FileSinkExec`
- Implement `Copy To` logical and physical plans, executed via `FileSinkExec`
- Update `DataFrame::write_*` methods to use `Copy To` plans
- Add a `per_thread_output` option to `FileSinks` and `Copy To` so the user can specify whether they want only one output file or possibly many

Note that this PR does not yet add support for most statement level settings / overrides. That will be important to implement before closing out #5654.
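To make the single-file vs many-file distinction concrete, here is a toy sketch of the kind of output-path scheme such an option implies. The function, the `part-{i}` naming pattern, and the idea of one file per input partition are assumptions for illustration, not the exact DataFusion implementation.

```rust
// Toy model: with per-thread output enabled, each input partition writes
// its own file under the target path; otherwise all partitions funnel
// into a single output file.
fn output_paths(target: &str, partitions: usize, per_thread_output: bool) -> Vec<String> {
    if per_thread_output {
        (0..partitions)
            .map(|i| format!("{target}/part-{i}.parquet"))
            .collect()
    } else {
        vec![target.to_string()]
    }
}
```

The trade-off is write parallelism (no cross-partition coordination) versus the convenience of a single output file.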
This graphic shows how all of the write related code is wired up after this PR:
Are these changes tested?
Yes, see expanded copy.slt for new tests.
I also have plans to expand insert.slt to improve testing of the recent additions of `insert into` support.

Are there any user-facing changes?
`Copy To` statements (less most statement level overrides) are supported now.

The `DataFrame` write_* APIs have some small changes and will need more changes as support for statement level overrides is added for `Copy To`.