
Implement COPY ... TO statement #5654

Closed
alamb opened this issue Mar 20, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Mar 20, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to copy parquet data from one format or layout to another, for example to see the effects of page pruning (#4085) or of different orderings on compression and other properties.

arrow-rs and DataFusion have all the parts we need (reading from files, sorting, writing to files); we just need to put them together.

We do have a very specialized version in the tpch benchmark driver
https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/benchmarks/src/tpch.rs#L332-L400

Describe the solution you'd like
I would like DataFusion to support DuckDB-style COPY SQL statements.

For example:

-- export the table `t` to data.parquet
COPY t TO 'data.parquet' (FORMAT PARQUET);
-- export as parquet, compressed with ZSTD, with a row_group_size of 100000
COPY t TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);
-- export the output of a query
COPY (SELECT * FROM tbl ORDER BY time) TO 'data.parquet' (FORMAT PARQUET);

Reference:

  1. https://duckdb.org/docs/sql/statements/copy
  2. https://duckdb.org/docs/sql/statements/export

Describe alternatives you've considered
@metesynnada is working on INSERT INTO style syntax in #5130

Bonus points for CSV support (ideally the code structure will allow supporting it in the long term, but it need not be part of the initial PR):

-- export as CSV with the given options
COPY t TO 'data.csv'  (FORMAT CSV, DELIMITER '|', HEADER);

Additional context

#5130 (comment)

@alamb alamb added the enhancement New feature or request label Mar 20, 2023
@doki23
Contributor

doki23 commented Mar 22, 2023

If I'm not mistaken, we could implement it by deleting all the target files and then executing an `INSERT INTO t SELECT ... FROM ...` query, as #5130 proposes. The important thing is that we need to acquire a lock before doing it?
I'd like to contribute this if no one is working on it.

@metesynnada
Contributor

metesynnada commented Mar 22, 2023

At the moment, we do not support the INSERT INTO operation for files. However, we have developed an implementation and are awaiting the version updates of arrow-rs and object-store. You can find more details about this implementation in the following pull request: https://github.com/synnada-ai/arrow-datafusion/pull/62.

In addition to this, the COPY TO syntax is a widely adopted method for sinking data. Therefore, we should introduce a new API in the TableProvider to accommodate this functionality.

@doki23
Contributor

doki23 commented Mar 22, 2023

Therefore, we should introduce a new API in the TableProvider to accommodate this functionality.

COPY TO is close to INSERT INTO ... SELECT FROM. The difference is that COPY needs to clean up existing data, while INSERT does not. So maybe they could share the same API, something like `sink(&self, state: &SessionState, input: &LogicalPlan, truncate: bool)`?

@metesynnada
Contributor

The similarity you mention does not apply to MemTable or ViewTable: INSERT INTO does not imply sinking to a file in these cases. It's possible that some `dyn TableProvider` implementations may not support the INSERT INTO operation, while others may not implement the COPY TO operation. While there is some resemblance, it is essential not to over-generalize the behavior of the ListingTable.

Maybe we can find a better way to reduce duplicated logic.

@doki23
Contributor

doki23 commented Mar 22, 2023

It's possible that some `dyn TableProvider` implementations may not support the INSERT INTO operation, while others may not implement the COPY TO operation.

That makes sense to me.

@timrobertson100

timrobertson100 commented Apr 26, 2023

+1 on this feature

My use case is to run SQL filters over parquet files to produce a CSV file of the output. I was looking for a `CREATE EXTERNAL TABLE ... STORED AS CSV AS SELECT ...` and found this issue.

(Edited to add: and thank you to the contributors)

@alamb
Contributor Author

alamb commented Apr 26, 2023

Thanks @timrobertson100 -- I think we are quite close to having all the pieces -- e.g. #6049 from @metesynnada

@doki23
Contributor

doki23 commented Apr 27, 2023

I've been busy recently and am not working on it. Anyone interested, feel free to take it.

@aprimadi
Contributor

aprimadi commented Apr 27, 2023

The current sqlparser implementation doesn't support an arbitrary query as the source in a `COPY <source> TO ...` statement. This issue and pull request address that:

apache/datafusion-sqlparser-rs#857

@metesynnada
Contributor

The current sqlparser implementation doesn't support an arbitrary query as the source in a `COPY <source> TO ...` statement. This issue and pull request address that:

sqlparser-rs/sqlparser-rs#857

Yeah, it would be a great contribution.

@alamb
Contributor Author

alamb commented May 10, 2023

I had some free time this afternoon, so I hacked up an initial implementation here: #6313

I was able to write a custom parser in DataFusion, as well as hook it up via the same mechanism as used by CREATE TABLE AS SELECT.

@alamb
Contributor Author

alamb commented Aug 17, 2023

The basic feature was completed by @devinjdangelo in #7283. There are still some pieces left, but I am closing this ticket as done for now.

@alamb alamb closed this as completed Aug 17, 2023