
[EPIC] Improve DataFusions ability to work with files #1777

Closed
9 of 16 tasks
matthewmturner opened this issue Feb 7, 2022 · 24 comments · Fixed by #1922
Labels
enhancement (New feature or request)

Comments

@matthewmturner
Contributor

matthewmturner commented Feb 7, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I would like to add functionality for writing files from DataFusion. To start, I've thought of the items below; a rough sketch of the proposed DataFrame-level write API follows the list.

  • (1) Add write functionality to ObjectStore
  • (2) Add write_json to ExecutionContext
  • (3) Add write_arrow to ExecutionContext
  • (4) Add COPY / COPY TO command for SQL (like postgres https://www.postgresql.org/docs/current/sql-copy.html)
  • (5) Add ability to write partitioned datasets
  • (6) Add support for writing metadata
  • (7) Add write_csv method to DataFrame
  • (8) Add write_parquet method to DataFrame
  • (9) Add write_arrow method to DataFrame
  • (10) Add write_json method to DataFrame
  • (11) Add read_json method to DataFrame
  • (12) Add read_json method to ExecutionContext
  • (13) Add register_json method to ExecutionContext
  • (14) Support IF NOT EXISTS for CREATE EXTERNAL TABLE
  • (15) Add read_arrow to ExecutionContext
  • (16) Add read_arrow to DataFrame

I will use this as a parent / tracker issue; each of the above points will get its own issue.
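As a rough illustration only, the proposed DataFrame-level write methods (items 7, 8, and 10 above) might end up looking something like this. None of these methods exist yet; the `ExecutionContext` setup and the signatures shown are assumptions based on the current read-side API:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    let df = ctx.read_csv("input.csv", CsvReadOptions::new()).await?;

    // Proposed item (7): write the DataFrame out as CSV.
    df.write_csv("out/csv/").await?;
    // Proposed item (8): write the DataFrame out as Parquet.
    df.write_parquet("out/parquet/").await?;
    // Proposed item (10): write the DataFrame out as newline-delimited JSON.
    df.write_json("out/json/").await?;

    Ok(())
}
```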


@matthewmturner matthewmturner added the enhancement (New feature or request) label Feb 7, 2022
@matthewmturner
Contributor Author

Just wanted to create this to start collecting feedback / thoughts. I haven't had a chance to really look into the details of this yet. I hope to start in a couple of weeks when I have some more time.

@matthewmturner matthewmturner changed the title from "Add support for COPY command" to "Improve DataFusions ability to write files" Feb 7, 2022
@matthewmturner
Contributor Author

@seddonm1 @houqp FYI - in case you have any thoughts.

@matthewmturner
Contributor Author

@yjshen also interested in your view - in particular on the ObjectStore point.

@Igosuki
Contributor

Igosuki commented Feb 9, 2022

@matthewmturner I'd love to see PARTITION BY implemented, which would output the typical k=v partitions usable by the ListingTableProvider.

@matthewmturner
Contributor Author

> @matthewmturner I'd love to see PARTITION BY implemented, which would output the typical k=v partitions usable by the ListingTableProvider.

@Igosuki Agreed that would be great, I've added it to the list.

@Igosuki
Contributor

Igosuki commented Feb 9, 2022

And, probably, support for writing metadata, both arbitrary and DataFusion-specific, like other Apache engines do?

@matthewmturner
Contributor Author

> And, probably, support for writing metadata, both arbitrary and DataFusion-specific, like other Apache engines do?

Added that as well.

For my information, do you have any examples from other systems that I could reference?

@xudong963
Member

A direct implementation of COPY FROM (copying data from a file into a table): if the table already exists, we can create a new table from the file and union the new table with the old one; otherwise, we can union the new table with an empty table that has the same schema.

BTW, we should first make sqlparser-rs support those commands.
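A minimal sketch of this union idea in terms of the existing DataFrame API (the table name and file path are made up for illustration, and exact signatures vary between DataFusion versions):

```rust
use std::sync::Arc;
use datafusion::error::Result;
use datafusion::prelude::*;

// Rough sketch of COPY FROM as "read the file, then union it with the target
// table if that table already exists"; not an actual implementation.
async fn copy_from_csv(ctx: &mut ExecutionContext) -> Result<Arc<dyn DataFrame>> {
    // Read the source file into a DataFrame.
    let new_rows = ctx.read_csv("input.csv", CsvReadOptions::new()).await?;

    // If the target table exists, union the file's rows with it; otherwise the
    // file's rows alone are the result (conceptually a union with an empty
    // table of the same schema).
    match ctx.table("target") {
        Ok(existing) => existing.union(new_rows),
        Err(_) => Ok(new_rows),
    }
}
```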

@matthewmturner
Contributor Author

> A direct implementation of COPY FROM (copying data from a file into a table): if the table already exists, we can create a new table from the file and union the new table with the old one; otherwise, we can union the new table with an empty table that has the same schema.
>
> BTW, we should first make sqlparser-rs support those commands.

@xudong963 To confirm - does COPY FROM functionality already exist? I'm not sure if I understood correctly, but I was focused on writing tables / dataframes from DataFusion to files for this issue.

And definitely agree on checking sqlparser-rs. I will check that out.

@xudong963
Member

> To confirm - does COPY FROM functionality already exist?

No, looking forward to seeing it. I just described my idea of how to implement it.

@xudong963
Member

> And definitely agree on checking sqlparser-rs. I will check that out.

FYI @matthewmturner apache/datafusion-sqlparser-rs#409

@matthewmturner
Contributor Author

@xudong963 very helpful, thank you.

@Igosuki
Contributor

Igosuki commented Feb 10, 2022

@matthewmturner
For instance: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala
Or, after writing a parquet file with pandas:

############ file meta data ############
created_by: parquet-cpp-arrow version 6.0.1

And of course other arbitrary metadata as well, for other systems such as warehouses.
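For reference, the Rust parquet crate that DataFusion builds on can already attach arbitrary file-level key/value metadata when writing; a DataFusion-level option would presumably surface something like the following (module paths and builder methods differ slightly between parquet crate versions):

```rust
use std::fs::File;
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int32Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::parquet::arrow::ArrowWriter;
use datafusion::parquet::file::metadata::KeyValue;
use datafusion::parquet::file::properties::WriterProperties;

fn write_with_metadata() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    // Arbitrary file-level key/value metadata, e.g. something DataFusion-specific.
    let props = WriterProperties::builder()
        .set_key_value_metadata(Some(vec![KeyValue {
            key: "written_by".to_string(),
            value: Some("datafusion".to_string()),
        }]))
        .build();

    // Write a batch through the Arrow -> Parquet writer with those properties.
    let file = File::create("with_metadata.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```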

@matthewmturner
Contributor Author

@Igosuki Thx much.

Will keep you posted when I start working on this.

@matthewmturner
Contributor Author

From what I see, it looks like only the execution context can write files right now - let me know if I'm mistaken. I think it makes sense to add write functionality to dataframes as well. I've added it to the list.

@Igosuki
Contributor

Igosuki commented Mar 6, 2022

@alamb is write_parquet outside of the scope now?

@matthewmturner
Contributor Author

@alamb FYI I don't think this issue should be closed yet. I'm using it as a tracker with the task list in the description.

@matthewmturner
Contributor Author

matthewmturner commented Mar 6, 2022

@Igosuki just to confirm what you're looking for - you want to be able to write partitioned parquet files from SQL?

Plus writing more metadata.

@Igosuki
Contributor

Igosuki commented Mar 6, 2022 via email

@alamb
Contributor

alamb commented Mar 7, 2022

reopened -- I think GitHub got a little overeager and interpreted the "closes #1777 task 7" comment to mean it should close the ticket

(screenshot attached: Screen Shot 2022-03-07 at 3 49 19 PM)

@matthewmturner matthewmturner changed the title from "Improve DataFusions ability to write files" to "Improve DataFusions ability to work with files" Mar 15, 2022
@matthewmturner
Contributor Author

@Igosuki sorry if I'm being dumb / bad at searching Google, but I haven't been able to find an example / docs of writing partitioned parquet files from SQL - only writing with the dataframe API, or reading a partitioned parquet dataset with SQL. I haven't had the chance to test, but is the command you're looking for something like:

COPY TABLE abc
TO `abc`
STORED AS PARQUET
PARTITION BY year, month

@Igosuki
Contributor

Igosuki commented Mar 20, 2022

Personally, I'm fine using the dataframe API, but the partitioned output isn't available right now in the API? Spark goes with write.partitionBy(...).parquet("")

@matthewmturner
Contributor Author

Ok great - using the dataframe API aligns more with my thinking. It was just the SQL part that was throwing me off. Thx.
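For concreteness, a write-side API mirroring Spark's partitionBy might look roughly like the sketch below. The method name is invented purely for illustration (nothing like it exists in DataFusion today), and the expected on-disk layout is the usual hive-style k=v scheme mentioned earlier in the thread:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

// Hypothetical API only: `write_parquet_partitioned` does not exist in
// DataFusion; it is sketched here to mirror Spark's
// df.write.partitionBy("year", "month").parquet("out/").
async fn write_partitioned(ctx: &mut ExecutionContext) -> Result<()> {
    let df = ctx.read_csv("events.csv", CsvReadOptions::new()).await?;

    // Expected output layout (readable back as partition columns by
    // ListingTable-based providers):
    //   out/year=2022/month=01/part-0.parquet
    //   out/year=2022/month=02/part-0.parquet
    df.write_parquet_partitioned("out/", &["year", "month"]).await?;
    Ok(())
}
```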

@alamb alamb changed the title from "Improve DataFusions ability to work with files" to "[EPIC] Improve DataFusions ability to work with files" Mar 20, 2023
@alamb
Contributor

alamb commented Nov 20, 2023

@alamb alamb closed this as completed Nov 20, 2023