
[EPIC] Improve DataFusions ability to work with files #1777

Closed
9 of 16 tasks
matthewmturner opened this issue Feb 7, 2022 · 24 comments · Fixed by #1922
Labels
enhancement (New feature or request)

Comments

@matthewmturner
Contributor

matthewmturner commented Feb 7, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I would like to add functionality for writing files from DataFusion. To start, I've thought of the items below; a rough sketch of the proposed DataFrame-level write API follows the list.

  • (1) Add write functionality to ObjectStore
  • (2) Add write_json to ExecutionContext
  • (3) Add write_arrow to ExecutionContext
  • (4) Add COPY / COPY TO command for SQL (like postgres https://www.postgresql.org/docs/current/sql-copy.html)
  • (5) Add ability to write partitioned datasets
  • (6) Add support for writing metadata
  • (7) Add write_csv method to DataFrame
  • (8) Add write_parquet method to DataFrame
  • (9) Add write_arrow method to DataFrame
  • (10) Add write_json method to DataFrame
  • (11) Add read_json method to DataFrame
  • (12) Add read_json method to ExecutionContext
  • (13) Add register_json method to ExecutionContext
  • (14) Support IF NOT EXISTS for CREATE EXTERNAL TABLE
  • (15) Add read_arrow to ExecutionContext
  • (16) Add read_arrow to DataFrame

I will use this as a parent / tracker issue; each of the above points will get its own issue.
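As a rough illustration only, the proposed DataFrame-level write methods (items 7, 8, and 10 above) might end up looking something like this. None of these methods exist yet; the `ExecutionContext` setup and the signatures shown are assumptions based on the current read-side API:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    let df = ctx.read_csv("input.csv", CsvReadOptions::new()).await?;

    // Proposed item (7): write the DataFrame out as CSV.
    df.write_csv("out/csv/").await?;
    // Proposed item (8): write the DataFrame out as Parquet.
    df.write_parquet("out/parquet/").await?;
    // Proposed item (10): write the DataFrame out as newline-delimited JSON.
    df.write_json("out/json/").await?;

    Ok(())
}
```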


@matthewmturner matthewmturner added the enhancement (New feature or request) label Feb 7, 2022
@matthewmturner
Contributor Author

Just wanted to create this to start collecting feedback / thoughts. I haven't had a chance to really look into the details of this yet. I hope to start in a couple of weeks when I have some more time.

@matthewmturner matthewmturner changed the title from "Add support for COPY command" to "Improve DataFusions ability to write files" Feb 7, 2022
@matthewmturner
Contributor Author

@seddonm1 @houqp FYI - in case you have any thoughts.

@matthewmturner
Contributor Author

@yjshen also interested in your view - in particular on the ObjectStore point.

@Igosuki
Contributor

Igosuki commented Feb 9, 2022

@matthewmturner I'd love to see PARTITION BY implemented, which would output the typical k=v partitions usable by the ListingTableProvider.

@matthewmturner
Contributor Author

> @matthewmturner I'd love to see PARTITION BY implemented, which would output the typical k=v partitions usable by the ListingTableProvider.

@Igosuki Agreed that would be great, I've added it to the list.

@Igosuki
Contributor

Igosuki commented Feb 9, 2022

And, probably, support for writing metadata, both arbitrary and DataFusion-specific, like other Apache engines do?

@matthewmturner
Contributor Author

> And, probably, support for writing metadata, both arbitrary and DataFusion-specific, like other Apache engines do?

Added that as well.

For my information, do you have any examples from other systems that I could reference?

@xudong963
Member

A direct implementation of COPY FROM (copying data from a file into a table): if the table already exists, we can create a new table from the file and union the new table with the old one; otherwise, we can union the new table with an empty table that has the same schema.

BTW, we should first make sqlparser-rs support those commands.
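A minimal sketch of this union idea in terms of the existing DataFrame API (the table name and file path are made up for illustration, and exact signatures vary between DataFusion versions):

```rust
use std::sync::Arc;
use datafusion::error::Result;
use datafusion::prelude::*;

// Rough sketch of COPY FROM as "read the file, then union it with the target
// table if that table already exists"; not an actual implementation.
async fn copy_from_csv(ctx: &mut ExecutionContext) -> Result<Arc<dyn DataFrame>> {
    // Read the source file into a DataFrame.
    let new_rows = ctx.read_csv("input.csv", CsvReadOptions::new()).await?;

    // If the target table exists, union the file's rows with it; otherwise the
    // file's rows alone are the result (conceptually a union with an empty
    // table of the same schema).
    match ctx.table("target") {
        Ok(existing) => existing.union(new_rows),
        Err(_) => Ok(new_rows),
    }
}
```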

@matthewmturner
Contributor Author

> A direct implementation of COPY FROM (copying data from a file into a table): if the table already exists, we can create a new table from the file and union the new table with the old one; otherwise, we can union the new table with an empty table that has the same schema.
>
> BTW, we should first make sqlparser-rs support those commands.

@xudong963 To confirm - does COPY FROM functionality already exist? I'm not sure if I understood correctly, but I was focused on writing tables / dataframes from DataFusion to files for this issue.

And definitely agree on checking sqlparser-rs. I will check that out.

@xudong963
Member

> To confirm - does COPY FROM functionality already exist?

No, looking forward to seeing it. I just described my idea of how to implement it.

@xudong963
Member

> And definitely agree on checking sqlparser-rs. I will check that out.

FYI @matthewmturner apache/datafusion-sqlparser-rs#409

@matthewmturner
Contributor Author

@xudong963 very helpful, thank you.

@Igosuki
Contributor

Igosuki commented Feb 10, 2022

@matthewmturner
For instance: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala
Or, after writing a parquet file with pandas:

############ file meta data ############
created_by: parquet-cpp-arrow version 6.0.1

And of course other arbitrary metadata as well, for other systems such as warehouses.
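For reference, the Rust parquet crate that DataFusion builds on can already attach arbitrary file-level key/value metadata when writing; a DataFusion-level option would presumably surface something like the following (module paths and builder methods differ slightly between parquet crate versions):

```rust
use std::fs::File;
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int32Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::parquet::arrow::ArrowWriter;
use datafusion::parquet::file::metadata::KeyValue;
use datafusion::parquet::file::properties::WriterProperties;

fn write_with_metadata() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    // Arbitrary file-level key/value metadata, e.g. something DataFusion-specific.
    let props = WriterProperties::builder()
        .set_key_value_metadata(Some(vec![KeyValue {
            key: "written_by".to_string(),
            value: Some("datafusion".to_string()),
        }]))
        .build();

    // Write a batch through the Arrow -> Parquet writer with those properties.
    let file = File::create("with_metadata.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```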

@matthewmturner
Contributor Author

@Igosuki Thx much.

Will keep you posted when I start working on this.

@matthewmturner
Contributor Author

From what I see, it looks like only the execution context can write files right now - let me know if I'm mistaken. I think it makes sense to add write functionality to dataframes as well. I've added it to the list.

@Igosuki
Contributor

Igosuki commented Mar 6, 2022

@alamb is write_parquet outside of the scope now?

@matthewmturner
Contributor Author

@alamb FYI I don't think this issue should be closed yet. I'm using it as a tracker with the task list in the description.

@matthewmturner
Contributor Author

matthewmturner commented Mar 6, 2022

@Igosuki just to confirm what you're looking for - you want to be able to write partitioned parquet files from SQL?

Plus writing more metadata.

@Igosuki
Contributor

Igosuki commented Mar 6, 2022 via email

@alamb
Contributor

alamb commented Mar 7, 2022

reopened -- I think GitHub got a little overeager and interpreted the "closes #1777 task 7" comment to mean it should close the ticket

(screenshot attached: Screen Shot 2022-03-07 at 3 49 19 PM)

@matthewmturner matthewmturner changed the title from "Improve DataFusions ability to write files" to "Improve DataFusions ability to work with files" Mar 15, 2022
@matthewmturner
Contributor Author

@Igosuki sorry if I'm being dumb / bad at searching Google, but I haven't been able to find an example / docs of writing partitioned parquet files from SQL - only writing with the dataframe API, or reading a partitioned parquet dataset with SQL. I haven't had the chance to test, but is the command you're looking for something like:

COPY TABLE abc
TO `abc`
STORED AS PARQUET
PARTITION BY year, month

@Igosuki
Contributor

Igosuki commented Mar 20, 2022

Personally, I'm fine using the dataframe API, but the partitioned output isn't available right now in the API? Spark goes with write.partitionBy(...).parquet("")

@matthewmturner
Contributor Author

Ok great - using the dataframe API aligns more with my thinking. It was just the SQL part that was throwing me off. Thx.
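For concreteness, a write-side API mirroring Spark's partitionBy might look roughly like the sketch below. The method name is invented purely for illustration (nothing like it exists in DataFusion today), and the expected on-disk layout is the usual hive-style k=v scheme mentioned earlier in the thread:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

// Hypothetical API only: `write_parquet_partitioned` does not exist in
// DataFusion; it is sketched here to mirror Spark's
// df.write.partitionBy("year", "month").parquet("out/").
async fn write_partitioned(ctx: &mut ExecutionContext) -> Result<()> {
    let df = ctx.read_csv("events.csv", CsvReadOptions::new()).await?;

    // Expected output layout (readable back as partition columns by
    // ListingTable-based providers):
    //   out/year=2022/month=01/part-0.parquet
    //   out/year=2022/month=02/part-0.parquet
    df.write_parquet_partitioned("out/", &["year", "month"]).await?;
    Ok(())
}
```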

@alamb alamb changed the title from "Improve DataFusions ability to work with files" to "[EPIC] Improve DataFusions ability to work with files" Mar 20, 2023
@alamb
Contributor

alamb commented Nov 20, 2023

@alamb alamb closed this as completed Nov 20, 2023