[EPIC] Improve DataFusion's ability to work with files #1777
Comments
Just wanted to create this to start collecting feedback / thoughts. I haven't had a chance to really look into the details of this yet. I hope to start in a couple of weeks when I have some more time.
@yjshen also interested in your view - in particular on the …
@matthewmturner I'd love to see `PARTITION BY` implemented, which would output typical k=v partitions usable by the `ListingTableProvider`.
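(For reference, a typical k=v, i.e. hive-style, partition layout on disk looks like this; paths are illustrative:)

```
/data/sales/year=2021/month=01/part-0.parquet
/data/sales/year=2021/month=02/part-0.parquet
/data/sales/year=2022/month=01/part-0.parquet
```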
@Igosuki Agreed, that would be great - I've added it to the list.
And, probably, support writing metadata, both arbitrary and specific to DataFusion, like other Apache engines do?
Added that as well. For my information, do you have any examples from other systems that I could reference?
The direct implementation of … BTW, we should make …
@xudong963 To confirm - does … And definitely agree on checking …
No, looking forward to seeing it. I just described my idea of how to implement it.
@xudong963 very helpful, thank you.
@matthewmturner Of course, then other arbitrary metadata, for other systems such as warehouses.
@Igosuki Thanks much. Will keep you posted when I start working on this.
From what I see, it looks like only the execution context can write files right now - let me know if I'm mistaken. I think it makes sense to add write functionality to dataframes as well. I've added it to the list.
@alamb is `write_parquet` outside of the scope now?
@alamb FYI I don't think this issue should be closed yet. I'm using it as a tracker with the task list in the description.
@Igosuki just to confirm what you're looking for - you want to be able to write partitioned parquet files from SQL? Plus writing more metadata.
Exactly, like we can do with Spark.
Reopened - I think GitHub got a little overeager and interpreted the "closes #1777 task 7" comment to mean it should close the ticket.
@Igosuki sorry if I'm being dumb / bad at searching Google, but I haven't been able to find an example or docs of writing partitioned parquet files from SQL - only writing with the dataframe API, or reading a partitioned parquet dataset with SQL. I haven't had the chance to test, but is the command you're looking for something like:
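```sql
-- Hypothetical syntax only, loosely modeled on Spark SQL's
-- CREATE TABLE ... PARTITIONED BY ... AS SELECT; the table and column
-- names are illustrative, and DataFusion may not support this form.
CREATE EXTERNAL TABLE sales
STORED AS PARQUET
PARTITIONED BY (year, month)
LOCATION '/tmp/sales/'
AS SELECT * FROM staging;
```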
Personally, I'm fine using the dataframe API, but the partitioned output isn't available in the API right now? Spark goes with write.partitionBy(...).parquet("").
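(A purely hypothetical sketch of the analogous shape on a DataFusion `DataFrame`; `write_parquet_partitioned` is an invented name for illustration, not an existing API:)

```rust
// Hypothetical only: mirrors Spark's df.write.partitionBy("year", "month")
// .parquet(path). DataFusion's DataFrame has no such method today.
df.write_parquet_partitioned("/tmp/sales/", &["year", "month"]).await?;
```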
Ok great - using the dataframe API aligns more with my thinking. It was just the SQL part that was throwing me off. Thanks.
We have pretty much completed the list on this ticket now. It is pretty amazing to see https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_avro
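(A minimal usage sketch of that method, assuming the `avro` feature is enabled; the file path is illustrative:)

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Read an Avro file into a DataFrame via the linked method.
    let df = ctx
        .read_avro("data/example.avro", AvroReadOptions::default())
        .await?;
    df.show().await?;
    Ok(())
}
```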
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to add functionality for writing files from DataFusion. To start, I've thought of the items below.
- Write support for `ObjectStore`
- `write_json` to `ExecutionContext`
- `write_arrow` to `ExecutionContext`
- `COPY`/`COPY TO` command for SQL (like Postgres https://www.postgresql.org/docs/current/sql-copy.html; see the example after this list)
- `write_csv` method to `DataFrame`
- `write_parquet` method to `DataFrame`
- `write_arrow` method to `DataFrame`
- `write_json` method to `DataFrame`
- `read_json` method to `DataFrame`
- `read_json` method to `ExecutionContext`
- `register_json` method to `ExecutionContext`
- `IF NOT EXISTS` for `CREATE EXTERNAL TABLE`
- `read_arrow` to `ExecutionContext`
- `read_arrow` to `DataFrame`
I will use this as a parent / tracker issue for the above points, each of which will have its own issue.