Write splinkdf to csv parquet #1194
Conversation
Test: test_2_rounds_1k_duckdb, percentage change: -30.3%
Test: test_2_rounds_1k_sqlite, percentage change: -25.9%
Click here for Vega-Lite time series charts
On DuckDB I believe they're adding native save-to-parquet and save-to-CSV functions, and I think they're already there in the Python API. If so, perhaps we could get away with adding something like as_duckdb_table, which itself allows for writing to parquet and CSV, similar to as_spark_dataframe.
Edit: yes, here's example 0.7 code.
Not completely sure how this interacts with the connection.
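For reference, a minimal sketch of what the DuckDB-native route could look like, using `COPY ... TO` on a DuckDB connection. The table name `df_predict` and the output paths are placeholders, not Splink's actual API:

```python
import duckdb

con = duckdb.connect()

# Stand-in for a table the linker would already have registered on its connection.
con.execute("CREATE TABLE df_predict AS SELECT 1 AS id, 0.95 AS match_probability")

# COPY ... TO streams the query result straight to disk, so nothing is
# materialised as a pandas DataFrame on the way out.
con.execute("COPY (SELECT * FROM df_predict) TO 'df_predict.parquet' (FORMAT PARQUET)")
con.execute("COPY (SELECT * FROM df_predict) TO 'df_predict.csv' (FORMAT CSV, HEADER)")
```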
splink/spark/spark_linker.py
# "header", "true").save(filepath) | ||
# but this partitions the data and means pandas can't load it in | ||
|
||
self.as_pandas_dataframe().to_csv(filepath, index=False) |
Do we definitely want to turn it into a pandas dataframe here? That will result in poor performance or out-of-memory errors for large outputs. Should we use spark.write.csv?
Initially I was using spark.write.csv for this (as you can see from the commented-out text). I switched primarily because:
- Pandas doesn't seem to be able to read the output of write.csv, though I did only limited testing in trying to get this to work.
- It's less intuitive to have multiple CSV files for people who aren't used to that.

I'm not sure why you'd use CSVs over parquet for any reason other than familiarity and wanting to use other tools, such as Excel, more easily.

To your points above: yes, since we are removing the repartitioning and coalescing, this approach takes a significant performance hit and may well cause out-of-memory issues.

If you think it's best to use Spark's native write.csv method then I'm happy to go with that. I don't think it will be used all that often anyway.
Yep, I was skimming so I missed the commented-out code; apologies.

Thinking about it a little more, I still think we should go with write.csv, because:
- If the user has chosen the SparkLinker, they're probably working with data too big for DuckDB.
- If so, running as_pandas_dataframe() is likely to take a long time or fail.
- The user has the option to collect the whole dataframe to the driver explicitly if that's what they actually want.

I do see the argument re: reading into pandas. Another way of looking at it is that if they're using the SparkLinker, they may be more likely to read the result back into Spark (rather than pandas) anyway, and there are other tools (like DuckDB and Arrow) that support reading folders of CSVs.
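For illustration, a minimal sketch of the Spark-native route being discussed. The names `spark_df` and `filepath` are placeholders rather than Splink's internals, and the commented-out coalesce(1) is optional (a single part-file, at the cost of funnelling all rows through one task):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, 0.95)], ["id", "match_probability"])
filepath = "predictions_csv"

(
    spark_df
    # .coalesce(1)  # optional: single part-file, but all rows pass through one task
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv(filepath)  # writes a folder of part-*.csv files, not a single file
)
```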
Code makes sense from my end and runs as expected, so I'm happy to approve it. 👍

One thought I had was whether it's worth implementing to_csv() and to_parquet() for Athena in this PR too. I know you've added the general methods in splink_dataframe.py to throw a NotImplementedError, but this feels like quite a useful feature which shouldn't be too difficult to implement. Happy to leave it for SQLite.
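For context, the base-class pattern being referred to looks roughly like this (a sketch, not the actual splink_dataframe.py source; the overwrite parameter is an assumption):

```python
class SplinkDataFrame:
    """Sketch of the abstract wrapper around a backend-specific results table."""

    def to_csv(self, filepath, overwrite=False):
        # Backends that support it (e.g. DuckDB, Spark) override this method.
        raise NotImplementedError("to_csv is not supported for this linker")

    def to_parquet(self, filepath, overwrite=False):
        raise NotImplementedError("to_parquet is not supported for this linker")
```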
The reason I didn't add this to Athena is that the tables will already exist on S3 as parquet files as part of the process. However, on reflection it would probably be useful as an option for exporting files to a specific location. I'd suggest for parquet we:

For CSV we'll probably just need to write the raw data directly to CSV.
A nice quick one for you both.

This adds the ability to write SplinkDataFrames directly to CSV and parquet, which should simplify some of the code in our splink3 workflows.

I've left my commented-out notes in Spark's to_csv deliberately, but can remove them if you think they're not useful.
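As a rough usage sketch of the new methods (the import path and signatures are assumed from splink 3 at the time; `df` and `settings` are placeholders):

```python
from splink.duckdb.duckdb_linker import DuckDBLinker

# `df` (a pandas DataFrame of input records) and `settings` (a Splink settings
# dict) are assumed to be defined elsewhere.
linker = DuckDBLinker(df, settings)
df_predict = linker.predict()

# The methods added in this PR: write results out without an explicit
# round-trip through pandas (exact signatures may differ).
df_predict.to_csv("predictions.csv")
df_predict.to_parquet("predictions.parquet")
```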