
Write splinkdf to csv parquet #1194

Merged - 7 commits merged into master, May 16, 2023
Conversation

ThomasHepworth (Contributor)

A nice quick one for you both.

This adds the ability to write SplinkDataFrames directly to CSV and parquet. This should simplify some of the code in our splink3 workflows.

I've left my commented notes in for Spark's to_csv deliberately, but can remove them if you think they're not useful.
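
Intended usage looks something like the following sketch (`linker` is a stand-in for any already-configured Splink linker; the exact signatures are illustrative, not copied from the diff):

```python
# Minimal usage sketch; the method names match this PR,
# but the exact signatures/kwargs here are assumptions.
predictions = linker.predict()  # returns a SplinkDataFrame

predictions.to_csv("predictions.csv")
predictions.to_parquet("predictions.parquet")
```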

github-actions bot (Contributor) commented Apr 19, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -30.3%

|      | date time           | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw                    | machine_info_cpu_hz_actual_friendly | commit_hash |
|------|---------------------|------------|-----------|--------------------|----------------|-----------------------------------------------|-------------------------------------|-------------|
| 849  | 2022-07-12 18:40:05 | 1.89098    | 1.87463   | splink3            | c334bb9        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz                          | c334bb9     |
| 1599 | 2023-04-26 14:19:55 | 1.32098    | 1.30576   | (detached head)    | 15b5cff        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7935 GHz                          | 15b5cff     |

Test: test_2_rounds_1k_sqlite

Percentage change: -25.9%

|      | date time           | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw                    | machine_info_cpu_hz_actual_friendly | commit_hash |
|------|---------------------|------------|-----------|--------------------|----------------|-----------------------------------------------|-------------------------------------|-------------|
| 851  | 2022-07-12 18:40:05 | 4.32179    | 4.25898   | splink3            | c334bb9        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz                          | c334bb9     |
| 1601 | 2023-04-26 14:19:55 | 3.16183    | 3.15682   | (detached head)    | 15b5cff        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7935 GHz                          | 15b5cff     |


RobinL (Member) commented Apr 19, 2023

On DuckDB, I believe they're adding native save-to-parquet and save-to-CSV functions - I think they're already there in the Python API:
https://duckdb.org/2023/02/13/announcing-duckdb-070.html#python-api-improvements
Sorry, on my phone, but it's probably worth checking whether this can reduce the amount of Splink code we need.

If so, perhaps we could get away with adding something like as_duckdb_table, which itself allows for writing to parquet and CSV, similar to as_spark_dataframe.

Edit: yeah, so here's example 0.7 code:

```python
df = duckdb.sql(sql)
df.to_parquet("data.parquet")
```

Not completely sure how this interacts with the connection.
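
Presumably something similar works against an explicit connection (sketch only - the relation methods mirror the to_parquet call above, and whether Splink's internal connection can be reused like this is exactly what needs checking):

```python
import duckdb

# Explicit connection rather than the module-level default one;
# the file and table names here are purely illustrative.
con = duckdb.connect("splink.duckdb")

rel = con.sql("SELECT * FROM my_table")
rel.to_parquet("data.parquet")  # no round trip through pandas
rel.to_csv("data.csv")
```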

# "header", "true").save(filepath)
# but this partitions the data and means pandas can't load it in

self.as_pandas_dataframe().to_csv(filepath, index=False)
RobinL (Member)
Do we definitely want to turn it into a Pandas dataframe here? It will result in bad performance/out of memory for large outputs. Should we use spark.write.csv?

ThomasHepworth (Contributor, Author)

Initially I was using spark.write.csv for this (as you can see from the commented text).

I switched primarily because:

  • Pandas doesn't seem to be able to read the output of write.csv - though I did limited testing in trying to get this to work.
  • For people who aren't used to it, having multiple csv files is less intuitive.

I'm not sure why you'd use CSVs over parquet for any reason other than familiarity, or wanting to use other tools such as Excel more easily.

To your points above - yes, as we are removing the repartitioning and coalescing, we will take a significant performance hit and may well cause OOM issues with this approach.

If you think it's best to use the native write.csv method in Spark then I'm happy to go with that. I don't think it will be used all that often anyway.

RobinL (Member) commented Apr 19, 2023

Yep - was skimming, so I missed the commented-out code - apologies.

Thinking about it a little more, I still think we should go with write.csv because:

  • If the user has chosen the SparkLinker, then they're probably working with data too big for DuckDB
  • If so, running as_pandas_dataframe() is likely to take a long time or fail
  • The user has the option to do this explicitly if they actually want to collect the whole dataframe to the driver

I do see the argument re: reading into pandas - I guess another way of looking at it is that if they're using the SparkLinker, they may be more likely to read the result into Spark (rather than pandas) anyway, and there are other tools (like duckdb and arrow) that support reading folders of csvs.
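
For concreteness, the Spark-native route would look something like this (sketch only - the header option is from the commented-out code above, the rest is illustrative):

```python
# spark_df is the underlying pyspark DataFrame behind the SplinkDataFrame
(
    spark_df.write
    .mode("overwrite")
    .option("header", "true")
    .csv(filepath)  # writes a *folder* of part files, not a single csv
)

# folders of csvs can then be read back with other tools, e.g. duckdb:
# duckdb.sql(f"SELECT * FROM read_csv_auto('{filepath}/*.csv')")
```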

RossKen (Contributor) left a comment

Code makes sense from my end and runs as expected, so I'm happy to approve it. 👍

One thought I had was whether it was worth implementing to_csv() and to_parquet() in this PR for Athena too. I know that you have added the general methods in splink_dataframe.py to throw a NotImplementedError, but this feels like quite a useful feature which shouldn't be too difficult to implement. Happy to leave it for SQLite.

ThomasHepworth (Contributor, Author) commented Apr 27, 2023

The reason I didn't add this to Athena is that the tables will already exist on s3 as parquet files as part of the process.

However, on reflection it would probably be useful as an option for exporting files to a specific location.

I'd suggest that for parquet we:

  1. Check if the SplinkDF exists on s3
  2. If it does, copy those parquet files across to the newly selected area
  3. If it doesn't, use awswrangler to export the tables to a location in s3

For csv we'll probably just need to write the raw data directly to csv.

I'll add that as a separate PR anyway, as it will be relatively involved (rough sketch below).
Sorry, I skimmed the bit about "adding Athena to this PR...". I'll add it and re-request a review.
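
Roughly what I have in mind for steps 2 and 3 (the awswrangler calls are real library functions, but the helper, paths, and surrounding logic are just a sketch):

```python
import awswrangler as wr

def export_to_s3_parquet(splink_df, source_path: str, target_path: str):
    """Hypothetical helper following the steps above."""
    existing = wr.s3.list_objects(source_path)
    if existing:
        # Step 2: parquet files already exist on s3 - copy them across
        wr.s3.copy_objects(
            paths=existing,
            source_path=source_path,
            target_path=target_path,
        )
    else:
        # Step 3: export the table to s3 via pandas + awswrangler
        wr.s3.to_parquet(
            df=splink_df.as_pandas_dataframe(),
            path=target_path,
            dataset=True,
        )
```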

ThomasHepworth merged commit 7e5d06c into master on May 16, 2023
ThomasHepworth deleted the write_splinkdf_to_csv_parquet branch on May 16, 2023