
Write splinkdf to csv parquet #1194

Merged - 7 commits merged into master, May 16, 2023
Conversation

ThomasHepworth (Contributor)

A nice quick one for you both.

This adds the ability to write SplinkDataFrames directly to CSV and parquet. This should simplify some of the code in our splink3 workflows.

I've left my commented notes in for Spark's to_csv deliberately, but can remove them if you think they're not useful.
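
Intended usage looks something like the following sketch (`linker` is a stand-in for any already-configured Splink linker; the exact signatures are illustrative, not copied from the diff):

```python
# Minimal usage sketch; the method names match this PR,
# but the exact signatures/kwargs here are assumptions.
predictions = linker.predict()  # returns a SplinkDataFrame

predictions.to_csv("predictions.csv")
predictions.to_parquet("predictions.parquet")
```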

github-actions bot (Contributor) commented Apr 19, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -30.3%

|      | date time           | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw                    | machine_info_cpu_hz_actual_friendly | commit_hash |
|------|---------------------|------------|-----------|--------------------|----------------|-----------------------------------------------|-------------------------------------|-------------|
| 849  | 2022-07-12 18:40:05 | 1.89098    | 1.87463   | splink3            | c334bb9        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz                          | c334bb9     |
| 1599 | 2023-04-26 14:19:55 | 1.32098    | 1.30576   | (detached head)    | 15b5cff        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7935 GHz                          | 15b5cff     |

Test: test_2_rounds_1k_sqlite

Percentage change: -25.9%

|      | date time           | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw                    | machine_info_cpu_hz_actual_friendly | commit_hash |
|------|---------------------|------------|-----------|--------------------|----------------|-----------------------------------------------|-------------------------------------|-------------|
| 851  | 2022-07-12 18:40:05 | 4.32179    | 4.25898   | splink3            | c334bb9        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz                          | c334bb9     |
| 1601 | 2023-04-26 14:19:55 | 3.16183    | 3.15682   | (detached head)    | 15b5cff        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7935 GHz                          | 15b5cff     |


RobinL (Member) commented Apr 19, 2023

On DuckDB, I believe they're adding native save-to-parquet and save-to-CSV functions - I think they're already there in the Python API:
https://duckdb.org/2023/02/13/announcing-duckdb-070.html#python-api-improvements
Sorry, on my phone, but it's probably worth checking whether this can reduce the amount of Splink code we need.

If so, perhaps we could get away with adding something like as_duckdb_table, which itself allows for writing to parquet and CSV, similar to as_spark_dataframe.

Edit: yeah, so here's example 0.7 code:

```python
df = duckdb.sql(sql)
df.to_parquet("data.parquet")
```

Not completely sure how this interacts with the connection.
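
Presumably something similar works against an explicit connection (sketch only - the relation methods mirror the to_parquet call above, and whether Splink's internal connection can be reused like this is exactly what needs checking):

```python
import duckdb

# Explicit connection rather than the module-level default one;
# the file and table names here are purely illustrative.
con = duckdb.connect("splink.duckdb")

rel = con.sql("SELECT * FROM my_table")
rel.to_parquet("data.parquet")  # no round trip through pandas
rel.to_csv("data.csv")
```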

# "header", "true").save(filepath)
# but this partitions the data and means pandas can't load it in

self.as_pandas_dataframe().to_csv(filepath, index=False)
RobinL (Member)
Do we definitely want to turn it into a Pandas dataframe here? It will result in bad performance/out of memory for large outputs. Should we use spark.write.csv?

ThomasHepworth (Contributor, Author)

Initially I was using spark.write.csv for this (as you can see from the commented text).

I switched primarily because:

  • Pandas doesn't seem to be able to read the output of write.csv - though I did limited testing in trying to get this to work.
  • For people who aren't used to it, having multiple csv files is less intuitive.

I'm not sure why you'd use CSVs over parquet for any reason other than familiarity, or wanting to use other tools such as Excel more easily.

To your points above - yes, as we are removing the repartitioning and coalescing, we will take a significant performance hit and may well cause OOM issues with this approach.

If you think it's best to use the native write.csv method in Spark then I'm happy to go with that. I don't think it will be used all that often anyway.

RobinL (Member) commented Apr 19, 2023

Yep - was skimming, so I missed the commented-out code - apologies.

Thinking about it a little more, I still think we should go with write.csv because:

  • If the user has chosen the SparkLinker, then they're probably working with data too big for DuckDB
  • If so, running as_pandas_dataframe() is likely to take a long time or fail
  • The user has the option to do this explicitly if they actually want to collect the whole dataframe to the driver

I do see the argument re: reading into pandas - I guess another way of looking at it is that if they're using the SparkLinker, they may be more likely to read the result into Spark (rather than pandas) anyway, and there are other tools (like duckdb and arrow) that support reading folders of csvs.
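
For concreteness, the Spark-native route would look something like this (sketch only - the header option is from the commented-out code above, the rest is illustrative):

```python
# spark_df is the underlying pyspark DataFrame behind the SplinkDataFrame
(
    spark_df.write
    .mode("overwrite")
    .option("header", "true")
    .csv(filepath)  # writes a *folder* of part files, not a single csv
)

# folders of csvs can then be read back with other tools, e.g. duckdb:
# duckdb.sql(f"SELECT * FROM read_csv_auto('{filepath}/*.csv')")
```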

RossKen (Contributor) left a comment

Code makes sense from my end and runs as expected, so I'm happy to approve it. 👍

One thought I had was whether it was worth implementing to_csv() and to_parquet() in this PR for Athena too. I know that you have added the general methods in splink_dataframe.py to throw a NotImplementedError, but this feels like quite a useful feature which shouldn't be too difficult to implement. Happy to leave it for SQLite.

ThomasHepworth (Contributor, Author) commented Apr 27, 2023

The reason I didn't add this to Athena is that the tables will already exist on s3 as parquet files as part of the process.

However, on reflection it would probably be useful as an option for exporting files to a specific location.

I'd suggest that for parquet we:

  1. Check if the SplinkDF exists on s3
  2. If it does, copy those parquet files across to the newly selected area
  3. If it doesn't, use awswrangler to export the tables to a location in s3

For csv we'll probably just need to write the raw data directly to csv.

I'll add that as a separate PR anyway, as it will be relatively involved (rough sketch below).
Sorry, I skimmed the bit about "adding Athena to this PR...". I'll add it and re-request a review.
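
Roughly what I have in mind for steps 2 and 3 (the awswrangler calls are real library functions, but the helper, paths, and surrounding logic are just a sketch):

```python
import awswrangler as wr

def export_to_s3_parquet(splink_df, source_path: str, target_path: str):
    """Hypothetical helper following the steps above."""
    existing = wr.s3.list_objects(source_path)
    if existing:
        # Step 2: parquet files already exist on s3 - copy them across
        wr.s3.copy_objects(
            paths=existing,
            source_path=source_path,
            target_path=target_path,
        )
    else:
        # Step 3: export the table to s3 via pandas + awswrangler
        wr.s3.to_parquet(
            df=splink_df.as_pandas_dataframe(),
            path=target_path,
            dataset=True,
        )
```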

ThomasHepworth merged commit 7e5d06c into master on May 16, 2023
ThomasHepworth deleted the write_splinkdf_to_csv_parquet branch on May 16, 2023