
[BUG] fix how nulls are registered in pyspark when loading a pandas df #1373

Merged
merged 2 commits into master from fix_spark_duckdb_registration on Jul 4, 2023

Conversation

ThomasHepworth
Contributor

Type of PR

  • BUG

Give a brief description for the solution you have provided

A small fix to adjust how nulls/nans are registered in pyspark when using createDataFrame (as we do in register_table).

This simply fills any nulls that are found with np.nan, which should fix the issue at a small computational cost.

Given that we recommend users feed the SparkLinker a Spark df, I think it's fine to go with this approach, as it will rarely trigger.
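
A minimal sketch of the approach, based on the description above (the helper name and its placement are illustrative, not the actual diff):

import numpy as np
import pandas as pd

def _fill_pandas_nulls(df: pd.DataFrame) -> pd.DataFrame:
    # Normalise pandas missing-value markers (None, pd.NA) to np.nan so
    # that createDataFrame sees a single, consistent sentinel rather than
    # a mix it can trip over during type inference.
    return df.fillna(np.nan)

# e.g. inside register_table, before handing the frame to Spark:
# spark_df = spark.createDataFrame(_fill_pandas_nulls(pandas_df))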

@github-actions
Contributor

github-actions bot commented Jun 27, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: 2.0%

| | date_time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 849 | 2022-07-12 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1800 | 2023-07-04 14:19:54 | 1.91842 | 1.91283 | (detached head) | de562fa | Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz | 2.2947 GHz | de562fa |

Test: test_2_rounds_1k_sqlite

Percentage change: 0.3%

| | date_time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 851 | 2022-07-12 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1802 | 2023-07-04 14:19:54 | 4.30157 | 4.26977 | (detached head) | de562fa | Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz | 2.2947 GHz | de562fa |

Click here for vega lite time series charts

@RobinL
Member

RobinL commented Jun 27, 2023

Looks good. Are there any other replacements we want to do?

Here was something we previously suggested:

import numpy as np
import pandas as pd

data = [
    {'a': '', 'b': pd.NA, 'c': np.nan}
]

df = pd.DataFrame(data)

# col a: convert empty/whitespace-only strings to None
df = df.replace(r'^\s*$', None, regex=True)

# cols b and c: normalise pd.NA and np.nan to None
df2 = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])

I don't think we should do the zero-length string ('') to None conversion in this PR, but how about pd.NA?

(I must admit, I don't understand the difference between np.nan and pd.NA, so I'm not sure the above code is correct...)
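
In short: np.nan is a plain float that never compares equal to itself, while pd.NA is pandas' dtype-aware missing value that propagates through comparisons. A quick illustration of the difference (stock numpy/pandas behaviour):

import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False - NaN is a float and never equals itself
print(type(np.nan))      # <class 'float'>

print(pd.NA == 1)        # <NA> - pd.NA propagates through comparisons

# Both are treated as missing by pandas
print(pd.Series([np.nan, pd.NA]).isna().tolist())  # [True, True]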

@ThomasHepworth
Contributor Author

@RobinL sorry for the delay, but could you check you're happy with the minor edit and approve?

The only adjustment we may want to make is the addition of an argument controlling whether this cleaning takes place (see the sketch below).

I'd lean towards not adding this in, since this change resolves a bug in the pandas -> Spark registration process. I can't see a case where a user would want pd.NA or np.nan registered as anything other than nulls.
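
For illustration, a hypothetical shape for that opt-out (the clean_nulls flag and this signature are invented for the sketch; splink's actual register_table differs, and the flag was deliberately not added):

import numpy as np

def register_table(spark, pandas_df, table_name, clean_nulls=True):
    # clean_nulls=True applies the fillna(np.nan) normalisation from this
    # PR; clean_nulls=False would register the frame untouched.
    if clean_nulls:
        pandas_df = pandas_df.fillna(np.nan)
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.createOrReplaceTempView(table_name)
    return spark_df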

We may also need to check this upon the release of pandas 2.0.

@RobinL RobinL left a comment
Member

Yeah, I think this is fine to do automatically. I can't think of an instance when you wouldn't want it, and if you want complete control, you can just use Spark natively to load the data.

@ThomasHepworth ThomasHepworth merged commit d89f60e into master Jul 4, 2023
@ThomasHepworth ThomasHepworth deleted the fix_spark_duckdb_registration branch July 4, 2023 19:56