
[BUG] fix how nulls are registered in pyspark when loading a pandas df #1373

Merged
merged 2 commits into master from fix_spark_duckdb_registration on Jul 4, 2023

Conversation

ThomasHepworth
Contributor

Type of PR

  • BUG

Give a brief description for the solution you have provided

A small fix to adjust how nulls/nans are registered in pyspark when using createDataFrame (as we do in register_table).

This simply fills any nulls that are found with np.nan, which should fix the issue at a small computational cost.

Given that we recommend users feed the SparkLinker a Spark df, I think it's fine to go with this approach, as it will rarely trigger.
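
A minimal sketch of the approach, based on the description above (the helper name and its placement are illustrative, not the actual diff):

import numpy as np
import pandas as pd

def _fill_pandas_nulls(df: pd.DataFrame) -> pd.DataFrame:
    # Normalise pandas missing-value markers (None, pd.NA) to np.nan so
    # that createDataFrame sees a single, consistent sentinel rather than
    # a mix it can trip over during type inference.
    return df.fillna(np.nan)

# e.g. inside register_table, before handing the frame to Spark:
# spark_df = spark.createDataFrame(_fill_pandas_nulls(pandas_df))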

@github-actions
Contributor

github-actions bot commented Jun 27, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: 2.0%

| | date_time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 849 | 2022-07-12 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1800 | 2023-07-04 14:19:54 | 1.91842 | 1.91283 | (detached head) | de562fa | Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz | 2.2947 GHz | de562fa |

Test: test_2_rounds_1k_sqlite

Percentage change: 0.3%

| | date_time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 851 | 2022-07-12 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1802 | 2023-07-04 14:19:54 | 4.30157 | 4.26977 | (detached head) | de562fa | Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz | 2.2947 GHz | de562fa |

Click here for vega lite time series charts

@RobinL
Member

RobinL commented Jun 27, 2023

Looks good. Are there any other replacements we want to do?

Here was something we previously suggested:

import numpy as np
import pandas as pd

data = [
    {'a': '', 'b': pd.NA, 'c': np.nan}
]

df = pd.DataFrame(data)

# col a: convert empty/whitespace-only strings to None
df = df.replace(r'^\s*$', None, regex=True)

# cols b and c: normalise pd.NA and np.nan to None
df2 = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])

I don't think we should do the zero-length string ('') to None conversion in this PR, but how about pd.NA?

(I must admit, I don't understand the difference between np.nan and pd.NA, so I'm not sure the above code is correct...)
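
In short: np.nan is a plain float that never compares equal to itself, while pd.NA is pandas' dtype-aware missing value that propagates through comparisons. A quick illustration of the difference (stock numpy/pandas behaviour):

import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False - NaN is a float and never equals itself
print(type(np.nan))      # <class 'float'>

print(pd.NA == 1)        # <NA> - pd.NA propagates through comparisons

# Both are treated as missing by pandas
print(pd.Series([np.nan, pd.NA]).isna().tolist())  # [True, True]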

@ThomasHepworth
Contributor Author

@RobinL sorry for the delay, but could you check you're happy with the minor edit and approve?

The only adjustment we may want to make is the addition of an argument controlling whether this cleaning takes place (see the sketch below).

I'd lean towards not adding this in, since this change resolves a bug in the pandas -> Spark registration process. I can't see a case where a user would want pd.NA or np.nan registered as anything other than nulls.
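
For illustration, a hypothetical shape for that opt-out (the clean_nulls flag and this signature are invented for the sketch; splink's actual register_table differs, and the flag was deliberately not added):

import numpy as np

def register_table(spark, pandas_df, table_name, clean_nulls=True):
    # clean_nulls=True applies the fillna(np.nan) normalisation from this
    # PR; clean_nulls=False would register the frame untouched.
    if clean_nulls:
        pandas_df = pandas_df.fillna(np.nan)
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.createOrReplaceTempView(table_name)
    return spark_df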

We may also need to check this upon the release of pandas 2.0.

@RobinL RobinL left a comment
Member

Yeah, I think this is fine to do automatically. I can't think of an instance when you wouldn't want it, and if you want complete control, you can just use Spark natively to load the data.

@ThomasHepworth ThomasHepworth merged commit d89f60e into master Jul 4, 2023
@ThomasHepworth ThomasHepworth deleted the fix_spark_duckdb_registration branch July 4, 2023 19:56