Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix link only cartesian calc #1204

Merged
merged 6 commits into from
Apr 27, 2023
Merged

Conversation

ThomasHepworth
Copy link
Contributor

This is a follow up to PR #1193 which fixed a bug with link only jobs. It looks as if I missed the cartesian join code in that PR.

This pull request adds:

  • A fix for the cartesian join code, now allowing single dataframes with a source_dataset column to be run.
  • Adds a _verify_link_only_job to assess whether a link_only job has valid dataframe(s). This means we get a more informative error if only a single df is passed. Historically, you'd end up with a divide by 0 error.

For the divide by 0 error, you can run a link job with a single dataframe and no source_dataset column (or only a single value in the column).

@github-actions
Copy link
Contributor

github-actions bot commented Apr 26, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -7.8%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
849 2022-07-12 18:40:05 1.89098 1.87463 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1599 2023-04-27 10:48:21 1.77524 1.72841 (detached head) ace12fd Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz 2.3945 GHz ace12fd

Test: test_2_rounds_1k_sqlite

Percentage change: -8.4%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
851 2022-07-12 18:40:05 4.32179 4.25898 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1601 2023-04-27 10:48:21 3.94387 3.90201 (detached head) ace12fd Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz 2.3945 GHz ace12fd

Click here for vega lite time series charts

Copy link
Contributor

@RossKen RossKen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runs as expected (including throwing helpful warnings when the inputs are incorrect) and the changes make sense.

Happy to approve, but if you can remove that stray if statement in misc.py that would be ace

splink/misc.py Outdated Show resolved Hide resolved
splink/misc.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants