Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1030 option for auto typecasting datediff #1162

Merged
merged 28 commits into from
Apr 25, 2023

Conversation

aliceoleary0
Copy link
Contributor

Adding date-casting so date comparisons can accept date inputs as strings

@github-actions
Copy link
Contributor

github-actions bot commented Mar 31, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -31.4%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
849 2022-07-12 18:40:05 1.89098 1.87463 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1587 2023-04-24 15:21:04 1.31726 1.28662 (detached head) 12f22ce Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz 2.5939 GHz 12f22ce

Test: test_2_rounds_1k_sqlite

Percentage change: -24.5%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
851 2022-07-12 18:40:05 4.32179 4.25898 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1589 2023-04-24 15:21:04 3.21643 3.21365 (detached head) 12f22ce Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz 2.5939 GHz 12f22ce

Click here for vega lite time series charts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this file

.DS_Store Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this file

Copy link
Contributor

@RossKen RossKen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good work! 🎉 Everything seems to be working as expected - have left some comments on the specific files where a little bit of cleaning up/documentation is needed.

One additional thing - could you add your extra parameters to the date_comparison() function in comparison_template as well? The typecasting is a really useful feature so it would be great to make the most of it wherever we are using datediffs.

splink/comparison_level_library.py Outdated Show resolved Hide resolved
splink/comparison_library.py Outdated Show resolved Hide resolved
splink/linker.py Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove from pr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove from pr

tests/.DS_Store Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove from pr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense and passes - good work!

Copy link
Member

@RobinL RobinL Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to be a bit careful here I think: If we need this to make our tests pass, it means others might need it to use Splink. We'd prob rather avoid them needing to change the default if we can avoid it, because it's another thing to document and it might be incompatible with their other (non Splink) code. Was just skimming though so might have got the wrong end of the stick.

The general pattern I've tried to go for elsewhere in splink is to try and use whatever spark config is provided by the user rather than than needing to do specific spark conflict to make splink work

Copy link
Contributor

@RossKen RossKen Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of doing

if date_format is None:
        date_format = '%x'

I think it may be better to change the default in cl and cll functions to date_format = '%x' rather than None so it is clearer to the user how dates are being parsed, rather than being hidden in this different function

Comment on lines 523 to 528
# def load_settings(self, settings_dict):
# # call parent method Linker.load_settings
# super().load_settings(settings_dict)
# # if that worked okay then the linker should have `_settings_obj_` and `_settings_dict` set
# # now warn if we need to:
# self._check_ansi_enabled_if_converting_dates()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these lines be deleted?

@RossKen
Copy link
Contributor

RossKen commented Mar 31, 2023

Also, per the CI checks - there are a few linting errors. I think they are mostly to do with line length so if you just run source scripts/lint.sh -f # with the fix flag locally (or look in the CI logs) you should be able to find them and fix them pretty easily

@aliceoleary0
Copy link
Contributor Author

Please ignore recent activity - didn't realise this would push things through test suite again! Have made the changes above and was trying to resolve some horrible merge conflicts before running the linter

Copy link
Contributor

@RossKen RossKen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making those changes @aliceoleary0. The examples in the docs and the spark warning have really helped 👍

This is good to merge from my perspective!

@aliceoleary0 aliceoleary0 merged commit 3a9e09d into master Apr 25, 2023
Comment on lines +21 to +40
if cast_str:
if date_metric == "day":
date_f = f"""abs(datediff(to_timestamp({col_name_l},
'{date_format}'),to_timestamp({col_name_r},'{date_format}')))"""
elif date_metric in ["month", "year"]:
date_f = f"""floor(abs(months_between(to_timestamp({col_name_l},
'{date_format}'),to_timestamp({col_name_r}, '{date_format}'))"""
if date_metric == "year":
date_f += " / 12))"
else:
date_f += "))"
else:
if date_metric == "day":
date_f = f"abs(datediff({col_name_l}, {col_name_r}))"
elif date_metric in ["month", "year"]:
date_f = f"ceil(abs(months_between({col_name_l}, {col_name_r})"
if date_metric == "year":
date_f += " / 12))"
else:
date_f += "))"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I've been nosey and had a quick poke around this PR.

I think the following may be slightly cleaner/easier to read:

Suggested change
if cast_str:
if date_metric == "day":
date_f = f"""abs(datediff(to_timestamp({col_name_l},
'{date_format}'),to_timestamp({col_name_r},'{date_format}')))"""
elif date_metric in ["month", "year"]:
date_f = f"""floor(abs(months_between(to_timestamp({col_name_l},
'{date_format}'),to_timestamp({col_name_r}, '{date_format}'))"""
if date_metric == "year":
date_f += " / 12))"
else:
date_f += "))"
else:
if date_metric == "day":
date_f = f"abs(datediff({col_name_l}, {col_name_r}))"
elif date_metric in ["month", "year"]:
date_f = f"ceil(abs(months_between({col_name_l}, {col_name_r})"
if date_metric == "year":
date_f += " / 12))"
else:
date_f += "))"
if cast_str:
col_name_l = f"to_timestamp({col_name_l}, '{date_format}')"
col_name_r = f"to_timestamp({col_name_r}, '{date_format}')"
if date_metric == "day":
date_f = f"abs(datediff({col_name_l}, {col_name_r}))"
elif date_metric in ["month", "year"]:
date_f = f"ceil(abs(months_between({col_name_l}, {col_name_r})"
if date_metric == "year":
date_f += " / 12))"
else:
date_f += "))"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree- this is a lot nicer and thanks for the suggestion! How do I integrate this change/ is there a way to reopen the pull request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just open a new PR and ask Ross for a quick review 😊

@RobinL RobinL deleted the 1030_option_for_auto_typecasting_datediff branch August 12, 2024 10:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants