1030 option for auto typecasting datediff #1162
Test: test_2_rounds_1k_duckdb. Percentage change: -31.4%
Test: test_2_rounds_1k_sqlite. Percentage change: -24.5%
.vscode/launch.json
Remove this file
.DS_Store
Remove this file
Really good work! 🎉 Everything seems to be working as expected - have left some comments on the specific files where a little bit of cleaning up/documentation is needed.
One additional thing - could you add your extra parameters to the date_comparison() function in comparison_template as well? The typecasting is a really useful feature so it would be great to make the most of it wherever we are using datediffs.
splink/spark/spark_base.py
Looks good
testing_datediff.ipynb
remove from pr
testing_splink_spark.ipynb
remove from pr
tests/.DS_Store
remove from pr
tests/test_datediff_level.py
This makes sense and passes - good work!
Need to be a bit careful here I think: if we need this to make our tests pass, it means others might need it to use Splink. We'd probably rather they didn't have to change the default, because it's another thing to document and it might be incompatible with their other (non-Splink) code. Was just skimming though, so might have got the wrong end of the stick.
The general pattern I've tried to go for elsewhere in Splink is to use whatever Spark config is provided by the user, rather than needing specific Spark config to make Splink work.
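A minimal sketch of that pattern: read the user's setting and warn, but never change it. The function name, the plain dict standing in for the Spark configuration, and the warning text are all hypothetical, and the assumption that the relevant setting is `spark.sql.ansi.enabled` is inferred from the `_check_ansi_enabled_if_converting_dates` method name in this PR rather than confirmed by it.

```python
import warnings

# Assumed key: Spark's ANSI mode flag. With ANSI mode off, to_timestamp
# yields NULL for strings it cannot parse instead of raising an error.
ANSI_KEY = "spark.sql.ansi.enabled"


def warn_if_ansi_disabled(spark_conf, sql_condition):
    """Warn (without reconfiguring anything) if dates are cast with ANSI off.

    spark_conf is a plain dict standing in for the user's Spark config.
    Returns True if a warning was issued, otherwise False.
    """
    casts_dates = "to_timestamp(" in sql_condition.lower()
    ansi_enabled = str(spark_conf.get(ANSI_KEY, "false")).lower() == "true"
    if casts_dates and not ansi_enabled:
        warnings.warn(
            f"{ANSI_KEY} is not 'true': date strings that cannot be parsed "
            "will become NULL rather than raising an error. Consider "
            "cleaning the date strings before linking.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

With a real session, the dict could be filled from the user's own configuration, for example via `spark.conf.get("spark.sql.ansi.enabled")`, so Splink only observes the setting and leaves the decision to the user.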
splink/duckdb/duckdb_base.py
Instead of doing

    if date_format is None:
        date_format = '%x'

I think it may be better to change the default in the cl and cll functions to date_format = '%x' rather than None, so it is clearer to the user how dates are being parsed, rather than being hidden in this different function.
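To illustrate the point, here is a small sketch contrasting the two approaches. The function names are hypothetical stand-ins for the cl and cll functions, and the bodies are reduced to returning the format that would be used.

```python
import inspect


def hidden_default(date_format=None):
    """Pattern questioned above: the effective default is set in the body."""
    if date_format is None:
        date_format = "%x"
    return date_format


def explicit_default(date_format="%x"):
    """Suggested alternative: the default is visible in the signature."""
    return date_format


# Both behave identically when called with no arguments...
assert hidden_default() == explicit_default() == "%x"

# ...but only the explicit version tells the caller what will be used,
# for example in help() output or generated API docs.
print(inspect.signature(hidden_default))    # (date_format=None)
print(inspect.signature(explicit_default))  # (date_format='%x')
```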
splink/spark/spark_linker.py
# def load_settings(self, settings_dict):
#     # call parent method Linker.load_settings
#     super().load_settings(settings_dict)
#     # if that worked okay then the linker should have `_settings_obj_` and `_settings_dict` set
#     # now warn if we need to:
#     self._check_ansi_enabled_if_converting_dates()
Should these lines be deleted?
Also, per the CI checks - there are a few linting errors. I think they are mostly to do with line length so if you just run
Rebasing datecasting code. SQL working in DuckDB and Spark but need to tweak settings initialization Linker method for Spark
Please ignore recent activity - didn't realise this would push things through the test suite again! Have made the changes above and was trying to resolve some horrible merge conflicts before running the linter.
…m:moj-analytical-services/splink into 1030_option_for_auto_typecasting_datediff
Thanks for making those changes @aliceoleary0. The examples in the docs and the spark warning have really helped 👍
This is good to merge from my perspective!
if cast_str:
    if date_metric == "day":
        date_f = f"""abs(datediff(to_timestamp({col_name_l},
        '{date_format}'),to_timestamp({col_name_r},'{date_format}')))"""
    elif date_metric in ["month", "year"]:
        date_f = f"""floor(abs(months_between(to_timestamp({col_name_l},
        '{date_format}'),to_timestamp({col_name_r}, '{date_format}'))"""
        if date_metric == "year":
            date_f += " / 12))"
        else:
            date_f += "))"
else:
    if date_metric == "day":
        date_f = f"abs(datediff({col_name_l}, {col_name_r}))"
    elif date_metric in ["month", "year"]:
        date_f = f"ceil(abs(months_between({col_name_l}, {col_name_r})"
        if date_metric == "year":
            date_f += " / 12))"
        else:
            date_f += "))"
Sorry, I've been nosey and had a quick poke around this PR.
I think the following may be slightly cleaner/easier to read:
if cast_str:
    col_name_l = f"to_timestamp({col_name_l}, '{date_format}')"
    col_name_r = f"to_timestamp({col_name_r}, '{date_format}')"
if date_metric == "day":
    date_f = f"abs(datediff({col_name_l}, {col_name_r}))"
elif date_metric in ["month", "year"]:
    date_f = f"ceil(abs(months_between({col_name_l}, {col_name_r})"
    if date_metric == "year":
        date_f += " / 12))"
    else:
        date_f += "))"
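For anyone wanting to try the suggestion in isolation, here is a self-contained sketch that wraps it in a function. The function name, the argument order, and the `yyyy-MM-dd` default format are assumptions made for illustration, not the Splink API. Note one behavioural difference worth confirming before adopting it: the original string-casting branch used `floor` for months and years, whereas the suggestion (and this sketch) uses `ceil` in both cases.

```python
def datediff_sql(col_name_l, col_name_r, date_metric,
                 cast_str=False, date_format="yyyy-MM-dd"):
    """Build a Spark SQL expression for the absolute date difference.

    Hypothetical helper: wraps the suggested refactor so it can be run
    and inspected without a Spark session.
    """
    if date_metric not in ("day", "month", "year"):
        raise ValueError(f"Unsupported date_metric: {date_metric!r}")

    # Cast once up front, so the rest of the logic is shared by both paths.
    if cast_str:
        col_name_l = f"to_timestamp({col_name_l}, '{date_format}')"
        col_name_r = f"to_timestamp({col_name_r}, '{date_format}')"

    if date_metric == "day":
        return f"abs(datediff({col_name_l}, {col_name_r}))"

    # Months and years share a prefix; years divide the month count by 12.
    date_f = f"ceil(abs(months_between({col_name_l}, {col_name_r})"
    date_f += " / 12))" if date_metric == "year" else "))"
    return date_f


print(datediff_sql("dob_l", "dob_r", "day"))
print(datediff_sql("dob_l", "dob_r", "year", cast_str=True))
```

Reassigning the column names before building the expression is what removes the duplicated branches: the casting decision is made once, and the metric-specific SQL no longer needs to know about it.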
I agree - this is a lot nicer, and thanks for the suggestion! How do I integrate this change? Is there a way to reopen the pull request?
Just open a new PR and ask Ross for a quick review 😊
Adding date-casting so date comparisons can accept date inputs as strings