Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Add comparison (level) library functions for Regular Expressions #1190

Closed
RossKen opened this issue Apr 14, 2023 · 3 comments
Closed
Labels
comparison levels enhancement New feature or request good first issue Good for newcomers

Comments

@RossKen
Copy link
Contributor

RossKen commented Apr 14, 2023

Is your proposal related to a problem?

Regular expressions are a key tool for manipulating strings (subsetting, etc). It feels like this would be a good, general capability to have as a function in the comparison level library.

Users can currently define a comparison-level dictionary with SQL expressions, but the syntax of regex functions varies across backends. E.g. (courtesy of Chat GPT)

DuckDB:
DuckDB supports the use of regular expressions in queries through the built-in SQLite-compatible functions REGEXP , REGEXP_LIKE, and REGEXP_REPLACE.
Spark SQL:
Spark SQL provides a range of built-in functions to work with regular expressions. These include functions like regexp_extract, regexp_replace, regexp_replace_all, regexp_substr, regexp_split, and rlike.
Athena SQL:
Athena SQL supports the use of regular expressions in queries through the built-in RLIKE operator.
SQLite:
SQLite has a built-in regex library that can be used through the REGEXP and REGEXP_LIKE functions.

This would be useful for users to create their own comparison levels, but also for the likes of #215 where we are creating a wrapper function to compare different sections of a postcode that needs to be compatible with multiple backends.

Describe the solution you'd like

The following would return a comparison level matching the area (e.g. SW1A 2AA gives area SW) - or something similar

cll.regexp_extract_level("postcode", "^[A-Z]{1,2}")

equivalent to regexp_extract

{
    "sql_condition": "regexp_extract(postcode, '^[A-Z]{1,2}') = regex_extract(postcode, '^[A-Z]{1,2}')",
    "label_for_charts": "Regex match on '^[A-Z]{1,2}'",
}

This could also be wrapped up into an comparison library function to generate multiple levels for different thresholds of regex match. For example:

cl.regexp_extract_at_thresholds("postcode", ["^[A-Z]{1,2}", "[0-9][A-Z]{2}$"]

equivalent to

{'output_column_name': 'postcode',
 'comparison_levels': [{'sql_condition': '"postcode_l" IS NULL OR "postcode_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"postcode_l" = "postcode_r"',
   'label_for_charts': 'Exact match'},
  {"sql_condition": "regexp_extract(postcode, '^[A-Z]{1,2}') = regex_extract(postcode, '^[A-Z]{1,2}')",
    "label_for_charts": "Regex match on '^[A-Z]{1,2}'",}
  {"sql_condition": "regexp_extract(postcode, '[0-9][A-Z]{2}$') = regex_extract(postcode, '[0-9][A-Z]{2}$')",
    "label_for_charts": "Regex match on '[0-9][A-Z]{2}$'",}
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. regex at thresholds ^[A-Z]{1,2}, [0-9][A-Z]{2}$ vs. anything else'}

Describe alternatives you've considered

The only alternative considered was to do this step in preprocessing (e.g. create multiple columns for the different parts of a postcode). Unless @zslade and I are missing something obvious, we think it is more performant to perform the regex functions at the comparison stage as it will only run on pairwise comparisons that have:

  1. Passed the blocking rules
  2. Not been captured by preceding comparison levels

However, this has be be traded off with the fact that generating these features in preprocessing means they only need to be created once vs every time you run the model.

One other consideration for doing during preprocessing is that it would allow you to perform term frequency adjustments on the sections of your string (eg postcode area). I can't think of how these would be applied when only defining the sections of a string within a comparison level.

Additional context

@RossKen RossKen added enhancement New feature or request comparison levels good first issue Good for newcomers labels Apr 14, 2023
@RossKen RossKen changed the title [FEAT] Add comparison level library functions for Regular Expressions [FEAT] Add comparison (level) library functions for Regular Expressions Apr 14, 2023
@RossKen
Copy link
Contributor Author

RossKen commented Apr 14, 2023

More info on Regex performance - https://www.loggly.com/blog/regexes-the-bad-better-best/

@RossKen
Copy link
Contributor Author

RossKen commented Apr 14, 2023

Also consider adding a regex parameter within string comparison level functions to perform on subsets of columns values.

In which case you could have

cll.exact_match_level("postcode", regex="^[A-Z]{1,2}")

instead of the regexp_extract_level function above.

@RossKen
Copy link
Contributor Author

RossKen commented May 11, 2023

Closed by #1203

@RossKen RossKen closed this as completed May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comparison levels enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

1 participant