[FEAT] Add comparison (level) library functions for Regular Expressions #1190

RossKen · 2023-04-14T15:26:08Z

Is your proposal related to a problem?

Regular expressions are a key tool for manipulating strings (subsetting, etc). It feels like this would be a good, general capability to have as a function in the comparison level library.

Users can currently define a comparison-level dictionary with SQL expressions, but the syntax of regex functions varies across backends. E.g. (courtesy of Chat GPT)

DuckDB:
DuckDB supports the use of regular expressions in queries through the built-in SQLite-compatible functions REGEXP , REGEXP_LIKE, and REGEXP_REPLACE.
Spark SQL:
Spark SQL provides a range of built-in functions to work with regular expressions. These include functions like regexp_extract, regexp_replace, regexp_replace_all, regexp_substr, regexp_split, and rlike.
Athena SQL:
Athena SQL supports the use of regular expressions in queries through the built-in RLIKE operator.
SQLite:
SQLite has a built-in regex library that can be used through the REGEXP and REGEXP_LIKE functions.

This would be useful for users to create their own comparison levels, but also for the likes of #215 where we are creating a wrapper function to compare different sections of a postcode that needs to be compatible with multiple backends.

Describe the solution you'd like

The following would return a comparison level matching the area (e.g. SW1A 2AA gives area SW) - or something similar

cll.regexp_extract_level("postcode", "^[A-Z]{1,2}")

equivalent to regexp_extract

{
    "sql_condition": "regexp_extract(postcode, '^[A-Z]{1,2}') = regex_extract(postcode, '^[A-Z]{1,2}')",
    "label_for_charts": "Regex match on '^[A-Z]{1,2}'",
}

This could also be wrapped up into an comparison library function to generate multiple levels for different thresholds of regex match. For example:

cl.regexp_extract_at_thresholds("postcode", ["^[A-Z]{1,2}", "[0-9][A-Z]{2}$"]

equivalent to

{'output_column_name': 'postcode',
 'comparison_levels': [{'sql_condition': '"postcode_l" IS NULL OR "postcode_r" IS NULL',
   'label_for_charts': 'Null',
   'is_null_level': True},
  {'sql_condition': '"postcode_l" = "postcode_r"',
   'label_for_charts': 'Exact match'},
  {"sql_condition": "regexp_extract(postcode, '^[A-Z]{1,2}') = regex_extract(postcode, '^[A-Z]{1,2}')",
    "label_for_charts": "Regex match on '^[A-Z]{1,2}'",}
  {"sql_condition": "regexp_extract(postcode, '[0-9][A-Z]{2}$') = regex_extract(postcode, '[0-9][A-Z]{2}$')",
    "label_for_charts": "Regex match on '[0-9][A-Z]{2}$'",}
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'Exact match vs. regex at thresholds ^[A-Z]{1,2}, [0-9][A-Z]{2}$ vs. anything else'}

Describe alternatives you've considered

The only alternative considered was to do this step in preprocessing (e.g. create multiple columns for the different parts of a postcode). Unless @zslade and I are missing something obvious, we think it is more performant to perform the regex functions at the comparison stage as it will only run on pairwise comparisons that have:

Passed the blocking rules
Not been captured by preceding comparison levels

However, this has be be traded off with the fact that generating these features in preprocessing means they only need to be created once vs every time you run the model.

One other consideration for doing during preprocessing is that it would allow you to perform term frequency adjustments on the sections of your string (eg postcode area). I can't think of how these would be applied when only defining the sections of a string within a comparison level.

Additional context

The text was updated successfully, but these errors were encountered:

RossKen · 2023-04-14T16:20:58Z

More info on Regex performance - https://www.loggly.com/blog/regexes-the-bad-better-best/

RossKen · 2023-04-14T18:41:41Z

Also consider adding a regex parameter within string comparison level functions to perform on subsets of columns values.

In which case you could have

cll.exact_match_level("postcode", regex="^[A-Z]{1,2}")

instead of the regexp_extract_level function above.

RossKen · 2023-05-11T17:43:17Z

Closed by #1203

RossKen added enhancement New feature or request comparison levels good first issue Good for newcomers labels Apr 14, 2023

RossKen changed the title ~~[FEAT] Add comparison level library functions for Regular Expressions~~ [FEAT] Add comparison (level) library functions for Regular Expressions Apr 14, 2023

This was referenced Apr 14, 2023

[FEAT] Create CTL function for email addresses #1176

Closed

Add default postcode comparison function #215

Closed

zslade mentioned this issue Apr 25, 2023

Add regex extract functionality to comparisons #1203

Merged

RossKen closed this as completed May 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEAT] Add comparison (level) library functions for Regular Expressions #1190

[FEAT] Add comparison (level) library functions for Regular Expressions #1190

RossKen commented Apr 14, 2023 •

edited

Loading

RossKen commented Apr 14, 2023

RossKen commented Apr 14, 2023

RossKen commented May 11, 2023

[FEAT] Add comparison (level) library functions for Regular Expressions #1190

[FEAT] Add comparison (level) library functions for Regular Expressions #1190

Comments

RossKen commented Apr 14, 2023 • edited Loading

Is your proposal related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

RossKen commented Apr 14, 2023

RossKen commented Apr 14, 2023

RossKen commented May 11, 2023

RossKen commented Apr 14, 2023 •

edited

Loading