-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Add comparison (level) library functions for Regular Expressions #1190
Labels
Comments
RossKen
added
enhancement
New feature or request
comparison levels
good first issue
Good for newcomers
labels
Apr 14, 2023
RossKen
changed the title
[FEAT] Add comparison level library functions for Regular Expressions
[FEAT] Add comparison (level) library functions for Regular Expressions
Apr 14, 2023
This was referenced Apr 14, 2023
More info on Regex performance - https://www.loggly.com/blog/regexes-the-bad-better-best/ |
Also consider adding a regex parameter within string comparison level functions to perform on subsets of columns values. In which case you could have
instead of the |
Closed by #1203 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Is your proposal related to a problem?
Regular expressions are a key tool for manipulating strings (subsetting, etc). It feels like this would be a good, general capability to have as a function in the comparison level library.
Users can currently define a comparison-level dictionary with SQL expressions, but the syntax of regex functions varies across backends. E.g. (courtesy of Chat GPT)
DuckDB:
DuckDB supports the use of regular expressions in queries through the built-in SQLite-compatible functions
REGEXP
,REGEXP_LIKE
, andREGEXP_REPLACE
.Spark SQL:
Spark SQL provides a range of built-in functions to work with regular expressions. These include functions like
regexp_extract
,regexp_replace
,regexp_replace_all
,regexp_substr
,regexp_split
, andrlike
.Athena SQL:
Athena SQL supports the use of regular expressions in queries through the built-in
RLIKE
operator.SQLite:
SQLite has a built-in regex library that can be used through the
REGEXP
andREGEXP_LIKE
functions.This would be useful for users to create their own comparison levels, but also for the likes of #215 where we are creating a wrapper function to compare different sections of a postcode that needs to be compatible with multiple backends.
Describe the solution you'd like
The following would return a comparison level matching the area (e.g. SW1A 2AA gives area SW) - or something similar
equivalent to regexp_extract
This could also be wrapped up into an comparison library function to generate multiple levels for different thresholds of regex match. For example:
equivalent to
Describe alternatives you've considered
The only alternative considered was to do this step in preprocessing (e.g. create multiple columns for the different parts of a postcode). Unless @zslade and I are missing something obvious, we think it is more performant to perform the regex functions at the comparison stage as it will only run on pairwise comparisons that have:
However, this has be be traded off with the fact that generating these features in preprocessing means they only need to be created once vs every time you run the model.
One other consideration for doing during preprocessing is that it would allow you to perform term frequency adjustments on the sections of your string (eg postcode area). I can't think of how these would be applied when only defining the sections of a string within a comparison level.
Additional context
The text was updated successfully, but these errors were encountered: