Start working on a RegEx dialect as a normalization showcase. #210

Merged: 35 commits into master from feature/regex-dialect on Jun 22, 2023

@fodinabor fodinabor commented Apr 28, 2023

A simple use case that showcases the power of Thorin's normalization framework is regular expressions.
This PR defines a set of axioms that represent ranges, any character, alternatives, sequences, and quantifiers.
These should be sufficient to define most compile-time regular expressions ({n,m} quantifiers, lookaheads, etc. are still missing at the moment).

Accompanying the axioms are normalizers that simplify the RegEx.
An example of normalization is quantifier merging:
(\d*)?, (\d?)*, (\d+)?, (\d?)+, (\d*)+, (\d+)* all normalize to \d*.
For disjunctions, duplicates are removed and complementary classes (e.g. [\s\S]) are reduced to the match-any-character axiom ".".
All classes and literals are translated to disjunctions of ranges, and the ranges are merged.
Disjunctions and conjunctions are always normalized to take exactly two arguments, which keeps the matcher implementation simpler.
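
To make the quantifier and range rules above concrete, here is a minimal Python sketch. It is not the code from dialects/regex/normalizers_regex.cpp: the `Quant` class, `quantify`, and `merge_ranges` are hypothetical names, and two details are assumptions rather than statements from this PR (that a quantifier applied to the same quantifier is left unchanged, and that directly adjacent ranges are merged as well as overlapping ones).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quant:
    kind: str     # one of "*", "+", "?"
    body: object  # the quantified sub-expression (hypothetical representation)


def quantify(kind, body):
    """Apply quantifier `kind` to an already-normalized `body`."""
    if isinstance(body, Quant):
        if body.kind == kind:
            # Assumption: (x*)*, (x+)+ and (x?)? collapse to the inner quantifier.
            return body
        # Any two different quantifiers nested in either order accept
        # "zero or more" repetitions, i.e. they are equivalent to x*.
        return Quant("*", body.body)
    return Quant(kind, body)


def merge_ranges(ranges):
    """Merge overlapping or adjacent inclusive (lo, hi) code point ranges."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```

With `digit = (48, 57)`, both `quantify("?", quantify("*", digit))` and `quantify("+", quantify("?", digit))` return `Quant("*", digit)`, and `merge_ranges([(97, 122), (48, 57), (50, 60)])` returns `[(48, 60), (97, 122)]`.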

There is also a phase that recursively replaces the applied regex with fairly primitive matcher functions, so the regex can actually be matched.
Note that currently only deterministic regexes can be matched: \w\w+ works, but \w+\w doesn't.
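
One plausible way to see why \w\w+ works while \w+\w does not is a matcher that consumes input greedily and never backtracks. The Python sketch below illustrates only that idea. It is not the matcher this PR generates, and `is_word`, `match_greedy`, and the `(predicate, quantifier)` pattern encoding are hypothetical.

```python
def is_word(c):
    # Rough stand-in for \w; str.isalnum also accepts non-ASCII letters.
    return c.isalnum() or c == "_"


def match_greedy(pattern, text):
    """Match `text` against a sequence of (predicate, quantifier) pairs.

    Quantifiers are "1" (exactly once), "?", "*" and "+". Every element
    consumes as much input as it can and nothing is ever given back.
    """
    pos = 0
    for pred, quant in pattern:
        start = pos
        if quant in ("1", "?"):
            if pos < len(text) and pred(text[pos]):
                pos += 1
        else:  # "*" or "+"
            while pos < len(text) and pred(text[pos]):
                pos += 1
        if quant in ("1", "+") and pos == start:
            return False  # a mandatory element matched nothing
    return pos == len(text)


word_then_words = [(is_word, "1"), (is_word, "+")]  # \w\w+
words_then_word = [(is_word, "+"), (is_word, "1")]  # \w+\w
```

Here `match_greedy(word_then_words, "ab")` succeeds, while `match_greedy(words_then_word, "ab")` fails: the leading `\w+` has already consumed both characters, so nothing is left for the trailing `\w`, and without backtracking the matcher cannot hand one back.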

https://github.com/fodinabor/thorin_regex_benchmark contains a benchmark comparing Thorin's matcher against CTRE, PCRE2, and std::regex:

engine,    average[us], min[us], max[us], deviation[%], runs[us]
pcre2_jit, 1032,        1026,    1046,    1,            1033 1032 1041 1046 1037 1027 1026 1026 1028 1029
ctre,      4053,        4049,    4065,    0,            4051 4059 4052 4051 4049 4050 4065 4051 4050 4051
thorin,    909,         893,     1017,    13,           897 893 1017 902 901 894 897 894 899 896
pcre2,     3028,        3016,    3071,    1,            3071 3029 3016 3019 3027 3026 3028 3025 3020 3024
std,       9232,        8837,    10141,   14,           9204 10141 8837 9028 9186 9139 9153 9173 9167 9294

This looks pretty favorable for Thorin on this regex benchmark: its average run time is about 12% below pcre2_jit and roughly 3x to 10x below the other engines.
Similarly, the compile time of the Thorin matcher looks good (note that pcre2 compiles the regex at runtime, which is not measured by either of the metrics):

time ../thorin_mail_bench/thorin2/build/install/bin/thorin ../thorin_mail_bench/match_mail.thorin --output-ll ../thorin_mail_bench/build/thorin_match.ll --aggr-lam-spec
real    0m0,154s
time /usr/bin/clang++ thorin_match.ll -c -o ../thorin_mail_bench/build/thorin_match.o
real    0m0,049s
time /usr/bin/clang++  ... ctre_match_mail.cpp
real    0m1,566s
time /usr/bin/clang++  ... pcre2_match_mail.cpp
real    0m0,047s
time /usr/bin/clang++  ... std_match_mail.cpp
real    0m2,342s
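
The figures above can be cross-checked with a few lines of Python. This is only a convenience sketch over the numbers quoted in this comment. The dictionary names are hypothetical, and adding the `thorin` and `clang++` steps into a single Thorin compile time is an assumption about how to compare the pipelines.

```python
# Per-run timings in microseconds, copied from the benchmark table above.
runs_us = {
    "pcre2_jit": [1033, 1032, 1041, 1046, 1037, 1027, 1026, 1026, 1028, 1029],
    "ctre": [4051, 4059, 4052, 4051, 4049, 4050, 4065, 4051, 4050, 4051],
    "thorin": [897, 893, 1017, 902, 901, 894, 897, 894, 899, 896],
    "pcre2": [3071, 3029, 3016, 3019, 3027, 3026, 3028, 3025, 3020, 3024],
    "std": [9204, 10141, 8837, 9028, 9186, 9139, 9153, 9173, 9167, 9294],
}

# Average run time per engine and its ratio to Thorin's average.
averages = {engine: sum(runs) / len(runs) for engine, runs in runs_us.items()}
relative = {engine: avg / averages["thorin"] for engine, avg in averages.items()}

# Wall-clock compile times in seconds ("real" values from the listing above).
# Assumption: Thorin's total is the thorin step plus the clang++ step.
# The pcre2 value excludes its runtime regex compilation.
compile_s = {
    "thorin": 0.154 + 0.049,
    "ctre": 1.566,
    "pcre2": 0.047,
    "std": 2.342,
}

for engine in runs_us:
    print(f"{engine:10s} avg {averages[engine]:8.1f} us  x{relative[engine]:.2f} vs thorin")
```

Under that assumption the Thorin pipeline takes about 0.2 s in total, compared with 1.566 s for the CTRE build and 2.342 s for the std::regex build.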

@fodinabor fodinabor marked this pull request as draft April 28, 2023 14:35
@fodinabor fodinabor force-pushed the feature/regex-dialect branch from bb305ea to 4508968 Compare May 5, 2023 11:57
@leissa leissa mentioned this pull request May 12, 2023
fodinabor added 3 commits May 13, 2023 10:08
This phase replaces the regex axioms with calls to annexes that implement the regex matching.
Thus far the `cls`s and `conj` are implemented.
Note that `conj` only works with uniform `cls`s as parameters so far.
@fodinabor fodinabor changed the title WIP: Start working on a RegEx dialect as a normalization showcase. Start working on a RegEx dialect as a normalization showcase. Jun 9, 2023
@fodinabor fodinabor marked this pull request as ready for review June 9, 2023 12:00
@leissa leissa left a comment


fantastic work :)

@leissa leissa merged commit ccc2e55 into master Jun 22, 2023
@leissa leissa deleted the feature/regex-dialect branch June 22, 2023 13:58
2 participants