Start working on a RegEx dialect as a normalization showcase. #210

fodinabor · 2023-04-28T14:35:13Z

Regular expressions are a simple use case that showcases the power of Thorin's normalization framework.
This PR defines a set of axioms that represent ranges, any character, alternatives, sequences, and quantifiers.
These should be sufficient to define most compile-time regular expressions; `{n,m}` quantifiers, lookaheads, and similar features are still missing.

Accompanying the axioms are normalizers that simplify the RegEx.
One example of normalization is quantifier merging:
`(\d*)?`, `(\d?)*`, `(\d+)?`, `(\d?)+`, `(\d*)+`, and `(\d+)*` all normalize to `\d*`.
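
To make the quantifier-merging rule concrete, here is a minimal Python sketch. It is not the dialect's actual normalizer: the `Atom` and `Quant` classes and the `quant` smart constructor are hypothetical names made up for illustration. The rule it encodes is the one implied by the examples above: two directly nested quantifiers of different kinds collapse to `*`, and (an assumption beyond the listed examples) two identical ones collapse to a single copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """A leaf such as \\d (hypothetical stand-in for a regex class)."""
    text: str


@dataclass(frozen=True)
class Quant:
    """A quantifier applied to a body; op is one of '?', '*', '+'."""
    op: str
    body: object


def quant(op, body):
    """Smart constructor that merges directly nested quantifiers.

    Different nested quantifiers always admit zero or arbitrarily many
    repetitions, so they merge to '*'. Identical ones are idempotent.
    """
    if isinstance(body, Quant):
        merged = op if op == body.op else "*"
        return quant(merged, body.body)
    return Quant(op, body)


def show(node):
    """Print a normalized term; nested Quants never survive quant()."""
    if isinstance(node, Atom):
        return node.text
    return show(node.body) + node.op


digit = Atom(r"\d")
for outer, inner in [("?", "*"), ("*", "?"), ("?", "+"),
                     ("+", "?"), ("+", "*"), ("*", "+")]:
    print(f"({digit.text}{inner}){outer} ->", show(quant(outer, quant(inner, digit))))
```

Doing the merge in the constructor mirrors the idea of a normalizer: the simplified form is produced the moment the term is built, so later phases never see the nested version.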
For disjunctions, duplicates are removed and complementary classes (e.g. `[\s\S]`) are reduced to the any-character matcher `.`.
All classes and literals are translated to disjunctions of ranges, and the ranges are merged.
Disjunctions and conjunctions are always normalized to have exactly two arguments, which keeps the matcher implementation simpler.
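
The disjunction rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in `normalizers_regex.cpp`: characters are assumed to be 8-bit (`0..255`), ranges are inclusive `(lo, hi)` pairs, and the helper names `merge_ranges`, `normalize_disj`, and `cascade` are hypothetical. Whether the real normal form nests to the left or to the right is also an assumption here.

```python
ANY = (0, 255)  # assumption: 8-bit characters


def merge_ranges(ranges):
    """Drop duplicates and merge overlapping or adjacent inclusive ranges."""
    merged = []
    for lo, hi in sorted(set(ranges)):
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def normalize_disj(ranges):
    """Merge ranges; a disjunction covering every character becomes 'any'."""
    merged = merge_ranges(ranges)
    return "any" if merged == [ANY] else merged


def cascade(op, args):
    """Rewrite an n-ary operator as right-nested two-argument applications."""
    *init, last = args  # requires at least one argument
    result = last
    for arg in reversed(init):
        result = (op, arg, result)
    return result


# [a-fd-za-f] collapses to the single range a-z.
print(merge_ranges([(ord("a"), ord("f")), (ord("d"), ord("z")), (ord("a"), ord("f"))]))

# \s and its complement \S together cover everything.
space = [(9, 13), (32, 32)]
non_space = [(0, 8), (14, 31), (33, 255)]
print(normalize_disj(space + non_space))

print(cascade("disj", ["a", "b", "c"]))
```

With every disjunction and conjunction in two-argument form, a matcher only needs one case per operator instead of a loop over an arbitrary number of operands.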

There is also a phase that recursively replaces the applied regex with fairly primitive matcher functions, so the regex can actually be matched.
Note that currently only deterministic regexes can be matched: `\w\w+` works, but `\w+\w` doesn't.
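
One way to see why `\w\w+` works while `\w+\w` does not is a greedy matcher that never backtracks. The Python sketch below is an illustration of that behaviour only; it reads "deterministic" as "matchable without backtracking", which is an interpretation, and it is not how the lowering phase in this PR is implemented. The pattern encoding (a list of `(predicate, quantifier)` pairs) and the names `is_word` and `match_greedy` are hypothetical.

```python
def is_word(c):
    """Rough stand-in for \\w."""
    return c.isalnum() or c == "_"


def match_greedy(pattern, text):
    """Full-match `text` against `pattern` without any backtracking.

    `pattern` is a list of (pred, quant) pairs, where quant is one of
    "1" (exactly once), "?", "*", or "+". Each element consumes as many
    characters as it can and never gives any of them back.
    """
    pos = 0
    for pred, q in pattern:
        limit = 1 if q in ("1", "?") else len(text)
        count = 0
        while pos < len(text) and count < limit and pred(text[pos]):
            pos += 1
            count += 1
        if q in ("1", "+") and count == 0:
            return False  # a mandatory element matched nothing
    return pos == len(text)


# \w\w+ : the single \w takes one character, \w+ takes the rest.
print(match_greedy([(is_word, "1"), (is_word, "+")], "abc"))

# \w+\w : \w+ swallows the whole input, leaving nothing for the final \w.
print(match_greedy([(is_word, "+"), (is_word, "1")], "abc"))
```

A backtracking engine would retry `\w+` with one character fewer and succeed; a matcher built from straight-line primitive functions has no such retry, which is why the order of the two elements matters.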

https://github.com/fodinabor/thorin_regex_benchmark contains a benchmark comparing Thorin's matcher against CTRE, PCRE2, and std::regex:

engine,average[us],min[us],max[us],deviation[%],runs[us]
pcre2_jit,  1032,   1026,   1046,   1,  1033 1032 1041 1046 1037 1027 1026 1026 1028 1029 
ctre,       4053,   4049,   4065,   0,  4051 4059 4052 4051 4049 4050 4065 4051 4050 4051 
thorin,     909,    893,    1017,   13, 897 893 1017 902 901 894 897 894 899 896 
pcre2,      3028,   3016,   3071,   1,  3071 3029 3016 3019 3027 3026 3028 3025 3020 3024 
std,        9232,   8837,   10141,  14, 9204 10141 8837 9028 9186 9139 9153 9173 9167 9294

These numbers look quite favorable for Thorin on this regex benchmark.
The compilation time of the Thorin matcher also looks good (note that PCRE2 compiles the regex at runtime, which neither metric measures):

time ../thorin_mail_bench/thorin2/build/install/bin/thorin ../thorin_mail_bench/match_mail.thorin --output-ll ../thorin_mail_bench/build/thorin_match.ll --aggr-lam-spec
real    0m0,154s
time /usr/bin/clang++ thorin_match.ll -c -o ../thorin_mail_bench/build/thorin_match.o
real    0m0,049s
time /usr/bin/clang++  ... ctre_match_mail.cpp
real    0m1,566s
time /usr/bin/clang++  ... pcre2_match_mail.cpp
real    0m0,047s
time /usr/bin/clang++  ... std_match_mail.cpp
real    0m2,342s

leissa

fantastic work :)

fodinabor marked this pull request as draft April 28, 2023 14:35

leissa mentioned this pull request May 2, 2023

Support for char & string literals #211

Merged

fodinabor added 3 commits May 5, 2023 13:54

Start working on a RegEx dialect as a normalization showcase.

c984384

Stable sort & CMake deps.

79830a3

Doxygen fix.

4508968

fodinabor force-pushed the feature/regex-dialect branch from bb305ea to 4508968 May 5, 2023 11:57

Merge remote-tracking branch 'origin/feature/extensions' into feature/regex-dialect

8d3d64c

leissa mentioned this pull request May 12, 2023

Bugfix/ds2csp #216

Merged

fodinabor added 3 commits May 13, 2023 10:08

Merge remote-tracking branch 'origin/bugfix/ds2csp' into feature/regex-dialect

efb8100

Merge remote-tracking branch 'origin/master' into feature/regex-dialect

4e01ee1

Add lower_regex phase.

9dff6d8

This phase replaces the regex axioms with calls to annexes that implement the regex matching. Thus far, the `cls`s and `conj` are implemented. Note that `conj` only works with uniform `cls`s as parameters so far.

leissa mentioned this pull request May 18, 2023

Alpha equiv 2 - Electric Boogaloo #219

Merged

fodinabor added 18 commits May 19, 2023 12:28

Cleanup cls_w_impl.

1ce21b6

Set curry level for regex.lit and regex.cls.

b65bd1f

Fix regex normalizer tests.

643a30f

Cleanup match.thorin.

24fea51

Maybe fix dependencies..

c9731ed

Support conj with different clss.

02476b9

Remove ; from CMake..

53c0a55

Lower regex early.

c25fe70

Change conj normal form to be 2 element cascaded.

572bec0

This allows PE of the impl without needing copy prop through branches.

Implement match_disj and match_lit.

c3881f5

Also make `disj` normalization fold `lit`s into already present `cls`s

Impl quantifiers.

1c03a74

Enjoy with caution. :)

Make regex FileCheck more useful.

dc480ed

WIP: add regex.range and consolidate.

aad84e9

Cleanup normalizers & regex.thorin.

f4129de

Use lets for regex.cls.*

03cfa3e

Use aggressive lamspec for regex matches.

d5e4ff7

Fix normalizers with ranges.

Make disj/not merging less computational complex.

e56deca

Use simpler, deterministic mail regex.

c0610f7

fodinabor added 7 commits June 2, 2023 17:01

Eliminate non-optimizable extracts.

a9dabe0

Allow local install.

1665105

Cleanup.

7c8edc7

Merge remote-tracking branch 'origin/master' into feature/regex-dialect

1f12e36

Cleanup a few !s.

581d2de

Probably not exhaustive..

First construction of regex.lit is not PE'd yet.

8958c1c

Don't throw from plugin..

2e7cbe8

fodinabor changed the title ~~WIP: Start working on a RegEx dialect as a normalization showcase.~~ Start working on a RegEx dialect as a normalization showcase. Jun 9, 2023

fodinabor marked this pull request as ready for review June 9, 2023 12:00

leissa reviewed Jun 20, 2023

dialects/regex/normalizers_regex.cpp (outdated, resolved)

dialects/regex/normalizers_regex.cpp (outdated, resolved)

dialects/regex/regex.thorin (outdated, resolved)

fodinabor added 3 commits June 22, 2023 13:37

[RegEx] Codestyle fixes. Rename quantifiers.

e0530bc

Fix install of config.h

90f735a

Fix THORIN_INSTALL_DEPENDENCIES ..

64860ed

leissa merged commit ccc2e55 into master Jun 22, 2023

leissa deleted the feature/regex-dialect branch June 22, 2023 13:58
