MimIR E-Mail Address RegEx Benchmark

This repo contains benchmarks comparing the performance of MimIR's regex dialect against CTRE, PCRE2 and std::regex.

All benchmarks match e-mail addresses harvested from a spam e-mail dataset (https://www.kaggle.com/datasets/rtatman/fraudulent-email-corpus). The regex to be matched is:

^[a-zA-Z0-9](?:[a-zA-Z0-9]*[._\-]+[a-zA-Z0-9])*[a-zA-Z0-9]*@[a-zA-Z0-9](?:[a-zA-Z0-9]*[_\-]+[a-zA-Z0-9])*[a-zA-Z0-9]*\.(?:(?:[a-zA-Z0-9]*[_\-]+[a-zA-Z0-9])*[a-zA-Z0-9]+\.)*[a-zA-Z][a-zA-Z]+$
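To get a feel for what this regex accepts and rejects, here is a minimal sketch that runs it through Python's built-in `re` module. Python's `re` is only a stand-in for illustration; it is not one of the benchmarked engines, and the sample addresses are made up for this example rather than taken from the dataset.

```python
import re

# The benchmark regex, copied verbatim from above.
EMAIL = re.compile(
    r"^[a-zA-Z0-9](?:[a-zA-Z0-9]*[._\-]+[a-zA-Z0-9])*[a-zA-Z0-9]*"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9]*[_\-]+[a-zA-Z0-9])*[a-zA-Z0-9]*"
    r"\.(?:(?:[a-zA-Z0-9]*[_\-]+[a-zA-Z0-9])*[a-zA-Z0-9]+\.)*[a-zA-Z][a-zA-Z]+$"
)


def is_match(address: str) -> bool:
    """Return True if the whole address is accepted by the benchmark regex."""
    return EMAIL.match(address) is not None


# Hypothetical sample inputs (not from the dataset).
samples = [
    "john.doe@example.com",               # plain address
    "user_name@mail-server.example.org",  # separators in both parts
    ".john@example.com",                  # local part must start alphanumeric
    "john.@example.com",                  # separator must be followed by alphanumeric
    "john@example",                       # domain needs a dot and a TLD
    "a@b.c",                              # TLD needs at least two letters
]

for s in samples:
    print(f"{s!r}: {is_match(s)}")
```

The first two samples are accepted and the remaining four are rejected, which shows the main constraints: both parts must start with an alphanumeric character, separators cannot end a part, and the top-level domain must be at least two letters.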

Setup

Requirements

GCC >=11
Clang
PCRE2 (8-bit lib, tested with version 10.42)

Init Submodules & Build

git submodule update --init --recursive
mkdir -p build && mkdir -p mimir/build && cd mimir/build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_CXX_COMPILER=g++ -DCMAKE_C_COMPILER=gcc -DCMAKE_INSTALL_PREFIX=`pwd`/install
make -j`nproc` install
cd ../../build
cmake .. -DCMAKE_BUILD_TYPE=Release -DMim_DIR=`pwd`/../mimir/build/install/lib/cmake/mim -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang
make -j`nproc`

Dataset

To prepare the dataset, run the following in the project root:

grep -oE "[a-zA-Z0-9_.\-]+@[a-zA-Z0-9_.\-]+" fradulent_emails.txt > addresses.txt
python3 annotate_matched.py
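The two steps above first extract loose address candidates and then record which of them the benchmark regex accepts. The following is a rough sketch of that idea in Python, not the contents of annotate_matched.py: the function name `annotate` and the tuple-based result are assumptions made for illustration, and the actual format of annotated.txt is defined by the script in this repo. The candidate pattern mirrors the grep expression, although Python's handling of the backslash inside a character class may differ slightly from grep's.

```python
import re

# Loose candidate pattern, mirroring the grep -oE expression above.
CANDIDATE = re.compile(r"[a-zA-Z0-9_.\-]+@[a-zA-Z0-9_.\-]+")

# The strict benchmark regex from the top of this ReadMe.
EMAIL = re.compile(
    r"^[a-zA-Z0-9](?:[a-zA-Z0-9]*[._\-]+[a-zA-Z0-9])*[a-zA-Z0-9]*"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9]*[_\-]+[a-zA-Z0-9])*[a-zA-Z0-9]*"
    r"\.(?:(?:[a-zA-Z0-9]*[_\-]+[a-zA-Z0-9])*[a-zA-Z0-9]+\.)*[a-zA-Z][a-zA-Z]+$"
)


def annotate(text: str) -> list[tuple[str, bool]]:
    """Extract candidates from text and pair each with its match result.

    Hypothetical helper: the real annotation format may differ.
    """
    return [(c, EMAIL.match(c) is not None) for c in CANDIDATE.findall(text)]


# Made-up input text, not taken from the dataset.
sample = "Contact john.doe@example.com or reply to .bad@host today."
for address, matched in annotate(sample):
    print(address, matched)
```

Keeping the extraction pattern loose means the dataset contains both matching and non-matching inputs, so the benchmark exercises the rejecting paths of each engine as well as the accepting ones.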

Benchmark

To run the benchmark, provide the benchmark executable with the path to the dataset file:

./build/benchmark_mail annotated.txt 2> full_results.csv

Compile-time benchmark

For compile-time benchmarking, add -DREGEX_COMPILE_TIME_BENCHMARK=ON to the cmake invocation and make sure that you do not hit any caches, such as ccache. One way to do so is to use the full path to your clang, usually /usr/bin/clang++.

Also note that building through CMake adds a rather high overhead. You might want to run the following instead:

make clean; make -n benchmark_mail | grep -E "(clang++|bin/mim)" | sed "s/^/time /" | bash --verbose

Note that, for some reason, this unnecessarily compiles the MimIR part twice. That is a build-system issue, not a MimIR limitation.
