update Rust and tweak the benchmark to be fairer with others #21

BurntSushi · 2020-03-17T20:40:19Z

Hello!

Disclaimer: I am the author of Rust's regex engine.

I've seen this benchmark come up quite a bit, and while I don't really agree with how it's being presented, I think there are some things that should, at a minimum, be corrected. Principally, the Rust regex benchmark measures regexes with Unicode mode enabled, while other benchmarks, such as the C PCRE2 benchmark, do not. One could argue that this benchmark is specifically testing the "default" mode, but even the PCRE2 benchmark sets non-default options (e.g., PCRE2_UTF) and goes out of its way to enable faster matching (via the JIT). In my view, since the benchmark does not seem to depend on Unicode features (even though the input is not strictly ASCII), it's fairer to disable Unicode mode in Rust's regex engine. An alternative strategy would be to enable it in the others (which does indeed slow down PCRE2 by quite a bit).

I've done a full comparison of both options. (This is included in the corresponding commit message.)

Results on the C PCRE2 benchmark, comparing status quo on master (no Unicode) and with Unicode (by enabling PCRE2_UCP; PCRE2_UTF is already enabled on master):

$ ./benchmark-master ../input-text.txt
30.372077 - 92
27.980834 - 5301
9.845239 - 5

$ ./benchmark-unicode ../input-text.txt
59.389168 - 92
50.923444 - 5301
9.824077 - 5

Results for Rust's regex engine, comparing status quo on master (with Unicode) with no Unicode:

$ ./target/release/benchmark-master ../input-text.txt
27.063138 - 92
26.204103 - 5301
6.527913 - 5

$ ./target/release/benchmark-no-unicode ../input-text.txt
19.721541 - 92
14.545442000000001 - 5301
6.050694 - 5

So either way you slice it, once you compare apples-to-apples, Rust's regex engine performs better than PCRE2 here.
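To put rough numbers on that comparison, the sketch below divides the PCRE2 times by the Rust times for each of the three patterns, once with Unicode off in both engines and once with it on in both. The constants are copied from the runs above; the `ratios` helper is hypothetical, and single-run timings like these are only indicative.

```rust
// Times copied from the small-input runs above, one entry per pattern,
// in the order the benchmark prints them.
const PCRE2_NO_UNICODE: [f64; 3] = [30.372077, 27.980834, 9.845239];
const PCRE2_UNICODE: [f64; 3] = [59.389168, 50.923444, 9.824077];
const RUST_UNICODE: [f64; 3] = [27.063138, 26.204103, 6.527913];
const RUST_NO_UNICODE: [f64; 3] = [19.721541, 14.545442000000001, 6.050694];

/// Hypothetical helper: per pattern, how many times longer `slower` took than `faster`.
fn ratios(slower: &[f64; 3], faster: &[f64; 3]) -> [f64; 3] {
    [
        slower[0] / faster[0],
        slower[1] / faster[1],
        slower[2] / faster[2],
    ]
}

fn main() {
    let off = ratios(&PCRE2_NO_UNICODE, &RUST_NO_UNICODE);
    let on = ratios(&PCRE2_UNICODE, &RUST_UNICODE);
    // A ratio above 1.0 means PCRE2 took longer than Rust on that pattern.
    println!("Unicode off: PCRE2/Rust = {:.2}, {:.2}, {:.2}", off[0], off[1], off[2]);
    println!("Unicode on:  PCRE2/Rust = {:.2}, {:.2}, {:.2}", on[0], on[1], on[2]);
}
```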

Interestingly, this change does not actually make Rust's regex engine search faster. Instead, it makes the regex compile more quickly, which affects its benchmark results because compilation time is included in the measurement. We can "fix" this by increasing the input dramatically. Just to drive the point home, we increase it by 100x and re-run the same benchmarks above.

PCRE2:

$ ./benchmark-master ../big-input-text.txt
2250.779087 - 9200
2239.042453 - 530100
783.535698 - 500

$ ./benchmark-unicode ../big-input-text.txt
4306.606260 - 9200
4106.626449 - 530100
783.947407 - 500

Rust:

$ ./target/release/benchmark-master ../big-input-text.txt
1051.648524 - 9200
1153.597741 - 530100
479.305029 - 500

$ ./target/release/benchmark-no-unicode ../big-input-text.txt
1032.259767 - 9200
1142.573962 - 530100
459.696583 - 500

Notice that in this case, the performance of Rust's regex engine doesn't change at all, and is faster than PCRE2 in both cases since the search time dominates. PCRE2 on the other hand slows down quite a bit once Unicode mode is enabled.
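One rough way to see the compile-time effect in these numbers is a two-point fit. The sketch below assumes, purely for illustration, that the reported time is a fixed one-time cost plus a cost that scales linearly with the number of input copies (1x and 100x). That model, and the `fit` helper, are assumptions and not part of the benchmark; the inputs are the first-pattern Rust times from the runs above.

```rust
/// Hypothetical helper: fits `total = fixed + per_copy * copies` through a
/// measurement at 1 copy (`small`) and one at `scale` copies (`big`).
/// Returns `(fixed, per_copy)`.
fn fit(small: f64, big: f64, scale: f64) -> (f64, f64) {
    let per_copy = (big - small) / (scale - 1.0);
    let fixed = small - per_copy;
    (fixed, per_copy)
}

fn main() {
    // First pattern, Rust: (time on 1x input, time on 100x input).
    let (fixed_unicode, per_copy_unicode) = fit(27.063138, 1051.648524, 100.0);
    let (fixed_ascii, per_copy_ascii) = fit(19.721541, 1032.259767, 100.0);
    println!("Unicode:    fixed {:.2}, per copy {:.2}", fixed_unicode, per_copy_unicode);
    println!("no Unicode: fixed {:.2}, per copy {:.2}", fixed_ascii, per_copy_ascii);
}
```

Under this model, most of the gap between the two Rust builds lands in the fixed term, while the per-copy term is nearly the same, which is consistent with the explanation that the difference comes from compilation and not from searching. With only two single-run data points per build, the estimate is coarse.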

I've added a couple other commits that clean up the code and update to the latest version of the regex crate. This benchmark was using a version that was almost two years old.

The last release of regex 0.2 was almost two years ago. This does not materially change performance on this benchmark.

This removes the `time` dependency and instead uses std::time. We also clean up the Rust code quite a bit and make things more idiomatic. This does not impact benchmark results.
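As an illustration of timing with only the standard library, here is a minimal sketch built on `std::time::Instant`. It is not the code from the commit: the `measure` helper, the stand-in workload, and the choice of milliseconds are assumptions made for the example. The printed line mimics the `time - count` shape of the output shown in this thread.

```rust
use std::time::Instant;

/// Hypothetical helper: runs `work` once and returns
/// (elapsed wall-clock time in milliseconds, the value `work` produced).
fn measure<T>(work: impl FnOnce() -> T) -> (f64, T) {
    let start = Instant::now();
    let result = work();
    let elapsed = start.elapsed();
    (elapsed.as_secs_f64() * 1e3, result)
}

fn main() {
    // Stand-in workload: count ASCII digits instead of running a regex search.
    let text = "192.168.0.1 and 10.0.0.2";
    let (ms, count) = measure(|| text.bytes().filter(u8::is_ascii_digit).count());
    println!("{} - {}", ms, count);
}
```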

This dramatically improves the benchmark results for Rust's regex engine. Notably, this makes it a fairer comparison with other regex engines. For example, the PCRE2 benchmark program does *not* enable the PCRE2_UCP option, which is disabled by default. Therefore, Rust's regex engine should also disable its corresponding Unicode mode, which is enabled by default. Alternatively, we could enable Unicode mode in PCRE2, which will in turn slow down its benchmark result. Since the benchmark is not impacted by Unicode support (even though the input is not strictly ASCII), and because most regex engines in this benchmark are *not* Unicode by default, it seems prudent to turn Unicode off by default.

Results on the C PCRE2 benchmark, comparing status quo on master (no Unicode) with Unicode:

$ ./benchmark-master ../input-text.txt
30.372077 - 92
27.980834 - 5301
9.845239 - 5

$ ./benchmark-unicode ../input-text.txt
59.389168 - 92
50.923444 - 5301
9.824077 - 5

Results for Rust's regex engine, comparing status quo on master (with Unicode) with no Unicode:

$ ./target/release/benchmark-master ../input-text.txt
27.063138 - 92
26.204103 - 5301
6.527913 - 5

$ ./target/release/benchmark-no-unicode ../input-text.txt
19.721541 - 92
14.545442000000001 - 5301
6.050694 - 5

So either way you slice it, once you compare apples-to-apples, Rust's regex engine performs better than PCRE2 here.

Interestingly, this change does not actually make Rust's regex engine *search* faster. It actually makes the regex compile more quickly, which is impacting its benchmark results since compilation time is included in the benchmark. We can "fix" this by increasing the input dramatically. Just to drive the point home, we increase it by 100x and re-run the same benchmarks above.

PCRE2:

$ ./benchmark-master ../big-input-text.txt
2250.779087 - 9200
2239.042453 - 530100
783.535698 - 500

$ ./benchmark-unicode ../big-input-text.txt
4306.606260 - 9200
4106.626449 - 530100
783.947407 - 500

Rust:

$ ./target/release/benchmark-master ../big-input-text.txt
1051.648524 - 9200
1153.597741 - 530100
479.305029 - 500

$ ./target/release/benchmark-no-unicode ../big-input-text.txt
1032.259767 - 9200
1142.573962 - 530100
459.696583 - 500

Notice that in this case, the performance of Rust's regex engine doesn't change at all, and is faster than PCRE2 in both cases since the search time dominates. PCRE2 on the other hand slows down quite a bit once Unicode mode is enabled.

xsoheilalizadeh · 2020-03-18T05:15:40Z

Thanks, @BurntSushi, for your PR and its comprehensive explanation. I'll close mine.

BurntSushi · 2020-03-18T10:43:08Z

Oh, thanks! I didn't see your PR before I submitted mine. Sorry about that!

mariomka · 2020-05-01T13:18:53Z

Sorry, but I have been very busy for the past few months!
Thank you for your great explanation, I will try to review it this weekend.

BurntSushi · 2020-05-01T13:31:03Z

Thanks! Happy to answer any questions.

mariomka · 2020-05-03T13:04:03Z

Thank you.

Language/engine features seem to be very controversial and draw a lot of different opinions. After thinking it over, I believe the best solution, for now, is to use each regex engine with its default settings for all the languages.
I know this can be unfair to some languages, but this is a simple benchmark.
Maybe in the future we can add Unicode and non-Unicode versions, or something similar.

I have reviewed and updated some benchmarks (C, C# and Rust), but I'm not an expert in these languages/engines.

BurntSushi · 2020-05-03T13:09:09Z

Thanks for taking a look and for making the README better.

It looks like you're considering PCRE2's JIT as the default, though? You kind of have to go out of your way to use it, so arguably that's not the default either.

mariomka · 2020-05-03T13:40:47Z

Sorry, you are right. I will remove the JIT.

mariomka · 2020-05-03T14:00:12Z

I'm running the benchmark now. It takes a while, and then I will update the results.

On the other hand, I'm thinking of creating a branch where code and settings optimizations will be allowed. Some people seem interested in optimizing the benchmark to the limit, and that will be valuable for some people.

BurntSushi added 3 commits March 17, 2020 16:10

rust: update to regex 1.3.5

2d37781

The last release of regex 0.2 was almost two years ago. This does not materially change performance on this benchmark.

rust: update to Rust 2018 and clean up code

fb61136

This removes the `time` dependency and instead uses std::time. We also clean up the Rust code quite a bit and make things more idiomatic. This does not impact benchmark results.

BurntSushi force-pushed the ag/update-rust branch from b8ad0e9 to 4209415 March 17, 2020 20:54

xsoheilalizadeh mentioned this pull request Mar 18, 2020

Update Rust Benchmark #20

Closed

mariomka merged commit 6ea3307 into mariomka:master May 3, 2020

BurntSushi deleted the ag/update-rust branch May 3, 2020 13:09

This was referenced May 4, 2020

Disable Unicode support for C# to match Java, Rust, C, C++ #26

Closed

Upgrade to .NET Core 2.2 and compiled regexes #14

Merged

BurntSushi mentioned this pull request May 9, 2021

Compile time regex for C++ #37

Open

BurntSushi mentioned this pull request Jul 2, 2021

How to use RE2 in a C++ entry of a regexp implementations comparison benchmark google/re2#314

Closed
