update Rust and tweak the benchmark to be fairer with others #21

Merged: 3 commits, May 3, 2020

BurntSushi
Contributor

Hello!

Disclaimer: I am the author of Rust's regex engine.

I've seen this benchmark come up quite a bit, and while I don't really agree with how it's being presented, I think there are some things that should be minimally corrected. Principally, the Rust regex benchmark is measuring regexes with Unicode mode enabled while not doing that for other benchmarks, like the C PCRE2 benchmark. One could make the argument that this benchmark is specifically testing the "default" mode, but even the PCRE2 benchmark is setting non-default options (e.g., PCRE2_UTF) and going out of its way to enable faster matching (via the JIT). In my view, since the benchmark does not seem to be dependent on Unicode features (even though the input is not strictly ASCII), it's fairer to disable Unicode mode in Rust's regex engine. An alternative strategy would be to enable it in others (which does indeed slow down PCRE2 by quite a bit).

I've done a full comparison of both options. (This is included in the corresponding commit message.)

Results on the C PCRE2 benchmark, comparing status quo on master (no Unicode) and with Unicode (by enabling PCRE2_UCP; PCRE2_UTF is already enabled on master):

$ ./benchmark-master ../input-text.txt
30.372077 - 92
27.980834 - 5301
9.845239 - 5

$ ./benchmark-unicode ../input-text.txt
59.389168 - 92
50.923444 - 5301
9.824077 - 5

Results for Rust's regex engine, comparing status quo on master (with Unicode) with no Unicode:

$ ./target/release/benchmark-master ../input-text.txt
27.063138 - 92
26.204103 - 5301
6.527913 - 5

$ ./target/release/benchmark-no-unicode ../input-text.txt
19.721541 - 92
14.545442000000001 - 5301
6.050694 - 5

So either way you slice it, once you compare apples-to-apples, Rust's regex engine performs better than PCRE2 here.

Interestingly, this change does not actually make Rust's regex engine search faster. It actually makes the regex compile more quickly, which is impacting its benchmark results since compilation time is included in the benchmark. We can "fix" this by increasing the input dramatically. Just to drive the point home, we increase it by 100x and re-run the same benchmarks above.
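To make the point about compilation time concrete, here is a minimal sketch that times the setup phase and the search phase separately, using only `std::time::Instant`. It is not the benchmark's actual code: `Matcher`, `build_matcher`, and `measure` are hypothetical names, and a literal substring search stands in for a real regex engine, so the setup cost in this sketch is negligible. The last line of output imitates the `elapsed - count` format of the results above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a compiled regex: it only remembers a literal needle.
struct Matcher {
    needle: String,
}

/// Stand-in for regex compilation (assumption: a real engine does far more work here).
fn build_matcher(pattern: &str) -> Matcher {
    Matcher { needle: pattern.to_string() }
}

impl Matcher {
    /// Count non-overlapping occurrences of the needle in `haystack`.
    fn count(&self, haystack: &str) -> usize {
        haystack.matches(self.needle.as_str()).count()
    }
}

/// Time the setup phase and the search phase separately.
fn measure(pattern: &str, haystack: &str) -> (Duration, Duration, usize) {
    let start = Instant::now();
    let matcher = build_matcher(pattern);
    let setup = start.elapsed();

    let start = Instant::now();
    let count = matcher.count(haystack);
    let search = start.elapsed();

    (setup, search, count)
}

fn main() {
    // Made-up input; the real benchmark reads a text file instead.
    let haystack = "a@b.c x@y.z ".repeat(1000);
    let (setup, search, count) = measure("@", &haystack);

    let ms = |d: Duration| d.as_secs_f64() * 1e3;
    println!("setup:  {:.6} ms", ms(setup));
    println!("search: {:.6} ms", ms(search));
    println!("total:  {:.6} ms - {}", ms(setup + search), count);
}
```

With a split like this, the setup cost is paid once per pattern while the search cost grows with the input, which is why a much larger input makes the setup share of the total shrink.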

PCRE2:

$ ./benchmark-master ../big-input-text.txt
2250.779087 - 9200
2239.042453 - 530100
783.535698 - 500

$ ./benchmark-unicode ../big-input-text.txt
4306.606260 - 9200
4106.626449 - 530100
783.947407 - 500

Rust:

$ ./target/release/benchmark-master ../big-input-text.txt
1051.648524 - 9200
1153.597741 - 530100
479.305029 - 500

$ ./target/release/benchmark-no-unicode ../big-input-text.txt
1032.259767 - 9200
1142.573962 - 530100
459.696583 - 500

Notice that in this case, the performance of Rust's regex engine doesn't change at all, and is faster than PCRE2 in both cases since the search time dominates. PCRE2 on the other hand slows down quite a bit once Unicode mode is enabled.
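The relative changes behind this observation can be computed directly from the timings quoted above. The sketch below does only that arithmetic; the helper name `percent_change` and the table layout are illustrative choices, and the numbers are copied from the four pairs of runs. In these figures, PCRE2's first two patterns slow down by roughly 80 to 95 percent with Unicode enabled at both input sizes, while the gain Rust gets from disabling Unicode shrinks from as much as about 44 percent on the original input to under 5 percent on the 100x input.

```rust
/// Relative change from `before` to `after`, as a percentage of `before`.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // (label, before, after): timings for the three patterns, copied from the runs above.
    let runs = [
        (
            "PCRE2 1x:   master -> unicode",
            [30.372077, 27.980834, 9.845239],
            [59.389168, 50.923444, 9.824077],
        ),
        (
            "PCRE2 100x: master -> unicode",
            [2250.779087, 2239.042453, 783.535698],
            [4306.606260, 4106.626449, 783.947407],
        ),
        (
            "Rust 1x:    master -> no-unicode",
            [27.063138, 26.204103, 6.527913],
            [19.721541, 14.545442, 6.050694],
        ),
        (
            "Rust 100x:  master -> no-unicode",
            [1051.648524, 1153.597741, 479.305029],
            [1032.259767, 1142.573962, 459.696583],
        ),
    ];

    for (label, before, after) in runs.iter() {
        print!("{:<34}", label);
        for (b, a) in before.iter().zip(after.iter()) {
            // A positive value means slower, a negative value means faster.
            print!(" {:+7.1}%", percent_change(*b, *a));
        }
        println!();
    }
}
```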

I've added a couple other commits that clean up the code and update to the latest version of the regex crate. This benchmark was using a version that was almost two years old.

The last release of regex 0.2 was almost two years ago.

This does not materially change performance on this benchmark.

This removes the `time` dependency and instead uses std::time.

We also clean up the Rust code quite a bit and make things more
idiomatic.

This does not impact benchmark results.

This dramatically improves the benchmark results for Rust's regex
engine. Notably, this makes it a fairer comparison with other regex
engines. For example, the PCRE2 benchmark program does *not* enable the
PCRE2_UCP option, which is disabled by default. Therefore, Rust's regex
engine should also disable its corresponding Unicode mode, which is
enabled by default. Alternatively, we could enable Unicode mode in
PCRE2, which will in turn slow down its benchmark result.

Since the benchmark is not impacted by Unicode support (even though
the input is not strictly ASCII), and because most regex engines in
this benchmark are *not* Unicode by default, it seems prudent to turn
Unicode off by default.

@xsoheilalizadeh

Thanks, @BurntSushi, for your PR and its comprehensive explanation. I'll close mine.

@BurntSushi
Contributor Author

Oh, thanks! I didn't see your PR before I submitted mine. Sorry about that!

@mariomka
Owner

mariomka commented May 1, 2020

Sorry, but I have been very busy for the past few months!
Thank you for your great explanation, I will try to review it this weekend.

@BurntSushi
Contributor Author

Thanks! Happy to answer any questions.

@mariomka mariomka merged commit 6ea3307 into mariomka:master May 3, 2020
@mariomka
Owner

mariomka commented May 3, 2020

Thank you.

Language and engine features seem to be very controversial and draw a lot of different opinions. After thinking it over, I believe the best solution, for now, is to use each regex engine with its default settings for all languages.
I know this can be unfair to some languages, but this is a simple benchmark.
Maybe in the future we can add Unicode and non-Unicode versions, or something similar.

I have reviewed and updated some benchmarks (C, C# and Rust) but I'm not an expert in these languages/engines.

@BurntSushi
Contributor Author

Thanks for taking a look and for making the README better.

It looks like you're considering PCRE2's JIT as default though? You kind of have to go out of your way to use it. Arguably that's not the default either.

@BurntSushi BurntSushi deleted the ag/update-rust branch May 3, 2020 13:09
@mariomka
Owner

mariomka commented May 3, 2020

Sorry, you are right. I will remove the JIT.

@mariomka
Owner

mariomka commented May 3, 2020

I'm running the benchmark now. It takes a while, and then I will update the results.

On the other hand, I'm thinking of creating a branch where code and settings optimizations will be allowed. Some people seem interested in optimizing the benchmark to the limit, and that would be valuable to them.
