update Rust and tweak the benchmark to be fairer with others #21
The last release of regex 0.2 was almost two years ago. This does not materially change performance on this benchmark.
This removes the `time` dependency and instead uses std::time. We also clean up the Rust code quite a bit and make things more idiomatic. This does not impact benchmark results.
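As a rough illustration of timing with `std::time` alone, here is a minimal sketch. It is not the benchmark's actual code: the helper names `measure` and `as_millis_f64` and the digit-counting workload are hypothetical stand-ins, and the `time - count` output shape simply mirrors the result lines shown in this PR.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
/// (`measure` is a hypothetical helper, not a name taken from the benchmark.)
fn measure<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed())
}

/// Converts a duration to fractional milliseconds.
fn as_millis_f64(d: Duration) -> f64 {
    d.as_secs_f64() * 1e3
}

fn main() {
    // Stand-in workload: count ASCII digits instead of running a regex search.
    let text = "call 555-0100 or 555-0199";
    let (count, elapsed) = measure(|| text.bytes().filter(u8::is_ascii_digit).count());
    // Print "<elapsed> - <count>", the same shape as the benchmark output lines.
    println!("{} - {}", as_millis_f64(elapsed), count);
}
```

`Instant` is a monotonic clock, so it is a reasonable default for measuring elapsed time without pulling in an external crate.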
Disabling Unicode mode dramatically improves the benchmark results for Rust's regex engine. Notably, this makes it a fairer comparison with other regex engines. For example, the PCRE2 benchmark program does *not* enable the PCRE2_UCP option, which is disabled by default. Therefore, Rust's regex engine should also disable its corresponding Unicode mode, which is enabled by default. Alternatively, we could enable Unicode mode in PCRE2, which will in turn slow down its benchmark result.

Since the benchmark is not impacted by Unicode support (even though the input is not strictly ASCII), and because most regex engines in this benchmark are *not* Unicode by default, it seems prudent to turn Unicode off by default.

Results on the C PCRE2 benchmark, comparing status quo on master (no Unicode) with Unicode:

    $ ./benchmark-master ../input-text.txt
    30.372077 - 92
    27.980834 - 5301
    9.845239 - 5

    $ ./benchmark-unicode ../input-text.txt
    59.389168 - 92
    50.923444 - 5301
    9.824077 - 5

Results for Rust's regex engine, comparing status quo on master (with Unicode) with no Unicode:

    $ ./target/release/benchmark-master ../input-text.txt
    27.063138 - 92
    26.204103 - 5301
    6.527913 - 5

    $ ./target/release/benchmark-no-unicode ../input-text.txt
    19.721541 - 92
    14.545442000000001 - 5301
    6.050694 - 5

So either way you slice it, once you compare apples-to-apples, Rust's regex engine performs better than PCRE2 here.

Interestingly, this change does not actually make Rust's regex engine *search* faster. It actually makes the regex compile more quickly, which is impacting its benchmark results since compilation time is included in the benchmark. We can "fix" this by increasing the input dramatically. Just to drive the point home, we increase it by 100x and re-run the same benchmarks above.

PCRE2:

    $ ./benchmark-master ../big-input-text.txt
    2250.779087 - 9200
    2239.042453 - 530100
    783.535698 - 500

    $ ./benchmark-unicode ../big-input-text.txt
    4306.606260 - 9200
    4106.626449 - 530100
    783.947407 - 500

Rust:

    $ ./target/release/benchmark-master ../big-input-text.txt
    1051.648524 - 9200
    1153.597741 - 530100
    479.305029 - 500

    $ ./target/release/benchmark-no-unicode ../big-input-text.txt
    1032.259767 - 9200
    1142.573962 - 530100
    459.696583 - 500

Notice that in this case, the performance of Rust's regex engine doesn't change at all, and is faster than PCRE2 in both cases since the search time dominates. PCRE2 on the other hand slows down quite a bit once Unicode mode is enabled.
Thanks, @BurntSushi, for your PR and its comprehensive explanation. I'll close mine.
Oh, thanks! I didn't see your PR before I submitted mine. Sorry about that!
Sorry, but I have been very busy for the past few months!
Thanks! Happy to answer any questions.
Thank you. Language and engine features seem to be very controversial and draw a lot of different opinions. I have been thinking about it, and I think the best solution, for now, is to use each regex engine with its default settings in all the languages. I have reviewed and updated some benchmarks (C, C#, and Rust), but I'm not an expert in these languages or engines.
Thanks for taking a look and for making the README better. It looks like you're considering PCRE2's JIT as default, though? You kind of have to go out of your way to use it. Arguably that's not the default either.
Sorry, you are right. I will remove the JIT.
I'm running the benchmark now. It takes a while, and then I will update the results. On another note, I'm thinking of creating a branch where code and settings optimizations will be allowed. Some people seem interested in optimizing the benchmark to the limit, and such a branch would be valuable to them.
Hello!
Disclaimer: I am the author of Rust's regex engine.
I've seen this benchmark come up quite a bit, and while I don't really agree with how it's being presented, I think there are some things that should, at a minimum, be corrected. Principally, the Rust regex benchmark is measuring regexes with Unicode mode enabled while not doing that for other benchmarks, like the C PCRE2 benchmark. One could make the argument that this benchmark is specifically testing the "default" mode, but even the PCRE2 benchmark is setting non-default options (e.g., `PCRE2_UTF`) and going out of its way to enable faster matching (via the JIT). In my view, since the benchmark does not seem to be dependent on Unicode features (even though the input is not strictly ASCII), it's fairer to disable Unicode mode in Rust's regex engine. An alternative strategy would be to enable it in others (which does indeed slow down PCRE2 by quite a bit).

I've done a full comparison of both options. (This is included in the corresponding commit message.)
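Results on the C PCRE2 benchmark, comparing status quo on master (no Unicode) and with Unicode (by enabling `PCRE2_UCP`; `PCRE2_UTF` is already enabled on master):

    $ ./benchmark-master ../input-text.txt
    30.372077 - 92
    27.980834 - 5301
    9.845239 - 5

    $ ./benchmark-unicode ../input-text.txt
    59.389168 - 92
    50.923444 - 5301
    9.824077 - 5

Results for Rust's regex engine, comparing status quo on master (with Unicode) with no Unicode:

    $ ./target/release/benchmark-master ../input-text.txt
    27.063138 - 92
    26.204103 - 5301
    6.527913 - 5

    $ ./target/release/benchmark-no-unicode ../input-text.txt
    19.721541 - 92
    14.545442000000001 - 5301
    6.050694 - 5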
So either way you slice it, once you compare apples-to-apples, Rust's regex engine performs better than PCRE2 here.
Interestingly, this change does not actually make Rust's regex engine search faster. It actually makes the regex compile more quickly, which is impacting its benchmark results since compilation time is included in the benchmark. We can "fix" this by increasing the input dramatically. Just to drive the point home, we increase it by 100x and re-run the same benchmarks above.
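PCRE2:

    $ ./benchmark-master ../big-input-text.txt
    2250.779087 - 9200
    2239.042453 - 530100
    783.535698 - 500

    $ ./benchmark-unicode ../big-input-text.txt
    4306.606260 - 9200
    4106.626449 - 530100
    783.947407 - 500

Rust:

    $ ./target/release/benchmark-master ../big-input-text.txt
    1051.648524 - 9200
    1153.597741 - 530100
    479.305029 - 500

    $ ./target/release/benchmark-no-unicode ../big-input-text.txt
    1032.259767 - 9200
    1142.573962 - 530100
    459.696583 - 500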
Notice that in this case, the performance of Rust's regex engine doesn't change at all, and is faster than PCRE2 in both cases since the search time dominates. PCRE2 on the other hand slows down quite a bit once Unicode mode is enabled.
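To make the "search time dominates" point concrete, the arithmetic can be sketched as follows. This is only an illustration: it uses the first-pattern timings reported in the commit message, and `relative_gain` is a hypothetical helper name. For Rust, the share of time saved by disabling Unicode drops from roughly 27% on the original input to under 2% on the 100x input, which is what one would expect if the saving is a mostly fixed compile-time cost. For PCRE2, the gap stays near 48% even on the 100x input.

```rust
/// Fraction of the slower (Unicode-enabled) time that is saved by running
/// without Unicode. (`relative_gain` is a hypothetical helper for this sketch.)
fn relative_gain(with_unicode: f64, without_unicode: f64) -> f64 {
    (with_unicode - without_unicode) / with_unicode
}

fn main() {
    // First-pattern timings copied from the results in the commit message.
    let rust_small = relative_gain(27.063138, 19.721541);
    let rust_big = relative_gain(1051.648524, 1032.259767);
    // PCRE2 on the 100x input: Unicode build vs. master (no Unicode).
    let pcre2_big = relative_gain(4306.606260, 2250.779087);

    println!("Rust, original input: {:.1}% saved", rust_small * 100.0);
    println!("Rust, 100x input:     {:.1}% saved", rust_big * 100.0);
    println!("PCRE2, 100x input:    {:.1}% saved", pcre2_big * 100.0);
}
```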
I've added a couple other commits that clean up the code and update to the latest version of the regex crate. This benchmark was using a version that was almost two years old.