Optimize the DFA inner loop. #202
This employs a number of tricks to make the inner loop faster:

1. **Elides bounds checks with `unsafe`**. This is the first use of `unsafe` in regex. It is justified below with benchmarks, and comments in the source make an argument for correctness.
2. Store metadata about states in the upper bits of a state pointer. This reduces the amount of branching needed.
3. Create an inner inner loop that handles all transitions between non-dead, non-match and non-start states (i.e., the majority of cases). In particular, this lets us avoid having to check specifically whether each state is a match state or not.
4. Start states are only treated specially if there is a prefix detected that we should scan for. Otherwise, start states are no different than any other state.
5. Move transitions out of `State` and into one giant transition table, which should hopefully improve locality and make better use of the cache.

The use of `unsafe` is unfortunate, but it significantly reduces the number of instructions executed in a search. When the DFA spends a lot of time in the inner loop, eliding the bounds checks leads to better performance. In most cases, the boost is worth about 5%, but in some extreme cases (e.g., a match is the entirety of a large haystack), the boost can be worth nearly 50%.
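To make the list above concrete, here is a minimal sketch of how flag bits in a state pointer, a single flat transition table, and an unchecked lookup can fit together. It is not the code from this PR: the names (`Dfa`, `StatePtr`, `STATE_MATCH`, `STATE_DEAD`, `STATE_MAX`, `for_ab`, `find_end`), the bit positions, and the toy DFA for the literal `ab` are all assumptions chosen for illustration, and the loop is not unrolled.

```rust
/// A state "pointer" is the offset of the state's row in one flat table.
/// The upper bits are reserved for flags (bit positions are illustrative).
type StatePtr = u32;

const STATE_MATCH: StatePtr = 1 << 29; // hypothetical flag: match state
const STATE_DEAD: StatePtr = 1 << 30; // hypothetical flag: dead state
/// Any pointer <= STATE_MAX carries no flag bits, i.e. it is an ordinary state.
const STATE_MAX: StatePtr = STATE_MATCH - 1;

const ROW: usize = 256; // one table entry per input byte

struct Dfa {
    /// All transitions live in one Vec: row `si` starts at index `si`.
    table: Vec<StatePtr>,
}

impl Dfa {
    /// Hand-built toy DFA that finds the first occurrence of "ab".
    fn for_ab() -> Dfa {
        let (s0, s1, s2) = (0u32, ROW as u32, 2 * ROW as u32);
        // Every transition defaults to the start state.
        let mut table = vec![s0; 3 * ROW];
        table[s0 as usize + b'a' as usize] = s1;
        table[s1 as usize + b'a' as usize] = s1;
        // The flag is stored on the pointer itself, so entering the match
        // state is visible without looking anything else up.
        table[s1 as usize + b'b' as usize] = s2 | STATE_MATCH;
        Dfa { table }
    }

    #[inline(always)]
    fn next(&self, si: StatePtr, byte: u8) -> StatePtr {
        let i = si as usize + byte as usize;
        debug_assert!(i < self.table.len());
        // SAFETY (argument for this sketch only): `next` is only called with
        // unflagged pointers, every unflagged pointer in the table is the
        // start of an existing 256-entry row, and `byte` is below 256.
        unsafe { *self.table.get_unchecked(i) }
    }

    /// Returns the offset just past the first "ab", if any.
    fn find_end(&self, haystack: &[u8]) -> Option<usize> {
        let mut si: StatePtr = 0;
        let mut at = 0;
        while at < haystack.len() {
            // Hot loop: a single comparison covers "not match, not dead".
            while si <= STATE_MAX && at < haystack.len() {
                si = self.next(si, haystack[at]);
                at += 1;
            }
            if si & STATE_MATCH != 0 {
                return Some(at);
            }
            if si & STATE_DEAD != 0 {
                return None;
            }
        }
        None
    }
}
```

With this toy table, `Dfa::for_ab().find_end(b"xxab")` returns `Some(4)`. The point of the layout is that the hot loop needs one comparison against `STATE_MAX` per byte instead of a separate check per special state kind, and that the lookup skips the bounds check only because the table's invariants are argued for in a comment.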
Here is a comparison between code without `unsafe` and with `unsafe`:

```
$ cargo-benchcmp rust-safe rust-unsafe --threshold 3
name                                     rust-safe ns/iter     rust-unsafe ns/iter   diff ns/iter  diff %
misc::anchored_literal_long_match        29 (13,448 MB/s)      26 (15,000 MB/s)      -3            -10.34%
misc::anchored_literal_short_match       28 (928 MB/s)         26 (1,000 MB/s)       -2            -7.14%
misc::easy0_1MB                          49 (21,400,061 MB/s)  42 (24,966,738 MB/s)  -7            -14.29%
misc::easy1_1K                           79 (13,215 MB/s)      76 (13,736 MB/s)      -3            -3.80%
misc::easy1_32                           79 (658 MB/s)         76 (684 MB/s)         -3            -3.80%
misc::easy1_32K                          80 (409,850 MB/s)     76 (431,421 MB/s)     -4            -5.00%
misc::hard_1K                            104 (10,105 MB/s)     100 (10,510 MB/s)     -4            -3.85%
misc::match_class_unicode                595 (270 MB/s)        571 (281 MB/s)        -24           -4.03%
misc::medium_1MB                         51 (20,560,862 MB/s)  44 (23,831,909 MB/s)  -7            -13.73%
misc::no_exponential                     378 (264 MB/s)        361 (277 MB/s)        -17           -4.50%
misc::not_literal                        206 (247 MB/s)        196 (260 MB/s)        -10           -4.85%
misc::one_pass_long_prefix               116 (224 MB/s)        111 (234 MB/s)        -5            -4.31%
misc::one_pass_long_prefix_not           116 (224 MB/s)        108 (240 MB/s)        -8            -6.90%
misc::one_pass_short                     81 (209 MB/s)         76 (223 MB/s)         -5            -6.17%
misc::one_pass_short_not                 79 (215 MB/s)         75 (226 MB/s)         -4            -5.06%
misc::reallyhard_1K                      3,796 (276 MB/s)      3,629 (289 MB/s)      -167          -4.40%
misc::reallyhard_1MB                     3,765,536 (278 MB/s)  3,602,215 (291 MB/s)  -163,321      -4.34%
misc::reallyhard_32                      234 (252 MB/s)        222 (265 MB/s)        -12           -5.13%
misc::reallyhard_32K                     117,917 (278 MB/s)    112,604 (291 MB/s)    -5,313        -4.51%
misc::replace_all                        144                   137                   -7            -4.86%
sherlock::before_holmes                  2,163,856 (274 MB/s)  2,077,792 (286 MB/s)  -86,064       -3.98%
sherlock::everything_greedy              3,641,444 (163 MB/s)  2,578,502 (230 MB/s)  -1,062,942    -29.19%
sherlock::everything_greedy_nl           2,109,164 (282 MB/s)  1,080,933 (550 MB/s)  -1,028,231    -48.75%
sherlock::holmes_coword_watson           1,087,276 (547 MB/s)  1,037,918 (573 MB/s)  -49,358       -4.54%
sherlock::ing_suffix                     2,419,816 (245 MB/s)  2,308,945 (257 MB/s)  -110,871      -4.58%
sherlock::ing_suffix_limited_space       2,360,927 (251 MB/s)  2,259,791 (263 MB/s)  -101,136      -4.28%
sherlock::letters                        27,710,372 (21 MB/s)  25,348,374 (23 MB/s)  -2,361,998    -8.52%
sherlock::letters_lower                  26,888,541 (22 MB/s)  24,759,385 (24 MB/s)  -2,129,156    -7.92%
sherlock::letters_upper                  3,138,611 (189 MB/s)  2,989,327 (199 MB/s)  -149,284      -4.76%
sherlock::line_boundary_sherlock_holmes  2,132,889 (278 MB/s)  2,046,399 (290 MB/s)  -86,490       -4.06%
sherlock::name_alt1                      35,964 (16,542 MB/s)  37,164 (16,008 MB/s)  1,200         3.34%
sherlock::name_whitespace                88,768 (6,702 MB/s)   85,322 (6,972 MB/s)   -3,446        -3.88%
sherlock::quotes                         800,085 (743 MB/s)    769,792 (772 MB/s)    -30,293       -3.79%
sherlock::the_whitespace                 1,315,168 (452 MB/s)  1,238,173 (480 MB/s)  -76,995       -5.85%
sherlock::words                          11,230,278 (52 MB/s)  9,855,296 (60 MB/s)   -1,374,982    -12.24%
```
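For reading the `diff %` column: the numbers above are consistent with the change being taken relative to the first (`rust-safe`) column, so negative values mean the `unsafe` build is faster. A small sketch of that arithmetic, where `diff_percent` is a hypothetical helper and not part of `cargo-benchcmp`:

```rust
/// Relative change of `new_ns` with respect to `old_ns`, in percent.
/// Negative means the new measurement took less time per iteration.
fn diff_percent(old_ns: f64, new_ns: f64) -> f64 {
    (new_ns - old_ns) / old_ns * 100.0
}
```

For example, `misc::anchored_literal_long_match` goes from 29 to 26 ns/iter, and (26 - 29) / 29 is about -10.34%, matching the table. `sherlock::name_alt1` is the one regression listed: (37,164 - 35,964) / 35,964 is about +3.34%.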
Comparison with current master:
Nice! Everything about I wonder if it'd be worth investigating some fuzzing techniques like afl to test this out a bit? I'm pretty happy with the level of comments and thought here though, so I'd be fine merging at any time :)
Fuzzing is definitely a good idea. There is some fuzzy in
This employs a number of tricks to make the inner loop faster:

1. Elides bounds checks with `unsafe`. This is the first use of `unsafe` in regex. It is justified below with benchmarks, and comments in the source make an argument for correctness.
2. Store metadata about states in the upper bits of a state pointer. This reduces the amount of branching needed.
3. Create an inner inner loop that handles all transitions between non-dead, non-match and non-start states (i.e., the majority of cases). In particular, this lets us avoid having to check specifically whether each state is a match state or not. It is unrolled 4 times.
4. Start states are only treated specially if there is a prefix detected that we should scan for. Otherwise, start states are no different than any other state.
5. Move transitions out of `State` and into one giant transition table, which should hopefully improve locality and make better use of the cache.

The use of `unsafe` is unfortunate, but it significantly reduces the number of instructions executed in a search. When the DFA spends a lot of time in the inner loop, eliding the bounds checks leads to better performance. In most cases, the boost is worth about 5%, but in some extreme cases (e.g., a match is the entirety of a large haystack), the boost can be worth nearly 50%.

Here is a comparison between code without `unsafe` and with `unsafe`: