implement a faster memory pool #188
cc @aturon I think I'm out of my depth here. Your feedback on this would be amazing. :-)
One idea to at least help …
cc @carllerche because you wrote http://carllerche.github.io/pool/pool/
To clarify, this is basically just a cache for regexes which is pulled from every time a match is executed, and pushed back onto whenever the match is done? Reading the other information, however, I agree with your constraints for now.

I don't personally know of any unbounded lock-free MPMC queues, but I've seen a bounded version which I think used to be implemented in the standard library at some point, and I think there's a translation into Rust lying around somewhere. Perhaps that'd be faster than the Treiber stack, if the memory management in crossbeam is the overhead here? I suspect it'd still be slower than …

It does sound to me, though, like a good idea for many matches to cache their data locally (such as …)
@alexcrichton Thanks for the helpful reply!
Right.
It looks like @carllerche has a bounded MPMC queue here (based on the implementation in your link, it seems): http://carllerche.github.io/syncbox/syncbox/struct.ArrayQueue.html. That should at least be straightforward enough to try. I guess I assumed that a queue might have more overhead than necessary, since we don't actually need a queue here, but certainly, the fact that an implementation exists means it's at least worth a try. I'd also be surprised if it was faster than a Treiber stack! I've also been looking at Go's …
Whoa! Looks like this was implemented way back when in rust-lang/rust@5876e21.
Ding ding ding! Looks like we have a winner. Using …
OK, I am going to sink my teeth into this direction. I may end up with a more general pool crate that … To your point @alexcrichton, a bounded queue is fine, because we can fall back to the "dumb" queue with a mutex.
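For illustration, that fallback strategy might look like the sketch below. (This uses today's `crossbeam-queue` crate's `ArrayQueue`, which is an assumption on my part; it is not the syncbox queue linked above, and the names are made up.)

```rust
use std::sync::Mutex;

use crossbeam_queue::ArrayQueue; // crossbeam-queue = "0.3"

/// Sketch: try a lock-free bounded queue first, and spill into the
/// "dumb" mutexed Vec only when the queue is full (or empty, on get).
struct HybridPool<T> {
    fast: ArrayQueue<T>,
    slow: Mutex<Vec<T>>,
}

impl<T> HybridPool<T> {
    fn with_capacity(cap: usize) -> HybridPool<T> {
        HybridPool {
            fast: ArrayQueue::new(cap),
            slow: Mutex::new(Vec::new()),
        }
    }

    fn put(&self, value: T) {
        // ArrayQueue::push hands the value back on overflow, so nothing
        // is lost when we fall back to the mutex-guarded Vec.
        if let Err(value) = self.fast.push(value) {
            self.slow.lock().unwrap().push(value);
        }
    }

    fn get(&self) -> Option<T> {
        self.fast.pop().or_else(|| self.slow.lock().unwrap().pop())
    }
}
```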
OK, I think I've mostly solved this satisfactorily with a spin lock. It's still unsafe, but the implementation is quite a bit simpler and works well when there's low contention.
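The actual implementation is in the crate linked below; the following is only a minimal sketch of the shape of that idea, with a simplified `get`/`put` API assumed for illustration.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

/// A pool of reusable scratch values guarded by a simple spinlock.
pub struct Pool<T: Send> {
    lock: AtomicBool,
    stack: UnsafeCell<Vec<T>>,
    create: Box<dyn Fn() -> T + Send + Sync>,
}

// SAFETY: every access to `stack` is serialized by the spinlock below.
unsafe impl<T: Send> Sync for Pool<T> {}

impl<T: Send> Pool<T> {
    pub fn new(create: Box<dyn Fn() -> T + Send + Sync>) -> Pool<T> {
        Pool {
            lock: AtomicBool::new(false),
            stack: UnsafeCell::new(Vec::new()),
            create,
        }
    }

    /// Pop a cached value, or create a fresh one if the pool is empty.
    pub fn get(&self) -> T {
        self.with_lock(|stack| stack.pop())
            .unwrap_or_else(|| (self.create)())
    }

    /// Return a value to the pool so a later search can reuse it.
    pub fn put(&self, value: T) {
        self.with_lock(|stack| stack.push(value));
    }

    fn with_lock<R>(&self, f: impl FnOnce(&mut Vec<T>) -> R) -> R {
        // Spin until we flip the flag from false to true. This is fine under
        // low contention, since the critical section is just a Vec push/pop.
        while self.lock.swap(true, Ordering::Acquire) {}
        let result = f(unsafe { &mut *self.stack.get() });
        self.lock.store(false, Ordering::Release);
        result
    }
}
```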
Crate is here: https://github.com/BurntSushi/mempool

And finally, the benchmarks as done in the other experiments:
Which looks pretty good!
Overall before/after:
So I think some of those (particularly the …
@BurntSushi Glad to see you've found something reasonably workable. I'm actually eager to use this as a more realistic benchmark for crossbeam, and to see what, if any, tuning can be done. I haven't had a chance to dig into the details yet, though -- but is there an easy link to the whole benchmarking setup?
(And I should mention that if you can get away with a queue, either of the queues is going to be much better than the stack contention-wise. But the fact that …)
@aturon Currently, the regex work is in a branch … I attempted to encapsulate the specific thing I cared about in the benchmarks for the …

I think keeping a guard open across operations is only half the picture, unfortunately, since it isn't always possible. For example, …

One thing I did notice in the implementation of the TreiberStack is that there's always an allocation for every …
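That last point is the classic cost of a Treiber stack: every push heap-allocates a fresh node before CAS-ing it onto the head. Here's a stripped-down sketch of the structure to show where the allocation lives; note that the pop below is deliberately naive and is not sound under real concurrency, which is exactly the problem crossbeam's epoch-based reclamation solves.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

pub struct NaiveTreiberStack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> NaiveTreiberStack<T> {
    pub fn new() -> NaiveTreiberStack<T> {
        NaiveTreiberStack { head: AtomicPtr::new(ptr::null_mut()) }
    }

    pub fn push(&self, value: T) {
        // Here is the allocation in question: one Box per push, even if
        // another thread pops the value again immediately.
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*node).next = head };
            if self
                .head
                .compare_exchange(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    pub fn pop(&self) -> Option<T> {
        loop {
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return None;
            }
            // UNSOUND with concurrent poppers: `head` may already have been
            // freed by another thread. Epoch GC defers the free to make this
            // dereference safe in a real implementation.
            let next = unsafe { (*head).next };
            if self
                .head
                .compare_exchange(head, next, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return Some(unsafe { Box::from_raw(head) }.value);
            }
        }
    }
}
```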
@BurntSushi The SegmentedStack data structure batches allocations (in the way its name suggests). It's not yet highly tuned in crossbeam, but it has a lot of potential. One nice thing is that most of the time it only needs a fetch-and-inc rather than a compare-and-swap, which is often faster (since multiple such operations can be executed in parallel by the hardware).
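A toy illustration of that difference (not crossbeam's code): fetch-and-inc always completes in one atomic instruction, while a compare-and-swap must re-read and retry every time it loses a race.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Fetch-and-inc: unconditionally succeeds; the hardware can service
// several of these from different cores back to back.
fn next_index_fai(counter: &AtomicUsize) -> usize {
    counter.fetch_add(1, Ordering::Relaxed)
}

// Compare-and-swap loop: a thread must observe the current value and
// retry whenever another thread beat it there, so contended updates
// serialize (and waste work on every retry).
fn next_index_cas(counter: &AtomicUsize) -> usize {
    loop {
        let cur = counter.load(Ordering::Relaxed);
        if counter
            .compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed)
            .is_ok()
        {
            return cur;
        }
    }
}
```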
@aturon Neat, I missed that. I added benchmarks for both queues:
Definitely looks like SegQueue is a bit faster than a TreiberStack for this use case!
I might be dead in the water. Passing a guard around means that iterators like …
Even simpler constructs like this will fail to compile:

```rust
for pos in Regex::new(r"\b\w{13}\b").unwrap().find_iter(text) {
    // ...
}
```

with …
In this case, …
The principal change in this commit is a complete rewrite of how literals are detected from a regular expression. In particular, we now traverse the abstract syntax to discover literals instead of the compiled byte code. This permits more tunable control over which and how many literals are extracted, and is now exposed in the `regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based on frequency analysis. We end up regressing on a couple benchmarks slightly because of this, but gain in some others and in general should be faster in a broader number of cases. (Principally because we try to run `memchr` on the rarest byte in a literal.) This should also greatly improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix literals exist but no prefix literals exist, then we can quickly scan for suffix matches and then run the DFA in reverse to find matches. (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool (from the new `mempool` crate). This reduces some amount of constant overhead and improves several benchmarks that either search short haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it could only contain 2 or more). The InvalidSet error variant is now deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the start state was always the first instruction, which is trivially wrong for an expression like `^☃$`. This bug persisted because it typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published sub-crate. The CLI tool can answer various facts about regular expressions, such as printing its AST, its compiled byte code or its detected literals.

Closes #96, #188, #189
Problem
When a regex search executes, it has to choose a matching engine (sometimes more than one) to carry out the search. Each matching engine needs some amount of fixed mutable space on the heap to carry out a search, which I'll call "scratch space." In general, this space is reusable, and reusing it leads to significant performance benefits when using a regular expression to carry out multiple searches. (For example, the scratch space may contain computed DFA states.) Scratch space is used every time a regular expression executes a search. For example, calling

```rust
re.find_iter("...")
```

will execute possibly many searches, depending on how many matches it finds.

Here are some constraints I've been working with:
1. `Regex` must be `Send` and `Sync`. This permits one to share a regex across multiple threads without any external synchronization.
2. The `regex` crate should never spawn a thread.
3. It's OK to copy the scratch space, so that each simultaneous search can use its own copy.

Constraint (1) is the killer, because it means synchronizing concurrent access to mutable state. For example, one might have an `Arc<Regex>` where the `Regex` is used simultaneously among multiple threads. If we gave up on the `Sync` bound, then callers would need to either `Clone` regular expressions to use them across multiple threads, or put them in a `Mutex`. The latter is a bit surprising, since `Regex` doesn't have any public mutable methods; it is also very poor performance-wise, because one search will completely block all other searches. The former, cloning, is somewhat acceptable, if a little wasteful. That is, if we dropped the `Sync` bound, I'd expect users to clone regular expressions if they want to use them across multiple threads.

Constraint (2) just seems like good sense. To me, a thread spawned as a result of running a regex violates the principle of least surprise.
Constraint (3) permits the implementation to be simple and should make contention a non-factor. If we were more memory conscious, we'd never copy the scratch space, which would mean that each of the regex engines would be forced to do their own type of synchronization. Not only is this much more complex, but it means contention could be a real problem while searching, which seems unfortunate.
Given all of this, it is my belief that the key thing worth optimizing is the overhead of synchronization itself. The cost of copying the scratch space should be amortized through reuse, and contention should be extremely limited since synchronization only needs to occur at the start and end of every search. (I suppose contention could become an issue if a regex that matches very often on very short spans is used simultaneously across multiple threads. The good thing about this is that the caller could work around this by simply cloning the regex, avoiding contention altogether.)
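For concreteness, constraint (1) is what allows usage like the following, with no locking visible to the caller. (A minimal example against the crate's public API; the pattern and thread count are arbitrary.)

```rust
use std::sync::Arc;
use std::thread;

use regex::Regex;

fn main() {
    // One compiled regex, shared across threads with no external
    // synchronization. This compiles only because Regex is Send + Sync.
    let re = Arc::new(Regex::new(r"\w+").unwrap());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let re = re.clone();
            thread::spawn(move || {
                // Each concurrent search needs its own scratch space,
                // which is what the pool hands out internally.
                assert!(re.is_match("hello world"));
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```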
Current solution

The most straightforward solution to this problem is to wrap some collection data structure in a `Mutex`. The data structure could trivially be a queue, stack or (singly) linked list. The current implementation is so simple that it can be digested in a quick skim. In particular, the only operations we need to support are `get` and `pop`. No ordering invariants are necessary, since all copies of scratch space are equally usable for every regex search.
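As a sketch, the mutex-wrapped version amounts to the following (illustrative names, not the crate's actual internals):

```rust
use std::sync::Mutex;

/// The "dumb" pool: a stack of scratch spaces behind a single Mutex.
struct MutexPool<T> {
    stack: Mutex<Vec<T>>,
    create: fn() -> T,
}

impl<T> MutexPool<T> {
    fn new(create: fn() -> T) -> MutexPool<T> {
        MutexPool { stack: Mutex::new(Vec::new()), create }
    }

    /// Grab any available scratch space, or make a new one. Order is
    /// irrelevant, so a Vec used as a stack suffices.
    fn get(&self) -> T {
        self.stack.lock().unwrap().pop().unwrap_or_else(|| (self.create)())
    }

    /// Hand the scratch space back at the end of a search.
    fn put(&self, value: T) {
        self.stack.lock().unwrap().push(value);
    }
}
```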
Benchmarks

I have three sets of benchmarks representing three pretty easy-to-implement strategies. The first is the baseline, which uses a `RefCell<Vec<T>>`. This obviously does no synchronization, so in this world, `Regex` doesn't impl `Sync`. The second is the current solution, which uses a `Mutex<Vec<T>>`. The third uses a lock-free stack from `crossbeam`, namely `TreiberStack<T>`.

Here is a comparison between the baseline and `Mutex<Vec<T>>`, showing only benchmarks with more than a 10% difference (a positive percent indicates how much slower mutexes are than refcells in this case):

And here is a comparison between the baseline and `TreiberStack<T>`:

I note that the comparisons above seem entirely expected to me. Outside of noise, synchronization makes matching universally slower. Moreover, the benchmarks that actually show up in the list (which is a subset of all benchmarks) correspond to benchmarks with many searches over short haystacks, or many short matches over long haystacks. This is exactly the case where constant overhead will make a difference.
And, to make it easier to read, here is a comparison between `Mutex<Vec<T>>` and `TreiberStack<T>`:

I admit that I found it somewhat surprising that a lock-free stack was being beaten by a mutex, but this is probably due to my complete ignorance of lock-free algorithms. (Hence why I'm writing this ticket.)
Other things

When using a `TreiberStack<T>`, `perf top` reports `mem::epoch::participant::Participant::enter` as a potential hotspot.

When using a `Mutex<Vec<T>>`, `perf top` reports `pthread_mutex_lock` and `__pthread_mutex_unlock_usercnt` as potential hotspots.

In both cases, other hotspots of course also appear, such as methods in the DFA, `memchr`, etc.

Another thing I've noticed in profiling is how much time is being spent in the `Drop` impl for `PoolGuard`. I tracked this down to time spent in `memcpy`, so I moved the representation of scratch spaces to a `Box`, which definitely helped, especially for the DFA, whose cache struct isn't tiny. I'm not sure what can be done about this, though.
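To illustrate the `Box` change: once the scratch space lives behind a `Box`, moving it in and out of the pool copies a single pointer instead of `memcpy`-ing the whole cache struct. (`DfaCache` below is a stand-in for the real cache type.)

```rust
/// Stand-in for the DFA's cache struct, which "isn't tiny."
struct DfaCache {
    states: Vec<u8>,
    table: [u32; 256],
}

type Scratch = Box<DfaCache>;

fn main() {
    let mut pool: Vec<Scratch> = Vec::new();
    let scratch: Scratch = Box::new(DfaCache {
        states: vec![0; 1024],
        table: [0; 256],
    });
    // Pushing/popping moves 8 bytes (the pointer) rather than the
    // kilobyte-plus payload it owns.
    pool.push(scratch);
    let _reused = pool.pop().unwrap();
}
```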
???

So what can we do here? Is a `TreiberStack` in `crossbeam` the best lock-free algorithm we can use? Does `crossbeam` have overhead that we could eliminate with a custom implementation? Other ideas?
Scope

In my view, the overhead of synchronization is holding us back from being more competitive with PCRE on very short matches/haystacks.