implement a faster memory pool #188
cc @aturon I think I'm out of my depth here. Your feedback on this would be amazing. :-)
One idea to at least help …
cc @carllerche because you wrote http://carllerche.github.io/pool/pool/
To clarify, this is basically just a cache for regexes which is pulled from every time a match is executed, and pushed back onto whenever the match is done? Reading the other information, however, I agree with your constraints for now.

I don't personally know of any unbounded lock-free MPMC queues, but I've seen a bounded version which I think used to be implemented in the standard library at some point, and I think there's a translation into Rust lying around somewhere. Perhaps that'd be faster than the Treiber stack, if the memory management in crossbeam is the overhead here? I suspect it'd still be slower than …

It does sound to me, though, like a good idea for many matches to cache their data locally (such as …)
@alexcrichton Thanks for the helpful reply!
Right.
It looks like @carllerche has a bounded MPMC queue here (based on the implementation in your link, it seems): http://carllerche.github.io/syncbox/syncbox/struct.ArrayQueue.html. That should at least be straightforward enough to try. I guess I assumed that a queue might have more overhead than necessary, since we don't actually need a queue here, but certainly, the fact that an implementation exists means it's at least worth a try. I'd also be surprised if it was faster than a Treiber stack! I've also been looking at Go's …
Whoa! Looks like this was implemented way back when in rust-lang/rust@5876e21.
Ding ding ding! Looks like we have a winner. Using …
OK, I am going to sink my teeth into this direction. I may end up with a more general pool crate that … To your point @alexcrichton, a bounded queue is fine, because we can fall back to the "dumb" queue with a mutex.
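For illustration, that fallback strategy might look like the sketch below. (This uses today's `crossbeam-queue` crate's `ArrayQueue`, which is an assumption on my part; it is not the syncbox queue linked above, and the names are made up.)

```rust
use std::sync::Mutex;

use crossbeam_queue::ArrayQueue; // crossbeam-queue = "0.3"

/// Sketch: try a lock-free bounded queue first, and spill into the
/// "dumb" mutexed Vec only when the queue is full (or empty, on get).
struct HybridPool<T> {
    fast: ArrayQueue<T>,
    slow: Mutex<Vec<T>>,
}

impl<T> HybridPool<T> {
    fn with_capacity(cap: usize) -> HybridPool<T> {
        HybridPool {
            fast: ArrayQueue::new(cap),
            slow: Mutex::new(Vec::new()),
        }
    }

    fn put(&self, value: T) {
        // ArrayQueue::push hands the value back on overflow, so nothing
        // is lost when we fall back to the mutex-guarded Vec.
        if let Err(value) = self.fast.push(value) {
            self.slow.lock().unwrap().push(value);
        }
    }

    fn get(&self) -> Option<T> {
        self.fast.pop().or_else(|| self.slow.lock().unwrap().pop())
    }
}
```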
OK, I think I've mostly solved this satisfactorily with a spin lock. It's still unsafe, but the implementation is quite a bit simpler and works well when there's low contention.
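The actual implementation is in the crate linked below; the following is only a minimal sketch of the shape of that idea, with a simplified `get`/`put` API assumed for illustration.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

/// A pool of reusable scratch values guarded by a simple spinlock.
pub struct Pool<T: Send> {
    lock: AtomicBool,
    stack: UnsafeCell<Vec<T>>,
    create: Box<dyn Fn() -> T + Send + Sync>,
}

// SAFETY: every access to `stack` is serialized by the spinlock below.
unsafe impl<T: Send> Sync for Pool<T> {}

impl<T: Send> Pool<T> {
    pub fn new(create: Box<dyn Fn() -> T + Send + Sync>) -> Pool<T> {
        Pool {
            lock: AtomicBool::new(false),
            stack: UnsafeCell::new(Vec::new()),
            create,
        }
    }

    /// Pop a cached value, or create a fresh one if the pool is empty.
    pub fn get(&self) -> T {
        self.with_lock(|stack| stack.pop())
            .unwrap_or_else(|| (self.create)())
    }

    /// Return a value to the pool so a later search can reuse it.
    pub fn put(&self, value: T) {
        self.with_lock(|stack| stack.push(value));
    }

    fn with_lock<R>(&self, f: impl FnOnce(&mut Vec<T>) -> R) -> R {
        // Spin until we flip the flag from false to true. This is fine under
        // low contention, since the critical section is just a Vec push/pop.
        while self.lock.swap(true, Ordering::Acquire) {}
        let result = f(unsafe { &mut *self.stack.get() });
        self.lock.store(false, Ordering::Release);
        result
    }
}
```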
Crate is here: https://github.com/BurntSushi/mempool

And finally, the benchmarks as done in the other experiments:
Which looks pretty good!
Overall before/after:
So I think some of those (particularly the …
@BurntSushi Glad to see you've found something reasonably workable. I'm actually eager to use this as a more realistic benchmark for crossbeam, and to see what, if any, tuning can be done. I haven't had a chance to dig into the details yet, though -- but is there an easy link to the whole benchmarking setup?
(And I should mention that if you can get away with a queue, either of the queues is going to be much better than the stack contention-wise. But the fact that …)
@aturon Currently, the regex work is in a branch … I attempted to encapsulate the specific thing I cared about in the benchmarks for the …

I think keeping a guard open across operations is only half the picture, unfortunately, since it isn't always possible. For example, …

One thing I did notice in the implementation of the TreiberStack is that there's always an allocation for every …
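That last point is the classic cost of a Treiber stack: every push heap-allocates a fresh node before CAS-ing it onto the head. Here's a stripped-down sketch of the structure to show where the allocation lives; note that the pop below is deliberately naive and is not sound under real concurrency, which is exactly the problem crossbeam's epoch-based reclamation solves.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

pub struct NaiveTreiberStack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> NaiveTreiberStack<T> {
    pub fn new() -> NaiveTreiberStack<T> {
        NaiveTreiberStack { head: AtomicPtr::new(ptr::null_mut()) }
    }

    pub fn push(&self, value: T) {
        // Here is the allocation in question: one Box per push, even if
        // another thread pops the value again immediately.
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*node).next = head };
            if self
                .head
                .compare_exchange(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    pub fn pop(&self) -> Option<T> {
        loop {
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return None;
            }
            // UNSOUND with concurrent poppers: `head` may already have been
            // freed by another thread. Epoch GC defers the free to make this
            // dereference safe in a real implementation.
            let next = unsafe { (*head).next };
            if self
                .head
                .compare_exchange(head, next, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return Some(unsafe { Box::from_raw(head) }.value);
            }
        }
    }
}
```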
@BurntSushi The SegmentedStack data structure batches allocations (in the way its name suggests). It's not yet highly tuned in crossbeam, but it has a lot of potential. One nice thing is that most of the time it only needs a fetch-and-inc rather than a compare-and-swap, which is often faster (since multiple such operations can be executed in parallel by the hardware).
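A toy illustration of that difference (not crossbeam's code): fetch-and-inc always completes in one atomic instruction, while a compare-and-swap must re-read and retry every time it loses a race.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Fetch-and-inc: unconditionally succeeds; the hardware can service
// several of these from different cores back to back.
fn next_index_fai(counter: &AtomicUsize) -> usize {
    counter.fetch_add(1, Ordering::Relaxed)
}

// Compare-and-swap loop: a thread must observe the current value and
// retry whenever another thread beat it there, so contended updates
// serialize (and waste work on every retry).
fn next_index_cas(counter: &AtomicUsize) -> usize {
    loop {
        let cur = counter.load(Ordering::Relaxed);
        if counter
            .compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed)
            .is_ok()
        {
            return cur;
        }
    }
}
```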
@aturon Neat, I missed that. I added benchmarks for both queues:
Definitely looks like SegQueue is a bit faster than a TreiberStack for this use case!
I might be dead in the water. Passing a guard around means that iterators like …
Even simpler constructs like this will fail to compile:

```rust
for pos in Regex::new(r"\b\w{13}\b").unwrap().find_iter(text) {
    // ...
}
```

with …
In this case, …
The principal change in this commit is a complete rewrite of how literals are detected from a regular expression. In particular, we now traverse the abstract syntax to discover literals instead of the compiled byte code. This permits more tunable control over which and how many literals are extracted, and is now exposed in the `regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based on frequency analysis. We end up regressing on a couple benchmarks slightly because of this, but gain in some others and in general should be faster in a broader number of cases. (Principally because we try to run `memchr` on the rarest byte in a literal.) This should also greatly improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix literals exist but no prefix literals exist, then we can quickly scan for suffix matches and then run the DFA in reverse to find matches. (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool (from the new `mempool` crate). This reduces some amount of constant overhead and improves several benchmarks that either search short haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it could only contain 2 or more). The InvalidSet error variant is now deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the start state was always the first instruction, which is trivially wrong for an expression like `^☃$`. This bug persisted because it typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published sub-crate. The CLI tool can answer various facts about regular expressions, such as printing its AST, its compiled byte code or its detected literals.

Closes #96, #188, #189
Problem
When a regex search executes, it has to choose a matching engine (sometimes more than one) to carry out the search. Each matching engine needs some amount of fixed mutable space on the heap to carry out a search, which I'll call "scratch space." In general, this space is reusable, and reusing it leads to significant performance benefits when using a regular expression to carry out multiple searches. (For example, the scratch space may contain computed DFA states.) Scratch space is used every time a regular expression executes a search. For example, calling

```rust
re.find_iter("...")
```

will execute possibly many searches, depending on how many matches it finds.

Here are some constraints I've been working with:
1. `Regex` must be `Send` and `Sync`. This permits one to share a regex across multiple threads without any external synchronization.
2. The `regex` crate should never spawn a thread.
3. It's OK to copy the scratch space, so that each simultaneous search can use its own copy.

Constraint (1) is the killer, because it means synchronizing concurrent access to mutable state. For example, one might have an `Arc<Regex>` where the `Regex` is used simultaneously among multiple threads. If we gave up on the `Sync` bound, then callers would need to either `Clone` regular expressions to use them across multiple threads, or put them in a `Mutex`. The latter is a bit surprising, since `Regex` doesn't have any public mutable methods; it is also very poor performance-wise, because one search will completely block all other searches. The former, cloning, is somewhat acceptable, if a little wasteful. That is, if we dropped the `Sync` bound, I'd expect users to clone regular expressions if they want to use them across multiple threads.

Constraint (2) just seems like good sense. To me, a thread spawned as a result of running a regex violates the principle of least surprise.
Constraint (3) permits the implementation to be simple and should make contention a non-factor. If we were more memory conscious, we'd never copy the scratch space, which would mean that each of the regex engines would be forced to do their own type of synchronization. Not only is this much more complex, but it means contention could be a real problem while searching, which seems unfortunate.
Given all of this, it is my belief that the key thing worth optimizing is the overhead of synchronization itself. The cost of copying the scratch space should be amortized through reuse, and contention should be extremely limited since synchronization only needs to occur at the start and end of every search. (I suppose contention could become an issue if a regex that matches very often on very short spans is used simultaneously across multiple threads. The good thing about this is that the caller could work around this by simply cloning the regex, avoiding contention altogether.)
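For concreteness, constraint (1) is what allows usage like the following, with no locking visible to the caller. (A minimal example against the crate's public API; the pattern and thread count are arbitrary.)

```rust
use std::sync::Arc;
use std::thread;

use regex::Regex;

fn main() {
    // One compiled regex, shared across threads with no external
    // synchronization. This compiles only because Regex is Send + Sync.
    let re = Arc::new(Regex::new(r"\w+").unwrap());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let re = re.clone();
            thread::spawn(move || {
                // Each concurrent search needs its own scratch space,
                // which is what the pool hands out internally.
                assert!(re.is_match("hello world"));
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```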
Current solution

The most straightforward solution to this problem is to wrap some collection data structure in a `Mutex`. The data structure could trivially be a queue, stack or (singly) linked list. The current implementation is so simple that it can be digested in a quick skim. In particular, the only operations we need to support are `get` and `pop`. No ordering invariants are necessary, since all copies of scratch space are equally usable for every regex search.
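As a sketch, the mutex-wrapped version amounts to the following (illustrative names, not the crate's actual internals):

```rust
use std::sync::Mutex;

/// The "dumb" pool: a stack of scratch spaces behind a single Mutex.
struct MutexPool<T> {
    stack: Mutex<Vec<T>>,
    create: fn() -> T,
}

impl<T> MutexPool<T> {
    fn new(create: fn() -> T) -> MutexPool<T> {
        MutexPool { stack: Mutex::new(Vec::new()), create }
    }

    /// Grab any available scratch space, or make a new one. Order is
    /// irrelevant, so a Vec used as a stack suffices.
    fn get(&self) -> T {
        self.stack.lock().unwrap().pop().unwrap_or_else(|| (self.create)())
    }

    /// Hand the scratch space back at the end of a search.
    fn put(&self, value: T) {
        self.stack.lock().unwrap().push(value);
    }
}
```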
Benchmarks

I have three sets of benchmarks representing three pretty easy-to-implement strategies. The first is the baseline, which uses a `RefCell<Vec<T>>`. This obviously does no synchronization, so in this world, `Regex` doesn't impl `Sync`. The second is the current solution, which uses a `Mutex<Vec<T>>`. The third uses a lock-free stack from `crossbeam`, namely `TreiberStack<T>`.

Here is a comparison between the baseline and `Mutex<Vec<T>>`, showing only benchmarks with more than a 10% difference (a positive percent indicates how much slower mutexes are than refcells in this case):

And here is a comparison between the baseline and `TreiberStack<T>`:

I note that the comparisons above seem entirely expected to me. Outside of noise, synchronization makes matching universally slower. Moreover, the benchmarks that actually show up in the list (which is a subset of all benchmarks) correspond to benchmarks with many searches over short haystacks, or many short matches over long haystacks. This is exactly the case where constant overhead will make a difference.
And, to make it easier to read, here is a comparison between `Mutex<Vec<T>>` and `TreiberStack<T>`:

I admit that I found it somewhat surprising that a lock-free stack was being beaten by a mutex, but this is probably due to my complete ignorance of lock-free algorithms. (Hence why I'm writing this ticket.)
Other things

When using a `TreiberStack<T>`, `perf top` reports `mem::epoch::participant::Participant::enter` as a potential hotspot.

When using a `Mutex<Vec<T>>`, `perf top` reports `pthread_mutex_lock` and `__pthread_mutex_unlock_usercnt` as potential hotspots.

In both cases, other hotspots of course also appear, such as methods in the DFA, `memchr`, etc.

Another thing I've noticed in profiling is how much time is being spent in the `Drop` impl for `PoolGuard`. I tracked this down to time spent in `memcpy`, so I moved the representation of scratch spaces to a `Box`, which definitely helped, especially for the DFA, whose cache struct isn't tiny. I'm not sure what can be done about this, though.
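To illustrate the `Box` change: once the scratch space lives behind a `Box`, moving it in and out of the pool copies a single pointer instead of `memcpy`-ing the whole cache struct. (`DfaCache` below is a stand-in for the real cache type.)

```rust
/// Stand-in for the DFA's cache struct, which "isn't tiny."
struct DfaCache {
    states: Vec<u8>,
    table: [u32; 256],
}

type Scratch = Box<DfaCache>;

fn main() {
    let mut pool: Vec<Scratch> = Vec::new();
    let scratch: Scratch = Box::new(DfaCache {
        states: vec![0; 1024],
        table: [0; 256],
    });
    // Pushing/popping moves 8 bytes (the pointer) rather than the
    // kilobyte-plus payload it owns.
    pool.push(scratch);
    let _reused = pool.pop().unwrap();
}
```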
???

So what can we do here? Is a `TreiberStack` in `crossbeam` the best lock-free algorithm we can use? Does `crossbeam` have overhead that we could eliminate with a custom implementation? Other ideas?
Scope

In my view, the overhead of synchronization is holding us back from being more competitive with PCRE on very short matches/haystacks.