
Potential performance improvement, maybe #83

Closed
itamarst opened this issue Jun 14, 2021 · 3 comments · Fixed by #99

Comments

@itamarst

At https://github.com/BurntSushi/aho-corasick/blob/master/src/classes.rs#L46, alphabet_len is calculated each time the function is called. That function is called from (what appears to me to be) a hot inner function, next_state(). The value doesn't seem to change once the ByteClasses are constructed, so there's no need to recalculate it on every call; it could be computed once and cached, and only recalculated when the ByteClasses changes.

(I imagine the compiler might be smart enough to figure this out? But then again maybe not.)
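For illustration, here is a minimal sketch of the caching idea using a stand-in type rather than the crate's actual ByteClasses internals; the field names and the set method are hypothetical:

```rust
// A stand-in for the crate's ByteClasses, not its real definition; only the
// shape of the idea is meant to match.
struct ByteClasses {
    classes: [u8; 256],
    // Cached alphabet length, kept in sync whenever `classes` changes, so the
    // hot next_state() path reads a field instead of recomputing the value.
    alphabet_len: usize,
}

impl ByteClasses {
    fn set(&mut self, byte: u8, class: u8) {
        self.classes[byte as usize] = class;
        // Recompute once per mutation instead of once per state transition.
        self.alphabet_len = self.classes[255] as usize + 1;
    }

    #[inline]
    fn alphabet_len(&self) -> usize {
        self.alphabet_len
    }
}
```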

@BurntSushi (Owner)

It's plausible that this is a problem; it's plausible that the compiler is smart enough to work around it; and it's even plausible that the compiler isn't smart enough and it still isn't a problem in practice. These state machine loops tend to be limited by how quickly one can shovel bytes into CPU registers, since DFAs don't tend to have great locality properties.

I think the thing to do here is to scrutinize the generated code and see whether the +1 is getting optimized out or not. Regardless of that, the next step would be to very rigorously measure the perf difference of storing the alphabet length versus computing it. I've always found it somewhat difficult to reliably observe perf differences from small tweaks like this to the DFA matching loop.

@itamarst (Author)

So I tried this, and... the benchmark diffs are all over the place, and in enough places that I suspect it's just noise 😢 I can't really tell whether it makes a difference or not.

One potential approach is something like https://github.com/bheisler/iai, which would give consistent results. I've been using the equivalent for some of my projects, and it's really nice to be able to tell whether a micro-optimization worked or not (presuming the emulated CPU behaves like a real one, which of course it does not).
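For reference, a minimal sketch of what an iai benchmark for this crate might look like; the bench name, patterns, and haystack are made up, and the aho-corasick 0.7 API is assumed:

```rust
// benches/iai.rs -- iai measures the whole function, so automaton
// construction is counted alongside the search in this sketch.
// (Cargo.toml would also need a [[bench]] entry with harness = false.)
use aho_corasick::AhoCorasick;
use iai::black_box;

fn count_matches() -> usize {
    let ac = AhoCorasick::new(&["apple", "maple", "snapple"]);
    ac.find_iter(black_box("Nobody likes maple in their apple flavored Snapple."))
        .count()
}

iai::main!(count_matches);
```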

@BurntSushi (Owner)

@itamarst Yeah, in those cases, I tend to revert to ad hoc benchmarks using the aho-corasick-debug tool. That is, I run it on much bigger inputs on an otherwise idle system. This doesn't necessarily get rid of noise, but it does tend to help.

iai definitely looks worth a try. I hadn't heard of it before. It will likely be a bit before I get around to that though.

BurntSushi added a commit that referenced this issue Jan 5, 2023
This commit nearly wipes the slate clean with respect to the
Aho-Corasick implementations and rebuilds things from scratch. The main
thing that motivated this was actually to make it possible for
Aho-Corasick automata to support both unanchored and anchored searches,
without having to build two distinct automata for that purpose. (And the
reason why I wanted that was to support such a configuration from
within regex-automata.)

However, it quickly turned into something else entirely. This commit:

* Splits the old NFA into two: a noncontiguous NFA that always uses a
sparse representation and a contiguous NFA that uses both dense and
sparse representations. The old NFA was always noncontiguous but also
used dense and sparse representations for state transitions. This split
makes it possible to better optimize the NFA. Namely, its contiguous
nature has better cache effects and allows us to *substantially* reduce
memory usage (sometimes by an order of magnitude) when compared to the
old NFA.
* Both of the NFAs now unconditionally support both anchored and
unanchored searches with essentially no additional cost.
* The DFA remains mostly the same as before, except its stride is now
always a power of 2 and it can be made to support both unanchored and
anchored searches by copying the transition table and tweaking it. (The
copy is why only unanchored searches are supported by default.)
* We borrow a stripped down version of the 'Input' abstraction from my
work-in-progress on the 'regex-automata' crate. This reduces the
combinatorial explosion of different search routines while making common
cases ergonomic. For example, both 'anchored' and 'earliest' are now
configuration options on an 'Input'.
* The main AhoCorasick type now has fallible APIs in addition to
infallible APIs. It is still reasonable to use the infallible APIs, as
they panic only on incorrect combinations of build and search
configurations, both of which may be static properties of a program;
thus, an improper configuration is quite likely (though not necessarily)
a programmer error.
* The underlying Aho-Corasick implementations are now exposed through
sub-modules, and an 'Automaton' trait unifies their APIs. This now makes
it possible for callers to walk the automaton directly like any old
finite state machine.
* A no-std but with-alloc mode was added. This essentially just gets rid
of the streaming APIs (that require std::io) and the std::error::Error
trait impls.
* The old (and awkward) 'auto_configure' API is now gone. Instead,
automatic configuration is now the default.
* The old 'premultiply' option is now gone, and is now always enabled.
* The old 'byte_classes' option remains, but is enabled by default and
disabling it no longer confers any performance benefit. Instead, it only
provides a debuggability benefit by making state transitions easier to
understand.
* 'AhoCorasick' is no longer generic over the representation for its
state identifiers. All Aho-Corasick automatons now unconditionally use
'u32' for this. This is in practice big enough for most use cases, but
small enough to provide reasonable memory usage.

The real star of this show is the contiguous NFA. It not only
significantly reduces heap memory used, but is also quite a bit faster
than the old NFA while being only slightly slower to build. It doesn't
quite match the speed of a DFA, but it is quite close. These properties
combined make it an excellently balanced choice. Indeed,
'AhoCorasick::new' will use it for almost all cases, excepting when the
number of patterns is very small.

Other than that, the central type of this crate remains 'AhoCorasick',
and its most common use cases will work almost unchanged from
'aho-corasick 0.7'.

Closes #57, Closes #83
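The commit message above mentions the new 'Input' abstraction and the fallible search APIs. As a rough illustration of how those pieces fit together after the rewrite, here is a minimal sketch; the exact builder options and method names used below (start_kind, StartKind, Anchored, try_find) are assumptions based on the description above rather than quotes from the crate's documentation:

```rust
use aho_corasick::{AhoCorasick, Anchored, Input, StartKind};

fn main() {
    // Ask for an automaton that supports both unanchored and anchored
    // searches (per the commit message, anchored DFA support costs a copy
    // of the transition table, so it is opt-in).
    let ac = AhoCorasick::builder()
        .start_kind(StartKind::Both)
        .build(["Samwise", "Sam"])
        .unwrap();

    // Unanchored search through the infallible API.
    if let Some(m) = ac.find("gondor calls for Sam") {
        println!("unanchored: pattern {} at {:?}", m.pattern().as_usize(), m.span());
    }

    // Anchored, earliest-match search through the fallible API, with both
    // options configured on the Input itself.
    let input = Input::new("Samwise Gamgee").anchored(Anchored::Yes).earliest(true);
    if let Some(m) = ac.try_find(input).unwrap() {
        println!("anchored: pattern {} at {:?}", m.pattern().as_usize(), m.span());
    }
}
```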