
Potential performance improvement, maybe #83

Closed
itamarst opened this issue Jun 14, 2021 · 3 comments · Fixed by #99

Comments

@itamarst

At https://github.com/BurntSushi/aho-corasick/blob/master/src/classes.rs#L46, alphabet_len is calculated each time the function is called. That function is called from (what appears to me to be) a hot inner function, next_state(). The value doesn't seem to change once the ByteClasses are constructed, so there's no need to recalculate it on every call; it could be computed once and cached, and only recalculated when the ByteClasses changes.

(I imagine the compiler might be smart enough to figure this out? But then again maybe not.)
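For illustration, here is a minimal sketch of the caching idea using a stand-in type rather than the crate's actual ByteClasses internals; the field names and the set method are hypothetical:

```rust
// A stand-in for the crate's ByteClasses, not its real definition; only the
// shape of the idea is meant to match.
struct ByteClasses {
    classes: [u8; 256],
    // Cached alphabet length, kept in sync whenever `classes` changes, so the
    // hot next_state() path reads a field instead of recomputing the value.
    alphabet_len: usize,
}

impl ByteClasses {
    fn set(&mut self, byte: u8, class: u8) {
        self.classes[byte as usize] = class;
        // Recompute once per mutation instead of once per state transition.
        self.alphabet_len = self.classes[255] as usize + 1;
    }

    #[inline]
    fn alphabet_len(&self) -> usize {
        self.alphabet_len
    }
}
```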

@BurntSushi (Owner)

It's plausible that this is a problem; it's plausible that the compiler is smart enough to work around it; and it's even plausible that the compiler isn't smart enough and it still isn't a problem in practice. These state machine loops tend to be limited by how quickly one can shovel bytes into CPU registers, since DFAs don't tend to have great locality properties.

I think the thing to do here is to scrutinize the generated code and see whether the +1 is getting optimized out or not. Regardless of that, the next step would be to very rigorously measure the perf difference of storing the alphabet length versus computing it. I've always found it somewhat difficult to reliably observe perf differences from small tweaks like this to the DFA matching loop.

@itamarst (Author)

So I tried this, and... the benchmark diffs are all over the place, and in enough places that I suspect it's just noise 😢 I can't really tell whether it makes a difference or not.

One potential approach is something like https://github.com/bheisler/iai, which would give consistent results. I've been using the equivalent for some of my projects, and it's really nice to be able to tell whether a micro-optimization worked or not (presuming the emulated CPU behaves like a real one, which of course it does not).
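For reference, a minimal sketch of what an iai benchmark for this crate might look like; the bench name, patterns, and haystack are made up, and the aho-corasick 0.7 API is assumed:

```rust
// benches/iai.rs -- iai measures the whole function, so automaton
// construction is counted alongside the search in this sketch.
// (Cargo.toml would also need a [[bench]] entry with harness = false.)
use aho_corasick::AhoCorasick;
use iai::black_box;

fn count_matches() -> usize {
    let ac = AhoCorasick::new(&["apple", "maple", "snapple"]);
    ac.find_iter(black_box("Nobody likes maple in their apple flavored Snapple."))
        .count()
}

iai::main!(count_matches);
```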

@BurntSushi (Owner)

@itamarst Yeah, in those cases, I tend to revert to ad hoc benchmarks using the aho-corasick-debug tool. That is, I run it on much bigger inputs on an otherwise idle system. This doesn't necessarily get rid of noise, but it does tend to help.

iai definitely looks worth a try. I hadn't heard of it before. It will likely be a bit before I get around to that though.

BurntSushi added a commit that referenced this issue Jan 5, 2023
This commit nearly wipes the slate clean with respect to the
Aho-Corasick implementations and rebuilds things from scratch. The main
thing that motivated this was actually to make it possible for
Aho-Corasick automata to support both unanchored and anchored searches,
without having to build two distinct automata for that purpose. (And the
reason why I wanted that was to support such a configuration from
within regex-automata.)

However, it quickly turned into something else entirely. This commit:

* Splits the old NFA into two: a noncontiguous NFA that always uses a
sparse representation and a contiguous NFA that uses both dense and
sparse representations. The old NFA was always noncontiguous but also
used dense and sparse representations for state transitions. This split
makes it possible to better optimize the NFA. Namely, its contiguous
nature has better cache effects and allows us to *substantially* reduce
memory usage (sometimes by an order of magnitude) when compared to the
old NFA.
* Both of the NFAs now unconditionally support both anchored and
unanchored searches with essentially no additional cost.
* The DFA remains mostly the same as before, except its stride is now
always a power of 2 and it can be made to support both unanchored and
anchored searches by copying the transition table and tweaking it. (The
copy is why only unanchored searches are supported by default.)
* We borrow a stripped down version of the 'Input' abstraction from my
work-in-progress on the 'regex-automata' crate. This reduces the
combinatorial explosion of different search routines while making common
cases ergonomic. For example, both 'anchored' and 'earliest' are now
configuration options on an 'Input'.
* The main AhoCorasick type now has fallible APIs in addition to
infallible APIs. It is still reasonable to use the infallible APIs, as
they panic only on incorrect combinations of build and search
configurations, both of which may be static properties of a program;
thus, an improper configuration is quite likely (though not necessarily)
a programmer error.
* The underlying Aho-Corasick implementations are now exposed through
sub-modules, and an 'Automaton' trait unifies their APIs. This now makes
it possible for callers to walk the automaton directly like any old
finite state machine.
* A no-std but with-alloc mode was added. This essentially just gets rid
of the streaming APIs (that require std::io) and the std::error::Error
trait impls.
* The old (and awkward) 'auto_configure' API is now gone. Instead,
automatic configuration is now the default.
* The old 'premultiply' option is now gone, and is now always enabled.
* The old 'byte_classes' option remains, but is enabled by default and
disabling it no longer confers any performance benefit. Instead, it only
provides a debuggability benefit by making state transitions easier to
understand.
* 'AhoCorasick' is no longer generic over the representation for its
state identifiers. All Aho-Corasick automatons now unconditionally use
'u32' for this. This is in practice big enough for most use cases, but
small enough to provide reasonable memory usage.

The real star of this show is the contiguous NFA. It not only
significantly reduces heap memory used, but is also quite a bit faster
than the old NFA while being only slightly slower to build. It doesn't
quite match the speed of a DFA, but it is quite close. These properties
combined make it an excellently balanced choice. Indeed,
'AhoCorasick::new' will use it for almost all cases, excepting when the
number of patterns is very small.

Other than that, the central type of this crate remains 'AhoCorasick',
and its most common use cases will work almost unchanged from
'aho-corasick 0.7'.

Closes #57, Closes #83
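The commit message above mentions the new 'Input' abstraction and the fallible search APIs. As a rough illustration of how those pieces fit together after the rewrite, here is a minimal sketch; the exact builder options and method names used below (start_kind, StartKind, Anchored, try_find) are assumptions based on the description above rather than quotes from the crate's documentation:

```rust
use aho_corasick::{AhoCorasick, Anchored, Input, StartKind};

fn main() {
    // Ask for an automaton that supports both unanchored and anchored
    // searches (per the commit message, anchored DFA support costs a copy
    // of the transition table, so it is opt-in).
    let ac = AhoCorasick::builder()
        .start_kind(StartKind::Both)
        .build(["Samwise", "Sam"])
        .unwrap();

    // Unanchored search through the infallible API.
    if let Some(m) = ac.find("gondor calls for Sam") {
        println!("unanchored: pattern {} at {:?}", m.pattern().as_usize(), m.span());
    }

    // Anchored, earliest-match search through the fallible API, with both
    // options configured on the Input itself.
    let input = Input::new("Samwise Gamgee").anchored(Anchored::Yes).earliest(true);
    if let Some(m) = ac.try_find(input).unwrap() {
        println!("anchored: pattern {} at {:?}", m.pattern().as_usize(), m.span());
    }
}
```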