-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: remove configuration knobs #57
Comments
I'm not sure how many users of |
Yeah I should have updated this. I'm not holding myself up here as a result of the lack of feedback. My plan is to just to go forward with this. Just haven't got a chance to do it. It's not super high priority. |
Deprecating them sounds smart though. Might be a good forcing function. |
Sure, just removing it in the next breaking release also works. I just thought that deprecating would be the most reliable way to get potential objections. FWIW, I cannot find any code on GitHub setting |
@vks Yeah, there are some cases in which disabling byte classes can make the resulting search faster, but I've only ever seen slight improvements. (I've also seen regressions due to the increased size in the automaton.) |
These options aren't really carrying their weight. In a future release, aho-corasick will make both options enabled by default all the time with the impossibility of disabling them. The main reason for this is that these options require a quadrupling in the amount of code in this crate. While it's possible to see a performance hit when using byte classes, it should generally be very small. The improvement, if one exists, just doesn't see worth it. Please see #57 for more discussion. This is meant to mirror a similar decision occurring in regex-automata: BurntSushi/regex-automata#7.
These options aren't really carrying their weight. In a future release, aho-corasick will make both options enabled by default all the time with the impossibility of disabling them. The main reason for this is that these options require a quadrupling in the amount of code in this crate. While it's possible to see a performance hit when using byte classes, it should generally be very small. The improvement, if one exists, just doesn't see worth it. Please see #57 for more discussion. This is meant to mirror a similar decision occurring in regex-automata: BurntSushi/regex-automata#7.
In the next semver incompatible release, I am also planning to remove the ability to construct automatons using a smaller state ID representation. Instead, it will be fixed to |
This commit nearly wipes the slate clean with respect to the Aho-Corasick implementations and rebuilds things from scratch. The main thing that motivated this was actually to make it possible for Aho-Corasick automata to support both unanchored and anchored searches, without having to build two distinct automata for that purpose. (And the reason why I wanted that was to support such a configuration from within regex-automata.) However, it quickly turned into something else entirely. This commit: * Splits the old NFA into two: a noncontiguous NFA that always uses a sparse representation and a contiguous NFA that uses both dense and sparse representations. The old NFA was always noncontiguous but also used dense and sparse representations for state transitions. This split makes it possible to better optimize the NFA. Namely, its contiguous nature has better cache effects and allows us to *substantially* reduce memory usage (sometimes by an order of magnitude) when compared to the old NFA. * Both of the NFAs now unconditionally support both anchored and unanchored searches with essentially no additional cost. * The DFA remains mostly the same as before, except its stride is now always a power of 2 and it can be made to support both unanchored and anchored searches by copying the transition table and tweaking it. (The copy is why only unanchored searches are supported by default.) * We borrow a stripped down version of the 'Input' abstraction from my work-in-progress on the 'regex-automata' crate. This reduces the combinatorial explosion of different search routines while making common cases ergonomic. For example, both 'anchored' and 'earliest' are now configuration options on an 'Input'. * The main AhoCorasick type now has fallible APIs in addition to infallible APIs. It is still reasonable to use infallible APIs, as they panic only based on incorrect combinations of build and search configurations. Both of which may be static properties of a program, and thus, an improper configuration is quite likely (but not necessarily) to be a programmer error. * The underlying Aho-Corasick implementations are now exposed through sub-modules, and an 'Automaton' trait unifies their APIs. This now makes it possible for callers to walk the automaton directly like any old finite state machine. * A no-std but with-alloc mode was added. This essentially just gets rid of the streaming APIs (that require std::io) and the std::error::Error trait impls. * The old (and awkward) 'auto_configure' API is now gone. Instead, automatic configuration is now the default. * The old 'premultiply' option is now gone, and is now always enabled. * The old 'byte_classes' option remains, but is enabled by default and disabling it no longer confers any performance benefit. Instead, it only provides a debuggability benefit by making state transitions easier to understand. * 'AhoCorasick' is no longer generic over the representation for its state identifiers. All Aho-Corasick automatons now unconditionally use 'u32' for this. This is in practice big enough for most use cases, but small enough to provide reasonable memory usage. The real star of this show is the contiguous NFA. It not only significantly reduces heap memory used, but is also quite a bit faster than the old NFA while being only slightly slower to build. It doesn't quite match the speed of a DFA, but it is quite close. These properties combined make it an excellently balanced choice. Indeed, 'AhoCorasick::new' will use it for almost all cases, excepting when the number of patterns is very small. Other than that, the central type of this crate remains 'AhoCorasick', and its most common use cases will work almost unchanged from 'aho-corasick 0.7'. Closes #57, Closes #83
This commit nearly wipes the slate clean with respect to the Aho-Corasick implementations and rebuilds things from scratch. The main thing that motivated this was actually to make it possible for Aho-Corasick automata to support both unanchored and anchored searches, without having to build two distinct automata for that purpose. (And the reason why I wanted that was to support such a configuration from within regex-automata.) However, it quickly turned into something else entirely. This commit: * Splits the old NFA into two: a noncontiguous NFA that always uses a sparse representation and a contiguous NFA that uses both dense and sparse representations. The old NFA was always noncontiguous but also used dense and sparse representations for state transitions. This split makes it possible to better optimize the NFA. Namely, its contiguous nature has better cache effects and allows us to *substantially* reduce memory usage (sometimes by an order of magnitude) when compared to the old NFA. * Both of the NFAs now unconditionally support both anchored and unanchored searches with essentially no additional cost. * The DFA remains mostly the same as before, except its stride is now always a power of 2 and it can be made to support both unanchored and anchored searches by copying the transition table and tweaking it. (The copy is why only unanchored searches are supported by default.) * We borrow a stripped down version of the 'Input' abstraction from my work-in-progress on the 'regex-automata' crate. This reduces the combinatorial explosion of different search routines while making common cases ergonomic. For example, both 'anchored' and 'earliest' are now configuration options on an 'Input'. * The main AhoCorasick type now has fallible APIs in addition to infallible APIs. It is still reasonable to use infallible APIs, as they panic only based on incorrect combinations of build and search configurations. Both of which may be static properties of a program, and thus, an improper configuration is quite likely (but not necessarily) to be a programmer error. * The underlying Aho-Corasick implementations are now exposed through sub-modules, and an 'Automaton' trait unifies their APIs. This now makes it possible for callers to walk the automaton directly like any old finite state machine. * A no-std but with-alloc mode was added. This essentially just gets rid of the streaming APIs (that require std::io) and the std::error::Error trait impls. * The old (and awkward) 'auto_configure' API is now gone. Instead, automatic configuration is now the default. * The old 'premultiply' option is now gone, and is now always enabled. * The old 'byte_classes' option remains, but is enabled by default and disabling it no longer confers any performance benefit. Instead, it only provides a debuggability benefit by making state transitions easier to understand. * 'AhoCorasick' is no longer generic over the representation for its state identifiers. All Aho-Corasick automatons now unconditionally use 'u32' for this. This is in practice big enough for most use cases, but small enough to provide reasonable memory usage. The real star of this show is the contiguous NFA. It not only significantly reduces heap memory used, but is also quite a bit faster than the old NFA while being only slightly slower to build. It doesn't quite match the speed of a DFA, but it is quite close. These properties combined make it an excellently balanced choice. Indeed, 'AhoCorasick::new' will use it for almost all cases, excepting when the number of patterns is very small. Other than that, the central type of this crate remains 'AhoCorasick', and its most common use cases will work almost unchanged from 'aho-corasick 0.7'. Closes #57, Closes #83
This RFC is basically a mirror of BurntSushi/regex-automata#7
The idea here is that I would like to make
premultiply=true, byte_classes=true
be the only option that one can choose for DFAs. (It is currently the default for DFAs.) I think it was a mistake to expose these knobs, as disabling them doesn't really confer benefits (if any). But they are quite costly in terms of code size and clarity.Are there any strenuous objections to removing these options?
The text was updated successfully, but these errors were encountered: