RFC: removing configuration knobs and ability to choose state ID representation #7
Comments
I considered using First off, my use case is probably not representative at all, being a joke project, but maybe it's still better than nothing. Basically, I serialize the DFA into a FAT32 filesystem, where the directories are essentially the states, and the directory entries, which reference other directories, are the transitions. One reason I considered using In the end, I decided against using
Interesting! Thanks for the feedback. And I think that is perhaps the most creative implementation of regexes that I've seen. Wow. :-)
These options aren't really carrying their weight. In a future release, aho-corasick will enable both options all the time, with no way to disable them. The main reason for this is that these options require a quadrupling in the amount of code in this crate. While it's possible to see a performance hit when using byte classes, it should generally be very small. The improvement, if one exists, just doesn't seem worth it. Please see #57 for more discussion. This is meant to mirror a similar decision occurring in regex-automata: BurntSushi/regex-automata#7.
I've also decided to remove the u32 is really the sweet spot. u8 is generally too small for most things. u16 is plausible to use in some cases, but it doesn't take much to blow its budget, particularly for DFAs. u32 uses double the storage of a u16, but doubling a small number is still pretty small. By the time you start going beyond what a u32 can store, your memory and time requirements are so large as to be impractical, so a u64 is never really needed in practice. Making the state ID generic was making the code a bit too complex, and I was finding it difficult to reason about its guarantees despite trying my best to button them down in a trait. I'll also be fixing pattern IDs to u32, such that they will share the same representation.
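To make the storage argument concrete, here is a rough back-of-the-envelope sketch. It assumes a dense table with one entry per (state, byte) pair and premultiplied state IDs, so the largest ID is roughly the state count times the alphabet length. The helper names `dense_table_bytes` and `max_states` are hypothetical and are not part of regex-automata.

```rust
use std::mem::size_of;

/// Bytes needed for a dense transition table with `states` rows and
/// `alphabet_len` columns, where each entry is a state ID of type `T`.
/// (Hypothetical helper for illustration only.)
fn dense_table_bytes<T>(states: u64, alphabet_len: u64) -> u64 {
    states * alphabet_len * size_of::<T>() as u64
}

/// Largest number of states whose premultiplied ID
/// (state index * alphabet_len) still fits in `max_id`.
/// This assumes premultiplied IDs; other layouts have other limits.
fn max_states(max_id: u64, alphabet_len: u64) -> u64 {
    max_id / alphabet_len + 1
}

fn main() {
    // Doubling the ID width doubles the table, but a small table stays small.
    println!("1000 states, u16: {} bytes", dense_table_bytes::<u16>(1000, 256));
    println!("1000 states, u32: {} bytes", dense_table_bytes::<u32>(1000, 256));

    // Without byte classes (256 columns), narrow IDs run out quickly.
    println!("u8  max states: {}", max_states(u8::MAX as u64, 256));
    println!("u16 max states: {}", max_states(u16::MAX as u64, 256));
    println!("u32 max states: {}", max_states(u32::MAX as u64, 256));

    // A table that exhausts u32 under these assumptions is already huge.
    let full = dense_table_bytes::<u32>(max_states(u32::MAX as u64, 256), 256);
    println!("table size at the u32 limit: {} bytes", full);
}
```

Under these assumptions a u16 tops out at 256 states with a full 256-byte alphabet, while a table that reaches the u32 limit already occupies 16 GiB.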
This was done. Everything is just
I'd really like to hear from folks if they are using anything but the default configuration when it comes to building DFAs. That is, are you setting either premultiply or byte_classes to false? If so, could you say what motivated it?

I am strongly considering removing everything but the premultiply=true, byte_classes=true and byte_classes=false configurations for dense and sparse DFAs, respectively. This will permit me to remove a lot of code and the gross enum that switches over different types of DFAs.

There is occasionally some performance difference between these options, but it is typically extremely small. In general, premultiply=true, byte_classes=true is the "best" option because it shrinks the size of a DFA considerably (via byte classes) while also keeping the hot loop light on instructions due to the premultiplication.
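For readers unfamiliar with the two options, the following is a minimal sketch of what they mean for a dense transition table. It is a toy DFA with a made-up layout (the names `TABLE`, `STRIDE`, `byte_classes`, `premultiply`, `run_plain` and `run_premultiplied` are all hypothetical) and is not how regex-automata represents its DFAs internally.

```rust
// Toy dense DFA that tracks whether the input contains "ab".
// Hypothetical layout for illustration; not regex-automata's representation.

/// Number of byte classes, i.e. columns per state in the table.
const STRIDE: usize = 3;

/// States: 0 = start, 1 = just saw 'a', 2 = saw "ab" (absorbing).
/// Row = state, column = byte class (0 = other, 1 = 'a', 2 = 'b').
/// With byte classes the table has 3 * 3 entries instead of 3 * 256.
const TABLE: [u32; 9] = [
    0, 1, 0, // state 0
    0, 1, 2, // state 1
    2, 2, 2, // state 2
];

/// Byte classes: bytes the DFA cannot tell apart share one column.
fn byte_classes() -> [u8; 256] {
    let mut classes = [0u8; 256];
    classes[b'a' as usize] = 1;
    classes[b'b' as usize] = 2;
    classes
}

/// Premultiplication: store `state * stride` instead of `state`,
/// so each stored ID is already the offset of that state's row.
fn premultiply(table: &[u32], stride: usize) -> Vec<u32> {
    table.iter().map(|&s| s * stride as u32).collect()
}

/// Hot loop without premultiplication: one multiply per input byte.
fn run_plain(table: &[u32], classes: &[u8; 256], input: &[u8]) -> u32 {
    let mut state = 0u32;
    for &b in input {
        let class = classes[b as usize] as usize;
        state = table[state as usize * STRIDE + class];
    }
    state
}

/// Hot loop with premultiplied IDs: just an add and a load per byte.
fn run_premultiplied(table: &[u32], classes: &[u8; 256], input: &[u8]) -> u32 {
    let mut state = 0u32; // start state 0 premultiplied is still 0
    for &b in input {
        let class = classes[b as usize] as usize;
        state = table[state as usize + class];
    }
    state
}

fn main() {
    let classes = byte_classes();
    let pre = premultiply(&TABLE, STRIDE);
    let inputs: [&[u8]; 3] = [b"xxab", b"ba", b"aab"];
    for input in inputs.iter() {
        let plain = run_plain(&TABLE, &classes, input);
        let fast = run_premultiplied(&pre, &classes, input);
        // Both loops agree once the premultiplied ID is divided back out.
        assert_eq!(plain as usize * STRIDE, fast as usize);
        println!("{:?}: contains \"ab\" = {}", String::from_utf8_lossy(input), plain == 2);
    }
}
```

The two loops visit the same states; the premultiplied variant simply moves the multiplication from search time to build time, which is why the combination with byte classes keeps both the table and the inner loop small.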