Skip to content
This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

RFC: removing configuration knobs and ability to choose state ID representation #7

Closed
BurntSushi opened this issue Mar 26, 2020 · 4 comments

Comments

@BurntSushi
Copy link
Owner

I'd really like to hear from folks if they are using anything but the default configuration when it comes to building DFAs. That is, are you setting either premultiply or byte_classes to false? If so, could you say what motivated it?

I am strongly considering removing everything but the premultiply=true, byte_classes=true and byte_classes=false configurations for dense and sparse DFAs, respectively. This will permit me to remove a lot of code and the gross enum that switches over different types of DFAs.

There is occasionally some performance difference between these options, but it is typically extremely small. In general, premultiply=true, byte_classes=true is the "best" option because it shrinks the size of a DFA considerable (via byte classes) while also keeping the hot loop light on instructions due to the premultiplication.

@8051Enthusiast
Copy link

I considered using premultiply=false, but didn't in the end.

First off, my use case is probably not representative at all, being a joke project, but maybe it's still better than nothing.

Basically, I serialize the DFA into a fat32 filesystem, where the directories are essentially the states and the directory entries, referencing other directories, transitions.

One reason I considered using premultiply=false was because I thought of using u32 state ids since fat32 only has 32-bit references.
Another reason was that mapping the IDs to references of blocks inside the fat32 filesystem would have been easier (with just state_dir_block = state_id*dir_size + 2).

In the end, I decided against using u32 because the benefits would have probably been negligible, and I just used usize.
I also decided not to make any assumption on the distribution/contiguousness of the state ids (or all directories having the same size) and just made a hashmap that maps state ids to the location of the state dir, generated in a separate pass by a BFS on the DFA (being specifically a BFS was not really necessary, I just needed a way to iterate over all states).

@BurntSushi
Copy link
Owner Author

Interesting! Thanks for the feedback.

And I think that is perhaps the most creative implementation of regexes that I've seen. Wow. :-)

BurntSushi added a commit to BurntSushi/aho-corasick that referenced this issue Apr 30, 2021
These options aren't really carrying their weight. In a future release,
aho-corasick will make both options enabled by default all the time with
the impossibility of disabling them. The main reason for this is that
these options require a quadrupling in the amount of code in this crate.

While it's possible to see a performance hit when using byte classes, it
should generally be very small. The improvement, if one exists, just
doesn't see worth it.

Please see #57 for more
discussion.

This is meant to mirror a similar decision occurring in regex-automata:
BurntSushi/regex-automata#7.
BurntSushi added a commit to BurntSushi/aho-corasick that referenced this issue Apr 30, 2021
These options aren't really carrying their weight. In a future release,
aho-corasick will make both options enabled by default all the time with
the impossibility of disabling them. The main reason for this is that
these options require a quadrupling in the amount of code in this crate.

While it's possible to see a performance hit when using byte classes, it
should generally be very small. The improvement, if one exists, just
doesn't see worth it.

Please see #57 for more
discussion.

This is meant to mirror a similar decision occurring in regex-automata:
BurntSushi/regex-automata#7.
@BurntSushi BurntSushi changed the title RFC: removing configuration knobs RFC: removing configuration knobs and ability to choose state ID representation May 23, 2021
@BurntSushi
Copy link
Owner Author

I've also decided to remove the StateID abstraction, such that one no longer has the ability to choose the representation used for state IDs. It will be fixed to u32.

u32 is really the sweet spot. u8 is generally too small for most things. u16 is plausible to use in some cases, but it doesn't take much to blow its budget, particularly for DFAs. u32 uses double the storage as a u16, but doubling a small number is still pretty small. By the time you start going beyond what a u32 can store, your memory and time requirements are so large as to be impractical, so a u64 is never really needed in practice.

Making the state ID generic was making the code a bit too complex, and I was finding it difficult to reason about its guarantees despite trying my best to button them down in a trait.

I'll also be fixing pattern IDs to u32 as well, such that they will share the same representation.

@BurntSushi
Copy link
Owner Author

This was done. Everything is just u32 (or rather, a newtype around u32).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants