decrease memory usage of DFA with variable width delta encoding of instruction pointers #199
Comments
Hey @BurntSushi, I may be interested in working on this! I've been meaning to work on regex (because it's really interesting), but I haven't had much time. I haven't done much work with finite automata (outside of my theoretical CS courses), so it'll probably take me some time to get started. (If you want to get this done quickly, I'll gladly let someone else take over, of course.)
@killercup No need to have it done quickly! I'm happy to mentor any experience level. :-)
@killercup To adjust the difficulty level a bit: I think the actual optimization work should not be that bad, since it's really just a simple encoding/decoding scheme. The reason why I ranked the difficulty higher than that is because the DFA code is dense and it may take a bit of time to grok it!
Note that PR #202 changes the representation of `State` to:

```rust
struct State {
    insts: Box<[InstPtr]>,
    flags: StateFlags,
}
```

I think everything I said above still applies. Another thought: if …
Very nice! Thanks for the heads up. I'll keep that in mind 😃
@killercup Did you end up deciding whether you wanted to tackle this?
@BurntSushi Yes, I still want to do this! I have spent some time experimenting and thinking about this, but haven't had enough time yet to do something meaningful. I hope that changes in the next weeks. I'll keep you up-to-date :)
@killercup Great! Look forward to it. I tried assigning you to this issue, but I guess I can only assign project members. Sorry. :-(
@killercup It turns out that this doesn't help the …
(This is a ticket that I feel has limited scope, and as such, I'd be happy to mentor it. I think I'd consider this "medium" if one has worked with finite automata before, and probably "hard" if not.)
The lazy DFA works by computing states as they are visited during search. At most one state is created per byte of the haystack. In the vast majority of cases, the time it takes to do state construction is negligible because states are reused throughout the search. However, the DFA does have pathological cases that can lead to exponential blow up in the number of states. In particular, the worst case is when every byte of input results in a new state being created. For this reason, the DFA puts a bound on how many states are stored. If that bound is reached, all existing states are deleted and the DFA continues on (possibly recomputing states that were previously deleted). Since creating states can be slow, if the DFA clears the state cache too frequently, then it will give up and a slower matcher will run in its place.
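To make that policy concrete, here is a rough sketch of a bounded state cache. This is not the code in `src/dfa.rs`; the types, the bound, and the give-up rule are made up purely to illustrate the behavior described above.

```rust
use std::collections::HashMap;

// Illustrative sketch only; not the actual lazy DFA implementation.
type StateKey = Vec<u32>; // e.g. the NFA instruction pointers in the state
type StateId = usize;

const MAX_CACHED_STATES: usize = 10_000; // hypothetical memory bound
const MAX_CACHE_CLEARS: usize = 5;       // hypothetical "too frequent" cutoff

struct StateCache {
    states: HashMap<StateKey, StateId>,
    clear_count: usize,
}

impl StateCache {
    /// Returns the state for `key`, building it if it isn't cached. Returns
    /// `None` when the cache has been flushed too often, signalling that the
    /// DFA should give up and let a slower matcher run instead.
    fn get_or_build(&mut self, key: StateKey) -> Option<StateId> {
        if let Some(&id) = self.states.get(&key) {
            return Some(id);
        }
        if self.states.len() >= MAX_CACHED_STATES {
            if self.clear_count >= MAX_CACHE_CLEARS {
                return None;
            }
            // Drop everything; previously computed states may get rebuilt.
            self.states.clear();
            self.clear_count += 1;
        }
        let id = self.states.len();
        self.states.insert(key, id);
        Some(id)
    }
}
```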
Therefore, given some fixed memory requirement of `N` bytes, it makes sense to try and figure out how we can store as many states as possible in `N` bytes. The more states we can fit into `N` bytes, the less frequently we flush the cache, which in turn increases the number of regexes and haystacks that we can process with the DFA.

Let's take a look at how `State` is defined:
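A rough reconstruction of the (pre-#202) definition, with field types partly inferred from the discussion below (`InstPtr` is a 32-bit instruction pointer and `StateFlags` fits in a single byte; the exact `StatePtr` type is an assumption):

```rust
// Sketch only; not copied from the source.
type InstPtr = u32;  // 32-bit pointer into the NFA program (per the text below)
type StatePtr = u32; // assumed representation of a transition target

struct StateFlags(u8); // a single byte of per-state facts

struct State {
    /// The state's transitions, used in the DFA's inner loop.
    next: Box<[StatePtr]>,
    /// The set of NFA states (instruction pointers) in this DFA state.
    insts: Box<[InstPtr]>,
    /// Match / word-boundary / empty-assertion facts about this state.
    flags: StateFlags,
}
```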
- `next` corresponds to the state's transitions.
- `insts` corresponds to the set of NFA states that make up this single DFA state.
- `flags` indicates some facts about the state, such as whether it is a match state, whether it observes word characters (to support word boundary assertion lookaround) and whether any of its NFA states correspond to empty assertions.

One somewhat straight-forward way to reduce the memory requirements is to optimize this representation.
`flags` is probably as small as it's going to get: it's a single byte. `next` is also unfortunately as small as it will get, since it must support fast random access; notably, it is used inside the DFA's inner loop. `insts`, however, is not used in the DFA fast path and is only used during state construction. State construction is permitted to be somewhat slow, which means we should be able to optimize for space at the expense of time.

Currently, `insts` is a sequence of 32 bit pointers to instructions in an NFA program. This means that every NFA state in a DFA state occupies 4 bytes. Some DFA states can contain many NFA states, especially for regexes that contain large Unicode character classes like `\pL`.
It is also my hypothesis that the set of instruction pointers in each DFA state probably occur close together in the NFA graph. This means that the overall delta between the pointers is probably small.

This means that we should be able to take advantage of delta compression. That is, `insts` would change its representation from a `Box<[InstPtr]>` to a `Box<[u8]>`, where we would write delta encoded pointers. For example, consider the sequence of instruction pointers extracted from the `(a|b|c|d)` portion of the regex `\pL(a|b|c|d)`; the actual program can be seen using the `regex-debug` tool. Each of these instruction pointers could be represented with two bytes, but they are always represented by four bytes today. In fact, we can do better with a delta encoding.
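As a stand-in for the original listing (these are made-up values, not the pointers from the actual compiled program), a run of nearby instruction pointers and its delta encoding might look like this:

```text
instruction pointers: 345  346  347  349
deltas:               345    1    1    2
```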
Now, only the first pointer requires two bytes while the remaining require a mere one byte. All we need to do is apply this delta encoding to our `insts` list, then encode the resulting pointers using a variable-length encoding, like the one used in protocol buffers. We would then need a way to decode and iterate over these pointers, which is needed here: https://github.com/rust-lang-nursery/regex/blob/2ab7be3d043a1ef640dc58ec4a4038d166ba1acd/src/dfa.rs#L644

And that should do it.
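Here is a minimal sketch of what a hand-rolled delta + varint scheme could look like, assuming the instruction pointers are stored in ascending order. The function names (`compress_insts`, `decompress_insts`) are hypothetical; this is not the implementation in `src/dfa.rs`, just an illustration of the technique:

```rust
/// Delta-encode a sorted slice of 32-bit instruction pointers into
/// protocol-buffer-style varints (7 data bits per byte, high bit = "more").
/// Hypothetical helper for illustration only.
fn compress_insts(insts: &[u32]) -> Box<[u8]> {
    let mut bytes = Vec::new();
    let mut prev = 0u32;
    for &ip in insts {
        let mut delta = ip - prev; // assumes `insts` is sorted ascending
        prev = ip;
        loop {
            let b = (delta & 0x7F) as u8;
            delta >>= 7;
            if delta == 0 {
                bytes.push(b);
                break;
            }
            bytes.push(b | 0x80);
        }
    }
    bytes.into_boxed_slice()
}

/// Decode the pointers back out, e.g. while iterating over a state's NFA
/// instructions during state construction.
fn decompress_insts(bytes: &[u8]) -> Vec<u32> {
    let mut insts = Vec::new();
    let (mut prev, mut delta, mut shift) = (0u32, 0u32, 0u32);
    for &b in bytes {
        delta |= ((b & 0x7F) as u32) << shift;
        if b & 0x80 == 0 {
            // Last byte of this varint: add the decoded delta to the
            // previously decoded pointer.
            prev += delta;
            insts.push(prev);
            delta = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    insts
}

fn main() {
    // Made-up pointer values matching the illustration above.
    let insts = vec![345, 346, 347, 349];
    let compressed = compress_insts(&insts);
    // 16 bytes as raw u32s, 5 bytes delta + varint encoded (2 + 1 + 1 + 1).
    assert_eq!(compressed.len(), 5);
    assert_eq!(decompress_insts(&compressed), insts);
}
```

With a scheme like this, a run of nearby pointers costs about one byte per pointer instead of four, which is exactly the saving described above.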
Here are some other thoughts:
- This might help the `sherlock::repeated_class_negation` benchmark, although I think it might not be quite enough. (I think its DFA requires around 5MB of space, and I don't think this optimization will eliminate more than half of the memory required, but maybe it will.)
- Please avoid `unsafe` and don't use any additional dependencies. I think the encoding scheme is simple enough that we can hand roll it. We shouldn't need any `unsafe` because we just don't care that much about CPU cycles spent here.
- There is an existing test, `dfa_handles_pathological_case` in `tests/crazy.rs`, but it's not clear how that can be used to test this optimization in particular. My feeling is that the most important tests for this addition already exist. The benchmarks will also catch us if we do something really dumb like increase the memory requirements.