Description
(This is a ticket that I feel has limited scope, and as such, I'd be happy to mentor it. I think I'd consider this "medium" if one has worked with finite automata before, and probably "hard" if not.)
The lazy DFA works by computing states as they are visited during search. At most one state is created per byte of the haystack. In the vast majority of cases, the time it takes to do state construction is negligible because states are reused throughout search. However, the DFA does have pathological cases that can lead to exponential blow up in the number of states. In particular, the worst case is when every byte of input results in a new state being created. For this reason, the DFA puts a bound on how many states are stored. If that bound is reached, all existing states are deleted and the DFA continues on (possibly recomputing states that were previously deleted). Since creating states can be slow, if the DFA clears the state cache too frequently, then the DFA will give up and a slower matcher will run in its place.
Therefore, given some fixed memory requirement of N bytes, it makes sense to try and figure out how we can store as many states as possible in N bytes. The more states we can fit into N bytes, the less frequently we flush the cache, which in turn increases the number of regexes and haystacks that we can process with the DFA.
Let's take a look at how `State` is defined:

```rust
struct State {
    next: Box<[StatePtr]>,
    insts: Box<[InstPtr]>,
    flags: StateFlags,
}
```
`next` corresponds to the state's transitions. `insts` corresponds to the set of NFA states that make up this single DFA state. `flags` indicates some facts about the state, such as whether it is a match state, whether it observes word characters (to support word boundary lookaround assertions) and whether any of its NFA states correspond to empty assertions.
One somewhat straightforward way to reduce the memory requirements is to optimize this representation. `flags` is probably as small as it's going to get: it's a single byte. `next` is also, unfortunately, as small as it will get, since it must support fast random access; notably, it is used inside the DFA's inner loop. `insts`, however, is not used on the DFA's fast path and is only used during state construction. State construction is permitted to be somewhat slow, which means we should be able to optimize for space at the expense of time.
Currently, `insts` is a sequence of 32-bit pointers to instructions in an NFA program. This means that every NFA state in a DFA state occupies 4 bytes. Some DFA states can contain many NFA states, especially for regexes that contain large Unicode character classes like `\pL`. It is also my hypothesis that the set of instruction pointers in each DFA state probably occurs close together in the NFA graph. This means that the overall delta between the pointers is probably small.
This means that we should be able to take advantage of delta compression. That is, `insts` would change its representation from a `Box<[InstPtr]>` to a `Box<[u8]>`, where we would write delta encoded pointers. For example, consider this sequence of instruction pointers:

```
[2595, 2596, 2597, 2598, 2599, 2600, 2601]
```
These are real instruction pointers extracted from the `(a|b|c|d)` portion of the regex `\pL(a|b|c|d)`. The actual program can be seen using the `regex-debug` tool:
```
$ regex-debug compile '\pL(a|b|c|d)' --dfa
...
2595 Split(2596, 2597)
2596 Bytes(a, a) (goto: 2602)
2597 Split(2598, 2599)
2598 Bytes(b, b) (goto: 2602)
2599 Split(2600, 2601)
2600 Bytes(c, c) (goto: 2602)
2601 Bytes(d, d)
2602 Match(0)
```
Notably, each of these instruction pointers could be represented with two bytes, but today they are always represented by four. In fact, we can do better with a delta encoding. After applying a delta encoding, here's what the new instruction pointers look like:

```
[2595, 1, 1, 1, 1, 1, 1]
```
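As a quick illustration, the delta transform itself is just a running subtraction over the sorted pointer list. Here's a minimal sketch (the function name is hypothetical, not anything in the crate):

```rust
// Hedged sketch: replace each instruction pointer with its delta from
// the previous one. Assumes the input is sorted in ascending order.
fn deltas(insts: &[u32]) -> Vec<u32> {
    let mut prev = 0;
    insts
        .iter()
        .map(|&ip| {
            let d = ip - prev; // distance from the previous pointer
            prev = ip;
            d
        })
        .collect()
}
```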
Now, only the first pointer requires two bytes while the remaining require a mere one byte each. All we need to do is apply this delta encoding to our `insts` list, then encode the resulting values using a variable-length integer encoding, like the one used in protocol buffers. We would then need a way to decode and iterate over these pointers, which is needed here: https://github.com/rust-lang-nursery/regex/blob/2ab7be3d043a1ef640dc58ec4a4038d166ba1acd/src/dfa.rs#L644
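To make the scheme concrete, here is one way the whole thing could be hand rolled: a protobuf-style varint (7 payload bits per byte, high bit means "more bytes follow") over the deltas, plus an iterator that decodes pointers lazily. All names here are hypothetical sketches, not the crate's actual API:

```rust
// Hedged sketch: varint-encode delta-compressed instruction pointers,
// and decode them with an iterator (no intermediate Vec needed).

fn push_varint(out: &mut Vec<u8>, mut n: u32) {
    // Emit 7 bits at a time; the high bit marks a continuation byte.
    while n >= 0x80 {
        out.push((n as u8 & 0x7F) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

// Assumes `insts` is sorted in ascending order.
fn encode_insts(insts: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0;
    for &ip in insts {
        push_varint(&mut out, ip - prev);
        prev = ip;
    }
    out
}

/// Iterator over the original instruction pointers, decoding as it goes.
struct InstIter<'a> {
    bytes: &'a [u8],
    prev: u32,
}

impl<'a> Iterator for InstIter<'a> {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.bytes.is_empty() {
            return None;
        }
        let (mut delta, mut shift) = (0u32, 0);
        while let Some((&b, rest)) = self.bytes.split_first() {
            self.bytes = rest;
            delta |= ((b & 0x7F) as u32) << shift;
            shift += 7;
            if b & 0x80 == 0 {
                break;
            }
        }
        // Undo the delta transform by accumulating.
        self.prev += delta;
        Some(self.prev)
    }
}
```

With this sketch, the seven pointers from the example above encode into eight bytes (two for the first pointer, one for each of the six deltas of 1) instead of the 28 bytes that seven `u32`s occupy today. Since decoding only happens during state construction, the extra CPU cost should be acceptable.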
And that should do it.
Here are some other thoughts:
- It would be cool to get the memory usage down low enough to improve the `sherlock::repeated_class_negation` benchmark, although I think it might not be quite enough. (I think its DFA requires around 5MB of space, and I don't think this optimization will eliminate more than half of the memory required, but maybe it will.)
- Ideally, we don't use `unsafe` and don't add any additional dependencies. I think the encoding scheme is simple enough that we can hand roll it. We shouldn't need any `unsafe` because we just don't care that much about CPU cycles spent here.
- Tests may be hard. Currently, the only real test that is even close to this issue is `dfa_handles_pathological_case` in `tests/crazy.rs`, but it's not clear how that can be used to test this optimization in particular. My feeling is that the most important tests for this addition already exist. The benchmarks will also catch us if we do something really dumb like increase the memory requirements.