decrease memory usage of DFA with variable width delta encoding of instruction pointers #199

Closed
@BurntSushi

Description

(This is a ticket that I feel has limited scope, and as such, I'd be happy to mentor it. I think I'd consider this "medium" if one has worked with finite automata before, and probably "hard" if not.)

The lazy DFA works by computing states as they are visited during search. At most one state is created per byte of the haystack. In the vast majority of cases, the time spent on state construction is negligible because states are reused throughout the search. However, the DFA does have pathological cases that can lead to an exponential blow-up in the number of states. In particular, the worst case is when every byte of input results in a new state being created. For this reason, the DFA puts a bound on how many states are stored. If that bound is reached, all existing states are deleted and the DFA continues on (possibly recomputing states that were previously deleted). Since creating states can be slow, if the DFA clears the state cache too frequently, then the DFA gives up and a slower matcher runs in its place.

Therefore, given some fixed memory requirement of N bytes, it makes sense to try and figure out how we can store as many states as possible in N bytes. The more states we can fit into N bytes, the less frequently we flush the cache, which in turn increases the number of regexes and haystacks that we can process with the DFA.

Let's take a look at how State is defined:

struct State {
    next: Box<[StatePtr]>,
    insts: Box<[InstPtr]>,
    flags: StateFlags,
}

next corresponds to the state's transitions. insts corresponds to the set of NFA states that make up this single DFA state. flags indicates some facts about the state, such as whether it is a match state, whether it observes word characters (to support word boundary assertion lookaround) and whether any of its NFA states correspond to empty assertions.

One somewhat straightforward way to reduce the memory requirements is to optimize this representation. flags is probably as small as it's going to get: it's a single byte. next is unfortunately also as small as it will get, since it must support fast random access; notably, it is used inside the DFA's inner loop. insts, however, is not used in the DFA fast path and is only needed during state construction. State construction is permitted to be somewhat slow, which means we should be able to optimize for space at the expense of time.

Currently, insts is a sequence of 32 bit pointers to instructions in an NFA program. This means that every NFA state in a DFA state occupies 4 bytes. Some DFA states can contain many NFA states, especially regexes that contain large Unicode character classes like \pL. It is also my hypothesis that the set of instruction pointers in each DFA state probably occur close together in the NFA graph. This means that the overall delta between the pointers is probably small.

This means that we should be able to take advantage of delta compression. That is, insts would change its representation from a Box<[InstPtr]> to a Box<[u8]>, where we would write delta encoded pointers. For example, consider this sequence of instruction pointers:

[2595, 2596, 2597, 2598, 2599, 2600, 2601]

These are real instruction pointers extracted from the (a|b|c|d) portion of the regex \pL(a|b|c|d). The actual program can be seen using the regex-debug tool:

$ regex-debug compile '\pL(a|b|c|d)' --dfa
...
2595 Split(2596, 2597)
2596 Bytes(a, a) (goto: 2602)
2597 Split(2598, 2599)
2598 Bytes(b, b) (goto: 2602)
2599 Split(2600, 2601)
2600 Bytes(c, c) (goto: 2602)
2601 Bytes(d, d)
2602 Match(0)

Notably, each of these instruction pointers could be represented with two bytes, but they are always represented by four bytes today. In fact, we can do better with a delta encoding. After applying a delta encoding, here's what the new instruction pointers look like:

[2595, 1, 1, 1, 1, 1, 1]
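As a sketch of that delta step (using plain u32 values for the 32-bit instruction pointers described above; this is illustrative, not the crate's actual code):

```rust
// Sketch: delta-encode a sorted slice of instruction pointers.
// Each output value is the difference from the previous pointer;
// the first pointer is kept as-is (i.e., its delta from 0).
fn delta_encode(ptrs: &[u32]) -> Vec<u32> {
    let mut deltas = Vec::with_capacity(ptrs.len());
    let mut prev = 0u32;
    for &p in ptrs {
        deltas.push(p - prev);
        prev = p;
    }
    deltas
}
```

Applied to the pointers above, this yields exactly the [2595, 1, 1, 1, 1, 1, 1] sequence shown.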

Now, only the first pointer requires two bytes while the remaining require a mere one byte. All we need to do is apply this delta encoding to our insts list, then encode the resulting pointers using a variable-length encoding, like the one used in protocol buffers. We would then need a way to decode and iterate over these pointers, which is needed here: https://github.com/rust-lang-nursery/regex/blob/2ab7be3d043a1ef640dc58ec4a4038d166ba1acd/src/dfa.rs#L644
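A hand-rolled version of that variable-length encoding could look something like the following. These are hypothetical helper names, not the crate's actual API; the scheme is the same LEB128-style varint that protocol buffers use (7 bits of payload per byte, high bit set on all but the last byte):

```rust
// Sketch: append one u32 to `buf` as a protobuf-style varint.
fn write_varint(buf: &mut Vec<u8>, mut n: u32) {
    while n >= 0x80 {
        // Emit the low 7 bits with the continuation bit set.
        buf.push((n as u8 & 0x7F) | 0x80);
        n >>= 7;
    }
    buf.push(n as u8);
}

// Sketch: read one varint from `buf` starting at `*pos`,
// advancing `*pos` past the bytes consumed.
fn read_varint(buf: &[u8], pos: &mut usize) -> u32 {
    let mut n = 0u32;
    let mut shift = 0;
    loop {
        let b = buf[*pos];
        *pos += 1;
        n |= ((b & 0x7F) as u32) << shift;
        if b & 0x80 == 0 {
            return n;
        }
        shift += 7;
    }
}

// Sketch of the decoding side: iterate over the delta+varint
// encoded bytes and recover the original instruction pointers
// by keeping a running sum of the deltas.
fn decode_insts(buf: &[u8]) -> Vec<u32> {
    let mut pos = 0;
    let mut prev = 0u32;
    let mut out = Vec::new();
    while pos < buf.len() {
        prev += read_varint(buf, &mut pos);
        out.push(prev);
    }
    out
}
```

With this scheme, the deltas [2595, 1, 1, 1, 1, 1, 1] encode to 8 bytes (two for 2595, one for each 1) instead of the 28 bytes the current Box<[InstPtr]> representation needs for those seven pointers. In the real code, decoding would presumably be an iterator rather than collecting into a Vec, so the DFA can walk the NFA states without allocating.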

And that should do it.

Here are some other thoughts:

  1. It would be cool to get the memory usage down low enough to improve the sherlock::repeated_class_negation benchmark, although I think it might not be quite enough. (I think its DFA requires around 5MB of space, and I don't think this optimization will eliminate more than half of the memory required, but maybe it will.)
  2. Ideally, we don't use unsafe and don't use any additional dependencies. I think the encoding scheme is simple enough that we can hand roll it. We shouldn't need any unsafe because we just don't care that much about CPU cycles spent here.
  3. Tests may be hard. Currently, the only real test that is even close to this issue is dfa_handles_pathological_case in tests/crazy.rs, but it's not clear how that can be used to test this optimization in particular. My feeling is that the most important tests for this addition already exist. The benchmarks will also catch us if we do something really dumb like increase the memory requirements.
