UTF-8 parsing with state machine #59399

jocutajar · 2019-03-24T15:30:04Z

Hi, I'd like to propose a cleaner approach (IMHO) to parse UTF-8 in core::str:from_utf8 and friends. Not confident to optimize the ASCII fast forward in unsafe, but I think a state machine would describe the problem better and is more readable.

As a bonus, the state machine could be made available for other more complex parsers. My case is for ANSI escape sequences which may not be valid UTF-8 but are embedded in UTF-8 streams. The machine exposes the number of incomplete() and needed() bytes, it can count the valid chars seen(). As such it would be very useful for general byte stream processing. It works with no_std, depending only on core::str.

The essence is quite simple:


impl Utf8Machine {
    pub fn turn(&mut self, input: &u8) -> Utf8Act {
        /*
         * legal utf-8 byte sequence
         * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
         *
         *  Code Points        1st       2s       3s       4s
         * U+0000..U+007F     00..7F
         * U+0080..U+07FF     C2..DF   80..BF
         * U+0800..U+0FFF     E0       A0..BF   80..BF
         * U+1000..U+CFFF     E1..EC   80..BF   80..BF
         * U+D000..U+D7FF     ED       80..9F   80..BF
         * U+E000..U+FFFF     EE..EF   80..BF   80..BF
         * U+10000..U+3FFFF   F0       90..BF   80..BF   80..BF
         * U+40000..U+FFFFF   F1..F3   80..BF   80..BF   80..BF
         * U+100000..U+10FFFF F4       80..8F   80..BF   80..BF
         */
        use Utf8Act::*;
        use Utf8State::*;
        let seen = self.seen;
        let done = (self.seen + 1, Normal, Done);
        let (seen, state, act) = match (&self.state, input) {
            (Normal, 0x00...0x7F) => done,
            (Normal, 0x80...0xC1) => (seen, Normal, Invalid(1)),
            (Normal, 0xC2...0xDF) => (seen, C___x_, NeedMore(1)),
            (Normal, 0xE0...0xE0) => (seen, C__0__, NeedMore(2)),
            (Normal, 0xE1...0xEC) => (seen, C__x__, NeedMore(2)),
            (Normal, 0xED...0xED) => (seen, C__D__, NeedMore(2)),
            (Normal, 0xEE...0xEF) => (seen, C__x__, NeedMore(2)),
            (Normal, 0xF0...0xF0) => (seen, C_0___, NeedMore(3)),
            (Normal, 0xF1...0xF3) => (seen, C_x___, NeedMore(3)),
            (Normal, 0xF4...0xF4) => (seen, C_4___, NeedMore(3)),
            (Normal, 0xF5...0xFF) => (seen, Normal, Invalid(1)),

            (C___x_, 0x80...0xBF) => done,
            (C__xx_, 0x80...0xBF) => done,
            (C_xxx_, 0x80...0xBF) => done,

            (C__0__, 0xA0...0xBF) => (seen, C__xx_, NeedMore(1)),
            (C__D__, 0x80...0x9F) => (seen, C__xx_, NeedMore(1)),
            (C__x__, 0x80...0xBF) => (seen, C__xx_, NeedMore(1)),
            (C_0___, 0x90...0xBF) => (seen, C_xx__, NeedMore(2)),
            (C_4___, 0x80...0x8F) => (seen, C_xx__, NeedMore(2)),
            (C_x___, 0x80...0xBF) => (seen, C_xx__, NeedMore(2)),
            (C_xx__, 0x80...0xBF) => (seen, C_xxx_, NeedMore(1)),

            (C___x_, _) => (seen, Normal, Invalid(2)),
            (C__0__, _) => (seen, Normal, Invalid(2)),
            (C__D__, _) => (seen, Normal, Invalid(2)),
            (C__x__, _) => (seen, Normal, Invalid(2)),
            (C__xx_, _) => (seen, Normal, Invalid(3)),
            (C_0___, _) => (seen, Normal, Invalid(2)),
            (C_4___, _) => (seen, Normal, Invalid(2)),
            (C_x___, _) => (seen, Normal, Invalid(2)),
            (C_xx__, _) => (seen, Normal, Invalid(3)),
            (C_xxx_, _) => (seen, Normal, Invalid(4)),
        };
        self.seen = seen;
        self.state = state;
        act
    }
}

The playground includes applications of the state machine where one could implement the fast forward ASCII optimization.

The text was updated successfully, but these errors were encountered:

nagisa · 2019-03-24T21:28:42Z

run_utf8_validation has hard constraints on its performance, given how likely it is to end up being a hot code path. Sadly, a naive state machine is unlikely to live up to those requirements.

That being said, if it does improve throughput characteristics, we will gladly accept an implementation.

jocutajar · 2019-03-25T16:24:17Z

Is there a standard benchmark for this, @nagisa?

nagisa · 2019-03-26T11:05:57Z

#30740 used large documents as a test and contains a link to the benchmarks used there. In general what matters is delta between old and new implementations, so any sufficiently diverse (in terms of the text used) benchmark will do.

cc @bluss

jocutajar · 2019-03-29T01:42:15Z

Lovely. My state machine including the fast forward trick is still roughly 5x slower on mixed utf-8. I couldn't bring it anywhere near your mark and I've tortured it a good deal. I don't understand why, but I figured pattern matching has some effect. Just adding or removing one option had noticeable impact. I had hopes Rust would optimize those a lot. Well I fought and I surrender :)

Mark-Simulacrum · 2019-03-30T13:35:53Z

I'm going to close this - if progress is made then a PR would be good to discuss the specific changes. Further discussion about the viability of this change would be best suited for internals.

jerch · 2019-04-26T08:35:57Z

If speed is a concern, Bob Steagall showed that a state machine based version in C++ can greatly enhance UTF8 decoding throughput. You can find the implementation here and a great video explaining his efforts here.

jocutajar mentioned this issue Mar 24, 2019

UTF-8 decoding should reject overlong sequences and surrogates JuliaLang/julia#25452

Closed

estebank added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Mar 25, 2019

Mark-Simulacrum closed this as completed Mar 30, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-8 parsing with state machine #59399

UTF-8 parsing with state machine #59399

jocutajar commented Mar 24, 2019 •

edited

Loading

nagisa commented Mar 24, 2019 •

edited

Loading

jocutajar commented Mar 25, 2019

nagisa commented Mar 26, 2019

jocutajar commented Mar 29, 2019 •

edited

Loading

Mark-Simulacrum commented Mar 30, 2019

jerch commented Apr 26, 2019

UTF-8 parsing with state machine #59399

UTF-8 parsing with state machine #59399

Comments

jocutajar commented Mar 24, 2019 • edited Loading

nagisa commented Mar 24, 2019 • edited Loading

jocutajar commented Mar 25, 2019

nagisa commented Mar 26, 2019

jocutajar commented Mar 29, 2019 • edited Loading

Mark-Simulacrum commented Mar 30, 2019

jerch commented Apr 26, 2019

jocutajar commented Mar 24, 2019 •

edited

Loading

nagisa commented Mar 24, 2019 •

edited

Loading

jocutajar commented Mar 29, 2019 •

edited

Loading