UTF-8 parsing with state machine #59399
Comments
That being said, if it does improve throughput characteristics, we will gladly accept an implementation.
Is there a standard benchmark for this, @nagisa?
Lovely. My state machine, including the fast forward trick, is still roughly 5x slower on mixed UTF-8. I couldn't bring it anywhere near your mark and I've tortured it a good deal. I don't understand why, but I figured pattern matching has some effect. Just adding or removing one option had a noticeable impact. I had hopes Rust would optimize those a lot. Well, I fought and I surrender :)
I'm going to close this - if progress is made then a PR would be good to discuss the specific changes. Further discussion about the viability of this change would be best suited for internals.
Hi, I'd like to propose a cleaner approach (IMHO) to parse UTF-8 in `core::str::from_utf8` and friends. I'm not confident enough to optimize the ASCII fast forward in unsafe code, but I think a state machine would describe the problem better and be more readable.

As a bonus, the state machine could be made available for other, more complex parsers. My case is ANSI escape sequences, which may not be valid UTF-8 but are embedded in UTF-8 streams. The machine exposes the number of `incomplete()` and `needed()` bytes, and it can count the valid chars `seen()`. As such it would be very useful for general byte stream processing. It works with `no_std`, depending only on `core::str`.

The essence is quite simple:
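To illustrate the idea, here is a minimal sketch of a byte-at-a-time UTF-8 state machine. It is not the code from the issue or the playground: the type `Utf8Machine`, the error type `Invalid`, the `push` method and the `validate` helper are hypothetical names chosen for this example, and only the accessor names `incomplete()`, `needed()` and `seen()` come from the description above. The allowed continuation-byte ranges follow the well-formed byte sequence rules of the Unicode Standard, so overlong forms, surrogates and code points above U+10FFFF are rejected.

```rust
/// Hypothetical state machine, assumed for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Utf8Machine {
    needed: u8,     // continuation bytes still expected
    incomplete: u8, // bytes consumed of the current unfinished sequence
    lo: u8,         // lowest allowed value for the next continuation byte
    hi: u8,         // highest allowed value for the next continuation byte
    seen: usize,    // complete chars decoded so far
}

/// Hypothetical error type: the pushed byte cannot continue valid UTF-8.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Invalid;

impl Utf8Machine {
    pub fn new() -> Self {
        Utf8Machine { needed: 0, incomplete: 0, lo: 0x80, hi: 0xBF, seen: 0 }
    }

    pub fn needed(&self) -> usize { self.needed as usize }
    pub fn incomplete(&self) -> usize { self.incomplete as usize }
    pub fn seen(&self) -> usize { self.seen }

    /// Feed one byte. On error the machine resets to the start state.
    pub fn push(&mut self, b: u8) -> Result<(), Invalid> {
        if self.needed == 0 {
            // Start state: classify the lead byte.
            let (needed, lo, hi) = match b {
                0x00..=0x7F => {
                    self.seen += 1; // ASCII is a complete char on its own
                    return Ok(());
                }
                0xC2..=0xDF => (1, 0x80, 0xBF),
                0xE0 => (2, 0xA0, 0xBF), // excludes overlong 3-byte forms
                0xE1..=0xEC | 0xEE..=0xEF => (2, 0x80, 0xBF),
                0xED => (2, 0x80, 0x9F), // excludes surrogates
                0xF0 => (3, 0x90, 0xBF), // excludes overlong 4-byte forms
                0xF1..=0xF3 => (3, 0x80, 0xBF),
                0xF4 => (3, 0x80, 0x8F), // excludes values above U+10FFFF
                _ => return Err(Invalid), // stray continuation, 0xC0, 0xC1, 0xF5..
            };
            self.needed = needed;
            self.lo = lo;
            self.hi = hi;
            self.incomplete = 1;
            Ok(())
        } else {
            // Continuation state: only the first continuation byte has a
            // restricted range; later ones accept the full 0x80..=0xBF.
            if b < self.lo || b > self.hi {
                *self = Utf8Machine { seen: self.seen, ..Utf8Machine::new() };
                return Err(Invalid);
            }
            self.needed -= 1;
            self.incomplete += 1;
            self.lo = 0x80;
            self.hi = 0xBF;
            if self.needed == 0 {
                self.incomplete = 0;
                self.seen += 1;
            }
            Ok(())
        }
    }
}

/// Validate a whole slice and return the number of chars it contains.
/// A sequence that is cut off at the end of the slice counts as invalid.
pub fn validate(bytes: &[u8]) -> Result<usize, Invalid> {
    let mut m = Utf8Machine::new();
    for &b in bytes {
        m.push(b)?;
    }
    if m.needed() == 0 { Ok(m.seen()) } else { Err(Invalid) }
}
```

Because the machine keeps its state between calls to `push`, a caller handling a stream can stop at a chunk boundary, inspect `incomplete()` and `needed()`, and resume with the next chunk. The sketch uses only integer operations, so it should also work in a `no_std` crate.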
The playground includes applications of the state machine where one could implement the fast forward ASCII optimization.
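As a rough idea of what an ASCII fast forward can look like in safe code, the sketch below skips over the leading all-ASCII part of a slice eight bytes at a time. The function name `ascii_prefix_len` is hypothetical, and this is neither the playground code nor the implementation used in `core`; it only shows the general trick of testing the high bit of several bytes with one mask.

```rust
/// Length of the longest all-ASCII prefix of `bytes`.
/// Hypothetical helper, assumed for illustration only.
pub fn ascii_prefix_len(bytes: &[u8]) -> usize {
    // One high bit per byte of a 64-bit word.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

    let mut i = 0;
    // Whole 8-byte words: a set high bit means a non-ASCII byte is inside.
    while i + 8 <= bytes.len() {
        let mut word = [0u8; 8];
        word.copy_from_slice(&bytes[i..i + 8]);
        if u64::from_ne_bytes(word) & HIGH_BITS != 0 {
            break;
        }
        i += 8;
    }
    // The tail, or the word that contains the first non-ASCII byte,
    // is scanned one byte at a time.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}
```

A validator could call this whenever it is in its start state, count the skipped bytes as chars, and hand only the remainder to the byte-at-a-time logic. Whether that closes the throughput gap discussed in the comments would have to be measured.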