Make the reflect path parser utf-8-unaware #9371
not a huge fan of the `unsafe`, but it's reasonably trivial and i trust the performance improvements are compelling!
This works because the delimiter symbols '.[]#' are all ASCII, and the parser only needs to care about delimiters, so we can avoid the overhead of handling UTF-8 strings properly at every step. This is a major improvement when comparing to the previous parser.

1. `access_following` and `next_token` now inline in `PathParser::next`
2. Benchmarking shows a 20% performance increase
Head branch was pushed to by a user without write access
ooops. Well I rebased to latest HEAD, expecting to fix CI. But instead I disabled auto-merge.
Objective
All delimiter symbols used by the path parser are ASCII, which means we can entirely ignore UTF-8 handling. This may improve performance.
Solution
Instead of storing the path as an `&str` plus the parser offset, and reading the path using `&self.path[self.offset..]`, we store the parser state in a `&[u8]`. This allows two optimizations:

1. Avoiding the UTF-8 checks on `&self.path[self.offset..]`
2. The length of the remaining input lives in the `&[u8]`'s reference metadata, and is assumed valid by the compiler.

This is a major improvement when comparing to the previous parser.

1. `access_following` and `next_token` now inline in `PathParser::next`
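To make the change in state layout concrete, here is a minimal sketch contrasting the two representations. The type names `OffsetState` and `ByteState` are hypothetical and exist only for illustration; they are not the types used in this PR, and the sketch stays in safe Rust, so the slice indexing in `advance` is still bounds-checked.

```rust
/// Hypothetical "before" layout: the full path plus a separate offset.
struct OffsetState<'a> {
    path: &'a str,
    offset: usize,
}

impl<'a> OffsetState<'a> {
    /// Slicing a `str` checks both the bounds and that `offset`
    /// falls on a UTF-8 character boundary (it panics otherwise).
    fn rest(&self) -> &'a str {
        &self.path[self.offset..]
    }

    fn advance(&mut self, n: usize) {
        self.offset += n;
    }
}

/// Hypothetical "after" layout: only the bytes left to read.
/// The slice reference itself carries the remaining length.
struct ByteState<'a> {
    remaining: &'a [u8],
}

impl<'a> ByteState<'a> {
    /// Nothing to compute or validate: the state *is* the rest.
    fn rest(&self) -> &'a [u8] {
        self.remaining
    }

    /// Advancing shrinks the slice; no separate offset is kept.
    fn advance(&mut self, n: usize) {
        self.remaining = &self.remaining[n..];
    }
}
```

Both layouts expose the same remaining input after the same number of steps; the second one simply has no offset that could drift out of sync with the path.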
Please note that while we ignore UTF-8 handling, UTF-8 is still supported. This is because we only handle "at the edges": what happens exactly before and after a recognized `SYMBOL`. UTF-8 is handled transparently beyond that.
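The reason this is sound is that in UTF-8 every byte of a multi-byte character is `0x80` or above, so an ASCII delimiter byte can never appear inside one. The sketch below illustrates that with a byte-level tokenizer. `Token`, `ByteParser`, and `is_symbol` are hypothetical names for this example, not the PR's actual code, and the sketch uses the checked `std::str::from_utf8` rather than any `unsafe` conversion.

```rust
/// Hypothetical token type for this sketch.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    /// One of the ASCII delimiters `.`, `[`, `]`, `#`.
    Symbol(u8),
    /// Everything between two delimiters, possibly non-ASCII.
    Ident(&'a str),
}

/// The delimiter set named in the PR: all single ASCII bytes.
fn is_symbol(byte: u8) -> bool {
    matches!(byte, b'.' | b'[' | b']' | b'#')
}

/// Hypothetical parser whose whole state is the bytes left to read.
struct ByteParser<'a> {
    remaining: &'a [u8],
}

impl<'a> ByteParser<'a> {
    fn new(path: &'a str) -> Self {
        Self { remaining: path.as_bytes() }
    }
}

impl<'a> Iterator for ByteParser<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Token<'a>> {
        let bytes: &'a [u8] = self.remaining;
        // An empty slice ends iteration.
        let (&first, rest) = bytes.split_first()?;
        if is_symbol(first) {
            self.remaining = rest;
            return Some(Token::Symbol(first));
        }
        // Scan raw bytes up to the next delimiter (or the end).
        let end = bytes
            .iter()
            .position(|&b| is_symbol(b))
            .unwrap_or(bytes.len());
        let (ident, rest) = bytes.split_at(end);
        self.remaining = rest;
        // Cutting valid UTF-8 at ASCII bytes leaves valid UTF-8 on
        // both sides, so this checked conversion cannot fail here.
        let ident = std::str::from_utf8(ident)
            .expect("splitting at ASCII delimiters preserves UTF-8 validity");
        Some(Token::Ident(ident))
    }
}
```

For example, collecting `ByteParser::new("héllo.wörld[0]")` yields the identifiers `héllo`, `wörld`, and `0` intact, separated by the `.`, `[`, and `]` symbols, even though the scan never decodes a single character.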