Faster lexical analyzer #2665
clang-format 20 breaks these lines with newlines, which we don't want.
    rbs_position_t current; /* The current position: just before the current_character */
    rbs_position_t start;   /* The start position of the current token */

    unsigned int current_code_point; /* Current character code point */
The lexer data structure now stores the code point of the next character, so peeking at the next character is much faster than decoding it from the buffer each time.
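To illustrate the idea, here is a minimal sketch of a lexer that caches the next character's code point. The `sketch_lexer_t` type and the `sketch_*` functions are hypothetical and ASCII-only; they are not the actual `rbs_lexer_t` or the code in this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified lexer state; not the actual rbs_lexer_t. */
typedef struct {
    const char *input;               /* NUL-terminated, ASCII-only input (assumption) */
    size_t byte_pos;                 /* Offset of the next unread byte */
    unsigned int current_code_point; /* Cached code point of the next character; 0 at end */
} sketch_lexer_t;

/* Decode the character at byte_pos once and cache it. */
void sketch_fill(sketch_lexer_t *lexer) {
    lexer->current_code_point = (unsigned char) lexer->input[lexer->byte_pos];
}

void sketch_init(sketch_lexer_t *lexer, const char *input) {
    assert(lexer != NULL && input != NULL);
    lexer->input = input;
    lexer->byte_pos = 0;
    sketch_fill(lexer);
}

/* Peeking is a plain field read: no decoding and no buffer access. */
unsigned int sketch_peek(const sketch_lexer_t *lexer) {
    return lexer->current_code_point;
}

/* Advancing consumes the cached character and refills the cache. */
void sketch_advance(sketch_lexer_t *lexer) {
    if (lexer->current_code_point == 0) return; /* Already at end of input */
    lexer->byte_pos += 1;
    sketch_fill(lexer);
}
```

The decoding cost is paid once per character in `sketch_advance`, so repeated calls to `sketch_peek` between advances cost only a field read.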
    return lexer->current_code_point;
}

bool rbs_next_char(rbs_lexer_t *lexer, unsigned int *codepoint, size_t *byte_len) {
This function writes the next code point in the buffer and its byte length to the output parameters.
    const char *start = lexer->string.start + lexer->current.byte_pos;

    // Fast path for ASCII (single-byte) characters
    if ((unsigned int) *start < 128) {
We assume the character encoding of RBS files is ASCII-compatible, like Ruby source files.
If the character is ASCII, it is a single byte.
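As a rough sketch of the fast-path idea, the function below returns immediately for bytes under 128 and only falls back to multi-byte decoding otherwise. `sketch_next_char` is a hypothetical helper, and its slow path assumes UTF-8 input, which is narrower than what the lexer in this PR has to handle.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not the PR's rbs_next_char.
 * Reads the character starting at buf[pos] and reports its code point and
 * byte length. Returns false at end of input or on malformed input. */
bool sketch_next_char(const char *buf, size_t len, size_t pos,
                      unsigned int *codepoint, size_t *byte_len) {
    if (pos >= len) return false;
    unsigned char b0 = (unsigned char) buf[pos];

    /* Fast path: in an ASCII-compatible encoding, a byte below 128 is a
     * complete single-byte character whose code point is the byte itself. */
    if (b0 < 128) {
        *codepoint = b0;
        *byte_len = 1;
        return true;
    }

    /* Slow path (assumption: the input is UTF-8). The lead byte gives the
     * sequence length. Overlong forms and surrogates are not rejected. */
    size_t n;
    unsigned int cp;
    if ((b0 & 0xE0) == 0xC0) {
        n = 2;
        cp = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) {
        n = 3;
        cp = b0 & 0x0F;
    } else if ((b0 & 0xF8) == 0xF0) {
        n = 4;
        cp = b0 & 0x07;
    } else {
        return false; /* Stray continuation byte or invalid lead byte */
    }

    if (len - pos < n) return false; /* Truncated sequence */
    for (size_t i = 1; i < n; i++) {
        unsigned char b = (unsigned char) buf[pos + i];
        if ((b & 0xC0) != 0x80) return false; /* Not a continuation byte */
        cp = (cp << 6) | (b & 0x3F);
    }

    *codepoint = cp;
    *byte_len = n;
    return true;
}
```

Since most bytes in typical source text are ASCII, the common case takes one comparison and two stores.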
    unsigned int c = rbs_utf8_string_to_codepoint(str);
    lexer->last_char = c;
    return c;
    *codepoint = 12523; // Dummy data for "ル" from "ルビー" (Ruby) in Unicode
Another hack to support encodings other than UTF-8.
The lexer doesn't know the exact Unicode code point of the next character in other encodings, so it returns an arbitrary dummy code point instead. The lexer still consumes the character, and because the dummy code point has no special meaning to the lexer, lexing works as expected.
We may want to return an uppercase character to support multi-byte class/constant names.
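A minimal sketch of this fallback, under stated assumptions: the `sketch_char_width_fn` callback, `sketch_next_char_any_encoding`, and `sketch_two_byte_width` are hypothetical names for illustration, and how the real lexer obtains a character's byte width is not shown in this hunk.

```c
#include <stdbool.h>
#include <stddef.h>

/* The same dummy value as in the diff: U+30EB. */
#define SKETCH_DUMMY_CODE_POINT 12523u

/* Hypothetical callback: byte width of the character at p in the source
 * encoding, or 0 if the bytes are invalid or truncated. */
typedef size_t (*sketch_char_width_fn)(const unsigned char *p, size_t remaining);

/* Hypothetical helper. ASCII bytes get their real code point. Any other
 * character gets a dummy code point, but its real byte width is kept so the
 * lexer's position stays correct. */
bool sketch_next_char_any_encoding(const unsigned char *p, size_t remaining,
                                   sketch_char_width_fn char_width,
                                   unsigned int *codepoint, size_t *byte_len) {
    if (remaining == 0) return false;

    if (*p < 128) {
        *codepoint = *p;
        *byte_len = 1;
        return true;
    }

    size_t width = char_width(p, remaining);
    if (width == 0 || width > remaining) return false;

    *codepoint = SKETCH_DUMMY_CODE_POINT; /* Not the character's real code point */
    *byte_len = width;
    return true;
}

/* Example width function for a made-up encoding in which every non-ASCII
 * character is exactly two bytes long. */
size_t sketch_two_byte_width(const unsigned char *p, size_t remaining) {
    (void) p;
    return remaining >= 2 ? 2 : 0;
}
```

The design trade-off is that only the byte length needs to be accurate for the lexer to advance correctly. Any rule that depends on the code point itself, such as detecting an uppercase initial for a constant name, sees the dummy value rather than the real character.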
Extracted from #2652
This PR improves the lexer's data structure in RBS.
It improves parsing performance from ~14 i/s to ~16 i/s, measured by bin/benchmark-parse.rb.
Baseline
Fix lexer