@soutaro soutaro commented Sep 26, 2025

Extracted from #2652

This PR improves the data structure of the lexer in RBS.

It improves parsing performance from ~14.5 i/s to ~16.7 i/s (roughly 15% faster), as measured by `bin/benchmark-parse.rb`.

The benchmark also loads the `activerecord` RBS files from my local checkout of the `gem_rbs_collection` repository.

Baseline

➜  rbs git:(38724282) bundle exec ruby bin/benchmark-parse.rb core/**/*.rbs ../../ruby/gem_rbs_collection/gems/activerecord/8.0/*.rbs sig/**/*.rbs
Benchmarking parsing 177 files...
✅ 14.506 i/s (±0.000%)

Fix lexer

➜  rbs git:(fix-lexer) bundle exec ruby bin/benchmark-parse.rb core/**/*.rbs ../../ruby/gem_rbs_collection/gems/activerecord/8.0/*.rbs sig/**/*.rbs
Benchmarking parsing 177 files...
✅ 16.667 i/s (±0.000%)

@soutaro soutaro added this to the RBS 4.0 milestone Sep 26, 2025
rbs_position_t current; /* The current position: just before the current_character */
rbs_position_t start; /* The start position of the current token */

unsigned int current_code_point; /* Current character code point */

The lexer data structure now stores the code point of the next character, so peeking at the next character is much faster than re-reading it from the buffer.
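
To illustrate the idea, here is a minimal sketch of a lexer that caches the next code point. It is not the RBS implementation: `sketch_lexer_t` and the `sketch_*` functions are hypothetical names, and the sketch assumes ASCII-only input so that every character is one byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified lexer state (ASCII-only input assumed). */
typedef struct {
    const char *buf;                 /* input buffer */
    size_t len;                      /* buffer length in bytes */
    size_t pos;                      /* byte offset just before the cached character */
    unsigned int current_code_point; /* cached next character, 0 at end of input */
} sketch_lexer_t;

/* Refresh the cache from the buffer at the current position. */
static void sketch_load(sketch_lexer_t *lexer) {
    lexer->current_code_point =
        lexer->pos < lexer->len ? (unsigned char) lexer->buf[lexer->pos] : 0;
}

void sketch_init(sketch_lexer_t *lexer, const char *buf, size_t len) {
    assert(lexer != NULL && buf != NULL);
    lexer->buf = buf;
    lexer->len = len;
    lexer->pos = 0;
    sketch_load(lexer);
}

/* Peeking is a plain field read: no buffer access, no decoding. */
unsigned int sketch_peek(const sketch_lexer_t *lexer) {
    return lexer->current_code_point;
}

/* Consume the cached character and decode the following one exactly once. */
unsigned int sketch_advance(sketch_lexer_t *lexer) {
    unsigned int consumed = lexer->current_code_point;
    if (lexer->pos < lexer->len) {
        lexer->pos += 1;
        sketch_load(lexer);
    }
    return consumed;
}
```

In this sketch each character is decoded once, when it becomes the next character, and repeated peeks reuse the cached value.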

return lexer->current_code_point;
}

bool rbs_next_char(rbs_lexer_t *lexer, unsigned int *codepoint, size_t *byte_len) {

This function assigns the next code point in the buffer and its byte length to the output parameters.

const char *start = lexer->string.start + lexer->current.byte_pos;

// Fast path for ASCII (single-byte) characters
if ((unsigned int) *start < 128) {

We assume the character encoding of RBS files is ASCII-compatible, like Ruby source files.
If the next byte is an ASCII character, it is a single-byte character.
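
As a rough sketch of the fast path combined with the out-parameter style, assuming UTF-8 input: `sketch_next_char` is a hypothetical stand-in, not the function in this diff. It takes the buffer directly instead of a lexer, and it does not reject overlong or surrogate encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reader: writes the next code point and its byte length.
 * Returns false at end of input or on a malformed sequence. */
bool sketch_next_char(const char *s, size_t len,
                      unsigned int *codepoint, size_t *byte_len) {
    assert(s != NULL && codepoint != NULL && byte_len != NULL);
    if (len == 0) return false;

    const unsigned char *p = (const unsigned char *) s;

    /* Fast path: in an ASCII-compatible encoding, a byte below 128
     * is a complete single-byte character. */
    if (p[0] < 0x80) {
        *codepoint = p[0];
        *byte_len = 1;
        return true;
    }

    /* Slow path: the UTF-8 lead byte determines the sequence length. */
    size_t n;
    unsigned int cp;
    if ((p[0] & 0xE0) == 0xC0) {
        n = 2;
        cp = p[0] & 0x1F;
    } else if ((p[0] & 0xF0) == 0xE0) {
        n = 3;
        cp = p[0] & 0x0F;
    } else if ((p[0] & 0xF8) == 0xF0) {
        n = 4;
        cp = p[0] & 0x07;
    } else {
        return false; /* stray continuation byte or invalid lead byte */
    }

    if (n > len) return false; /* truncated sequence */

    /* Each continuation byte contributes its low 6 bits. */
    for (size_t i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    *codepoint = cp;
    *byte_len = n;
    return true;
}
```

For example, the UTF-8 bytes `E3 83 AB` ("ル") decode to code point 12523 with a byte length of 3, while `"a"` takes the fast path and never reaches the decoding loop.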

unsigned int c = rbs_utf8_string_to_codepoint(str);
lexer->last_char = c;
return c;
*codepoint = 12523; // Dummy data for "ル" from "ルビー" (Ruby) in Unicode

Another hack to support encodings other than UTF-8.
The lexer doesn't know the exact Unicode code point of the next character in other encodings, so it returns an arbitrary placeholder code point instead. The lexer still consumes the character, and because the placeholder has no special meaning to the lexer, this works fine.

We may want to return an uppercase character instead, to support multi-byte class/constant names.
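
A minimal sketch of this fallback, under stated assumptions: the byte width of a non-ASCII character comes from a caller-supplied callback (`sketch_char_width_fn`, hypothetical), and `sketch_sjis_like_width` is only an illustrative rule loosely modelled on Shift_JIS lead bytes. None of these names come from the RBS source; only the dummy value 12523 is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder for any non-ASCII character; 12523 is the value used in the diff. */
#define SKETCH_DUMMY_CODE_POINT 12523u

/* Hypothetical callback: byte width of the character starting at p, 0 if invalid. */
typedef size_t (*sketch_char_width_fn)(const unsigned char *p, size_t remaining);

/* Illustrative rule only: some lead bytes start a two-byte character. */
size_t sketch_sjis_like_width(const unsigned char *p, size_t remaining) {
    bool two_byte_lead = (p[0] >= 0x81 && p[0] <= 0x9F) ||
                         (p[0] >= 0xE0 && p[0] <= 0xFC);
    size_t width = two_byte_lead ? 2 : 1;
    return width <= remaining ? width : 0;
}

bool sketch_next_char_opaque(const char *s, size_t len,
                             sketch_char_width_fn width_of,
                             unsigned int *codepoint, size_t *byte_len) {
    assert(s != NULL && width_of != NULL);
    assert(codepoint != NULL && byte_len != NULL);
    if (len == 0) return false;

    const unsigned char *p = (const unsigned char *) s;

    /* ASCII bytes keep their real value so the lexer can still
     * recognise punctuation, keywords, and whitespace. */
    if (p[0] < 0x80) {
        *codepoint = p[0];
        *byte_len = 1;
        return true;
    }

    /* Non-ASCII: only the width matters for advancing; the code
     * point itself is replaced by the placeholder. */
    size_t width = width_of(p, len);
    if (width == 0) return false;

    *codepoint = SKETCH_DUMMY_CODE_POINT;
    *byte_len = width;
    return true;
}
```

The lexer advances by the correct number of bytes but sees every multi-byte character as the same placeholder, which would be enough as long as it only needs to tell non-ASCII characters apart from ASCII ones. Swapping the placeholder for an uppercase code point would be a one-line change to the macro.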

@soutaro soutaro changed the title from "Fix lexer" to "Faster lexical analyzer" Sep 26, 2025
@soutaro soutaro added this pull request to the merge queue Sep 26, 2025
Merged via the queue into master with commit ffcb7e2 Sep 26, 2025
22 checks passed
@soutaro soutaro deleted the fix-lexer branch September 26, 2025 07:41
soutaro added a commit that referenced this pull request Oct 6, 2025
@soutaro soutaro mentioned this pull request Oct 6, 2025
soutaro added a commit that referenced this pull request Oct 6, 2025