Faster lexical analyzer #2665

soutaro · 2025-09-26T07:22:37Z

Extracted from #2652

This PR improves the data structure of the lexer in RBS.

It improves parsing performance from ~14 i/s to ~16 i/s, as measured by bin/benchmark-parse.rb.

I also have the `gem_rbs_collection` repository checked out so that the benchmark can load the `activerecord` RBS files.

Baseline

➜  rbs git:(38724282) bundle exec ruby bin/benchmark-parse.rb core/**/*.rbs ../../ruby/gem_rbs_collection/gems/activerecord/8.0/*.rbs sig/**/*.rbs
Benchmarking parsing 177 files...
✅ 14.506 i/s (±0.000%)

Fix lexer

➜  rbs git:(fix-lexer) bundle exec ruby bin/benchmark-parse.rb core/**/*.rbs ../../ruby/gem_rbs_collection/gems/activerecord/8.0/*.rbs sig/**/*.rbs
Benchmarking parsing 177 files...
✅ 16.667 i/s (±0.000%)

Formatting is disabled for `rbs_utf_8_dfa`: clang-format 20 reformats its lines with newlines, which we don't want.

soutaro · 2025-09-26T07:23:43Z

include/rbs/lexer.h

+    rbs_position_t current; /* The current position: just before the current_character */
+    rbs_position_t start;   /* The start position of the current token */
+
+    unsigned int current_code_point; /* Current character code point */


The lexer data structure now stores the code point of the next character, so peeking at the next character is much faster than reading it from the buffer.
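
The idea of caching the next code point can be illustrated with a minimal sketch. This is not the code from this PR: `demo_lexer_t` and the `demo_*` functions are hypothetical names, and the sketch assumes ASCII-only input to keep decoding trivial.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the lexer state (ASCII-only input assumed). */
typedef struct {
    const char *buf;     /* input buffer */
    size_t len;          /* buffer length in bytes */
    size_t byte_pos;     /* position just before the cached character */
    unsigned int cached; /* code point at byte_pos, or 0 at end of input */
} demo_lexer_t;

/* Decode the character at byte_pos once and store it in the cache. */
static void demo_fill_cache(demo_lexer_t *lx) {
    lx->cached = (lx->byte_pos < lx->len) ? (unsigned char) lx->buf[lx->byte_pos] : 0;
}

static void demo_init(demo_lexer_t *lx, const char *buf, size_t len) {
    assert(buf != NULL);
    lx->buf = buf;
    lx->len = len;
    lx->byte_pos = 0;
    demo_fill_cache(lx);
}

/* Peeking is a plain field read: no buffer access and no decoding. */
static unsigned int demo_peek(const demo_lexer_t *lx) {
    return lx->cached;
}

/* Advancing consumes the cached character and refills the cache. */
static void demo_advance(demo_lexer_t *lx) {
    if (lx->byte_pos < lx->len) {
        lx->byte_pos++;
        demo_fill_cache(lx);
    }
}
```

In this sketch the decoding cost is paid once per character in `demo_advance`, however many times `demo_peek` is called in between.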

soutaro · 2025-09-26T07:24:52Z

src/lexstate.c

+    return lexer->current_code_point;
+}
+
+bool rbs_next_char(rbs_lexer_t *lexer, unsigned int *codepoint, size_t *byte_len) {


This function assigns the next code point in the buffer and its byte length to the output parameters.

soutaro · 2025-09-26T07:25:54Z

src/lexstate.c

+    const char *start = lexer->string.start + lexer->current.byte_pos;
+
+    // Fast path for ASCII (single-byte) characters
+    if ((unsigned int) *start < 128) {


We assume the character encoding of RBS files is ASCII-compatible, like Ruby source files.
If the next character is an ASCII character, it is a single-byte character.
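
A minimal sketch of such an ASCII fast path is shown below. `demo_ascii_fast_path` is a hypothetical helper, not a function from this PR; it only illustrates the check, and leaves the multi-byte slow path to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true and fills the outputs when the byte at `p`
 * is ASCII; returns false so the caller can take the multi-byte slow path. */
static bool demo_ascii_fast_path(const char *p, unsigned int *codepoint, size_t *byte_len) {
    assert(p != NULL && codepoint != NULL && byte_len != NULL);

    /* Read the byte as unsigned so values >= 0x80 are never seen as negative. */
    unsigned char byte = (unsigned char) *p;

    if (byte < 128) {
        *codepoint = byte; /* an ASCII byte is its own code point */
        *byte_len = 1;     /* and always occupies exactly one byte */
        return true;
    }
    return false;
}
```

Because every ASCII-compatible encoding maps bytes below 128 to the same characters, the check does not need to know which encoding the file uses.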

soutaro · 2025-09-26T07:29:28Z

src/lexstate.c

-        unsigned int c = rbs_utf8_string_to_codepoint(str);
-        lexer->last_char = c;
-        return c;
+        *codepoint = 12523; // Dummy data for "ル" from "ルビー" (Ruby) in Unicode


Another hack to support encodings other than UTF-8.
The lexer doesn't know the exact Unicode code point of the next character in other encodings, so it returns a fixed dummy code point instead. The lexer still reads the character, but because the dummy code point has no special meaning to the lexer, this works fine.

We may want to return an uppercase character to support multi-byte class/constant names.
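
The fallback can be sketched as follows. Everything here is an assumption made for illustration: `demo_char_width_fn` stands in for whatever the real encoding layer provides, and `demo_two_byte_width` is a made-up encoding in which every non-ASCII lead byte starts a two-byte character. Only the placeholder value 12523 comes from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_DUMMY_CODEPOINT 12523u /* U+30EB, the placeholder value used in the diff */

/* Hypothetical encoding hook: byte width of the character at `p`, or 0 if invalid. */
typedef size_t (*demo_char_width_fn)(const char *p, const char *end);

/* Made-up example encoding: every non-ASCII lead byte starts a 2-byte character. */
static size_t demo_two_byte_width(const char *p, const char *end) {
    if ((unsigned char) *p < 128) return 1;
    return (end - p >= 2) ? 2 : 0;
}

/* Reads one character: the real code point for ASCII, a dummy one otherwise. */
static bool demo_next_char(const char *p, const char *end, demo_char_width_fn width,
                           unsigned int *codepoint, size_t *byte_len) {
    assert(width != NULL);
    if (p >= end) return false;

    unsigned char byte = (unsigned char) *p;
    if (byte < 128) {
        *codepoint = byte;
        *byte_len = 1;
        return true;
    }

    size_t w = width(p, end);
    if (w == 0) return false; /* truncated or invalid character */

    /* The byte length is exact; the code point is only a placeholder. */
    *codepoint = DEMO_DUMMY_CODEPOINT;
    *byte_len = w;
    return true;
}
```

The sketch keeps positions correct because the byte length is still exact, while the placeholder only needs to be a value that the lexer does not treat as syntax.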

soutaro added 7 commits September 26, 2025 11:51

Add -p cflag if $DEBUG is present

b1e9cb5

Add prepare_bench and prepare_profiling tasks

0e910dc

Disable formatting rbs_utf_8_dfa

e7bcf64

clang-format 20 formats the lines with new-lines, but we don't want it.

Add benchmarking/profiling scripts

3872428

Add RBS_LIKELY and RBS_UNLIKELY

70b3c11

Fix lexer data structure

52d1de6

Update assertion message

cd148f2

soutaro added this to the RBS 4.0 milestone Sep 26, 2025

soutaro changed the title ~~Fix lexer~~ Faster lexical analyzer Sep 26, 2025

soutaro added this pull request to the merge queue Sep 26, 2025

Merged via the queue into master with commit ffcb7e2 Sep 26, 2025
22 checks passed

soutaro deleted the fix-lexer branch September 26, 2025 07:41

soutaro added a commit that referenced this pull request Oct 6, 2025

Merge pull request #2665 from ruby/fix-lexer

a10410e

Faster lexical analyzer

soutaro mentioned this pull request Oct 6, 2025

Backport pure-C parser #2671

Merged

soutaro added a commit that referenced this pull request Oct 6, 2025

Merge pull request #2665 from ruby/fix-lexer

4bd271d

Faster lexical analyzer

amomchilov mentioned this pull request Oct 24, 2025

Degradation of parsing performance with v4.0.0.dev.4 #2563

Open
