
perf(parser): lexer replace Chars with Source #2288

Merged · 2 commits · Feb 5, 2024

Conversation

@overlookmotel (Collaborator) commented Feb 4, 2024

This PR replaces the Chars iterator in the lexer with a new structure Source.

What it does

Source holds the source text, and allows:

  • Iterating through source text char-by-char (same as Chars did).
  • Iterating byte-by-byte.
  • Getting a SourcePosition for the current position, which can be used later to rewind to that position, without having to clone the entire Source struct.

Source has the same invariant as Chars: the cursor must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode character).

However, unsafe APIs are provided to allow a caller to temporarily break that invariant, as long as they satisfy it again before they pass control back to safe code. This will be useful for processing batches of bytes.
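Conceptually, the API shape is something like the following index-based sketch. This is a hedged illustration only: the real `Source` uses raw pointers, and the names and details here are simplified.

```rust
// Index-based sketch of a `Source`-like cursor (illustrative; the real
// oxc `Source` is pointer-based). Invariant: `pos` is always on a UTF-8
// char boundary.
struct Source<'a> {
    text: &'a str,
    pos: usize,
}

#[derive(Clone, Copy, PartialEq, Debug)]
struct SourcePosition(usize);

impl<'a> Source<'a> {
    fn new(text: &'a str) -> Self {
        Self { text, pos: 0 }
    }

    /// Char-by-char iteration, equivalent to what `Chars` provided.
    fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }

    /// Byte-by-byte iteration. A caller advancing through non-ASCII
    /// bytes must restore the char-boundary invariant before handing
    /// control back to char-based code.
    fn next_byte(&mut self) -> Option<u8> {
        let b = *self.text.as_bytes().get(self.pos)?;
        self.pos += 1;
        Some(b)
    }

    /// Capture the current position: a small `Copy` value, so
    /// checkpoints need not clone the whole struct.
    fn position(&self) -> SourcePosition {
        SourcePosition(self.pos)
    }

    /// Rewind to a previously captured position.
    fn set_position(&mut self, p: SourcePosition) {
        self.pos = p.0;
    }
}

fn main() {
    let mut s = Source::new("let x");
    assert_eq!(s.next_char(), Some('l'));
    let checkpoint = s.position();
    assert_eq!(s.next_byte(), Some(b'e'));
    s.set_position(checkpoint); // cheap rewind, no clone
    assert_eq!(s.next_char(), Some('e'));
    println!("rewind ok");
}
```

In this safe index-based form `set_position` needs no `unsafe`; the complications described below arise from the pointer-based version.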

Why

I envisage most of the Lexer migrating to byte-by-byte iteration, and I believe it'll make a significant impact on performance.

It will allow efficiently processing batches of bytes (e.g. consuming identifiers or whitespace) without the overhead of calculating code points for every character. It should also make all the many peek(), next_char() and next_eq() calls faster.
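For example, consuming an identifier tail can compare plain bytes and only defer to char decoding at the first non-ASCII byte. A sketch (`eat_identifier_tail` is a hypothetical name, not oxc's actual API):

```rust
// Batch byte processing for an identifier tail: ASCII identifier bytes
// are consumed with plain byte compares — no code-point decoding — and
// anything else stops the loop.
fn eat_identifier_tail(bytes: &[u8], mut pos: usize) -> usize {
    while let Some(&b) = bytes.get(pos) {
        match b {
            b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b'_' | b'$' => pos += 1,
            // Non-ASCII byte or terminator: defer to the char-based slow path.
            _ => break,
        }
    }
    pos
}

fn main() {
    assert_eq!(eat_identifier_tail(b"foo_1 = 2", 0), 5);
    println!("identifier ends at byte 5");
}
```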

Source is also more performant than Chars in itself. This wasn't my intent, but seems to be a pleasant side-effect of it being less opaque to the compiler than Chars, so it can apply more optimizations.

In addition, because checkpoints don't need to store the entire Source struct, but only a SourcePosition (8 bytes), I was able to reduce the size of LexerCheckpoint and ParserCheckpoint, and make them both Copy.

Notes on implementation

Source is heavily based on Rust's std::str::Chars and std::slice::Iter iterators and I've copied the code/concepts from them as much as possible.

As it's a low-level primitive, it uses raw pointers and contains a lot of unsafe code. I think I've dotted the i's and crossed the t's, and I've commented the code extensively, but I'd appreciate a close review if anyone has time.

I've split it into 2 commits.

  • First commit is all the substantive changes.
  • 2nd commit just does away with lexer.current which is no longer needed, and replaces lexer.current.token with lexer.token everywhere.

Hopefully looking just at the 1st commit will reduce the noise and make it easier to review.

SourcePosition

There is one annoyance with the API which I haven't been able to solve:

SourcePosition is a wrapper around a pointer, which can only be created from the current position of Source. Due to the invariant mentioned above, a SourcePosition is therefore always in bounds of the source text, and points to a UTF-8 character boundary. So Source can be rewound to a SourcePosition cheaply, without any checks. I had originally envisaged Source::set_position being a safe function, as SourcePosition enforces the necessary invariants itself.

The fly in the ointment is that a SourcePosition could theoretically have been created from another Source. If that was the case, it would be out of bounds, and it would be instant UB. Consequently, Source::set_position has to be an unsafe function.

This feels rather ridiculous. Of course the parser won't create 2 Lexers at the same time. But still it's possible, so I think it's better to take the strict approach and make it unsafe until I can find a way to statically prove the safety by some other means. Any ideas?
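In sketch form, the pointer-flavored version forces a contract like this (illustrative names and fields, not the actual oxc code):

```rust
// Why `set_position` must be `unsafe` in a pointer-based cursor: a
// `SourcePosition` from a *different* `Source` would move `ptr` out of
// bounds, which is UB on the next read. Illustrative sketch only.
struct Source {
    start: *const u8,
    end: *const u8,
    ptr: *const u8,
}

#[derive(Clone, Copy)]
struct SourcePosition(*const u8);

impl Source {
    /// SAFETY: `pos` must have been created from *this* `Source`.
    unsafe fn set_position(&mut self, pos: SourcePosition) {
        // Catches misuse in debug builds only; release builds do no checks.
        debug_assert!(pos.0 >= self.start && pos.0 <= self.end);
        self.ptr = pos.0;
    }
}

fn main() {
    let text = "abc";
    let range = text.as_bytes().as_ptr_range();
    let mut src = Source { start: range.start, end: range.end, ptr: range.start };
    let saved = SourcePosition(src.ptr);
    unsafe {
        src.ptr = src.ptr.add(2); // advance 2 bytes
        src.set_position(saved); // rewind; caller vouches `saved` is ours
    }
    assert_eq!(src.ptr, src.start);
    println!("rewound to start");
}
```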

Oddity in the benchmarks

There's something really odd going on with the semantic benchmark for pdf.mjs.

While I was developing this, small and seemingly irrelevant changes would flip that benchmark from +0.5% or so to -4%, and then another small change would flip it back.

What I don't understand is that parsing happens outside of the measurement loop in the semantic benchmark, so the parser shouldn't have any effect either way on semantic's benchmarks.

If CodSpeed's flame graph is to be believed, most of the negative effect appears to be a large Vec reallocation happening somewhere in semantic.

I've ruled out a few things: The AST produced by the parser for pdf.mjs after this PR is identical to what it was before. And semantic's nodes and scopes Vecs are the same length as they were before. Nothing seems to have changed!

I really am at a loss to explain it. Have you seen anything like this before?

One possibility is a fault in my unsafe code which is manifesting only with pdf.mjs, and it's triggering UB, which I guess could explain the weird effects. I'm running the parser on pdf.mjs in Miri now and will see if it finds anything (Miri doesn't find any problem running the tests). It's been running for over an hour now. Hopefully it'll be done by morning!

I feel like this shouldn't be merged until that question is resolved, so I'm marking it as draft in the meantime.


@github-actions github-actions bot added the A-parser Area - Parser label Feb 4, 2024
@overlookmotel overlookmotel marked this pull request as draft February 4, 2024 02:14
codspeed-hq bot commented Feb 4, 2024

CodSpeed Performance Report

Merging #2288 will improve performance by 9.41%

⚠️ No base runs were found

Falling back to comparing overlookmotel:lexer-source4 (51da40b) with main (b27079c)

Summary

⚡ 8 improvements
✅ 19 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | overlookmotel:lexer-source4 | Change |
| --- | --- | --- | --- |
| lexer[pdf.mjs] | 37.5 ms | 34.7 ms | +7.81% |
| lexer[antd.js] | 246 ms | 227.1 ms | +8.28% |
| lexer[RadixUIAdoptionSection.jsx] | 203.4 µs | 193.1 µs | +5.31% |
| parser[antd.js] | 792.9 ms | 768.7 ms | +3.15% |
| parser[checker.ts] | 434 ms | 417.3 ms | +4% |
| lexer[cal.com.tsx] | 63.2 ms | 58.9 ms | +7.21% |
| parser[cal.com.tsx] | 182.4 ms | 175.6 ms | +3.83% |
| lexer[checker.ts] | 157.7 ms | 144.2 ms | +9.41% |

@Boshen (Member) commented Feb 4, 2024

[screenshot: CodSpeed flame graph]

Interesting to see most of the speed ups come from identifier_tail, which is the only function that shows up from linter performance profiles.

@Boshen (Member) commented Feb 4, 2024

The semantic performance change is flaky, we can safely ignore it.

@Boshen (Member) commented Feb 4, 2024

Don't worry about the codspeed report, feel free to merge once you're satisfied with code since this is still marked as a draft.

@overlookmotel (Collaborator, Author) commented:

> Interesting to see most of the speed ups come from identifier_tail, which is the only function that shows up from linter performance profiles.

I'm a bit dubious about CodSpeed's flame graph diffs. I think the BYTE_HANDLERS jump table makes it difficult for it to "match up" the functions before and after, perhaps because they're all called core::ops::function::FnOnce::call_once. In this case, it shows one byte handler function as being 6x faster (the one in green) but another one as 99.76% slower (the one in red). I don't think either of those results is plausible, so likely it's comparing one byte handler before to a different one after in both cases.

> The semantic performance change is flaky, we can safely ignore it.

I'm still a bit concerned it may possibly indicate UB. The Miri run completed last night but the output was unreadable because it found 1150 memory leaks (the ones we already know about). I'm running it again with memory leak detection disabled. Will report back.

Can I ask: Have you ever seen flaky results with such a large margin (4%) before?

Also, now that I've pushed the extra commit which added a comment, it's got much worse! (though that may be in part due to unrelated changes on main since last night)

@Boshen (Member) commented Feb 4, 2024

I rebased and the report is correct now.

CodSpeed sometimes gets flaky when there is a large memory allocation; don't worry about it.

> with such a large margin (4%)

Don't forget to look at the absolute numbers.

Remember, it's only a regression when all of the benchmarks come back with a bad number.

@overlookmotel (Collaborator, Author) commented Feb 4, 2024

Have rebased on top of #2301 to see if the lexer flame graph with named functions sheds any more light on the weird semantic benchmark.

I'm also running Miri again (4 hours and counting!). If it doesn't find anything, I think we're all good. But I'm nervous of merging this until Miri gives the all clear. I'm reassured to hear that the semantic benchmarks are known to be flaky, but there's still so much unsafe raw pointer action in this PR that it'd be easy for me to have made a mistake which causes UB. So I think it's best to err on the cautious side.

@overlookmotel overlookmotel marked this pull request as ready for review February 4, 2024 21:16
@overlookmotel overlookmotel marked this pull request as draft February 4, 2024 21:16
@overlookmotel (Collaborator, Author) commented Feb 4, 2024

Miri gives it a pass. Seems I was worrying too much.

So I think now ready to merge. But @Boshen could you please merge #2301 and #2304 first? (if you're happy with them)

And if you didn't already, would really appreciate a fairly close review as there's so much that could go wrong here.

@Boshen (Member) commented Feb 4, 2024

I'll run all the fuzzers when I wake up.

@Boshen (Member) commented Feb 5, 2024

> I'll run all the fuzzers when I wake up.

I don't have the time today.

Here's what you can try:

@Boshen (Member) commented Feb 5, 2024

Random bytes fuzzer: the parser timed out on `$g< g<g<(b<b<(r=b<(r=g<(i<b<(r=b=g<(i<b<(r=b<(gb=g<(i<b<(r=b<(g<(g<g<(i<b<fh?.<[h<(/(/`, which is a known issue around rewind.

AST fuzzer ran for 30k iterations and found no issues.

I think we are good to go after the merge conflict is resolved.

@overlookmotel overlookmotel marked this pull request as ready for review February 5, 2024 13:29
@overlookmotel (Collaborator, Author) commented Feb 5, 2024

Thanks very much for running the fuzzer. It feels like we can be pretty confident now that there's no problem with the changes introduced in this PR.

Have rebased on main, and now ready for merge once CI has passed.

Weirdly, the semantic[pdf.mjs] benchmark is now +2%!

@overlookmotel overlookmotel merged commit cdef41d into oxc-project:main Feb 5, 2024
21 checks passed
@overlookmotel (Collaborator, Author) commented:

CI passes. Have merged. Hooray! Thanks loads for all your help on this @Boshen.

@overlookmotel overlookmotel deleted the lexer-source4 branch February 5, 2024 13:53
@Boshen Boshen changed the title refactor(parser): lexer replace Chars with Source perf(parser): lexer replace Chars with Source Feb 5, 2024
Boshen pushed a commit that referenced this pull request Feb 8, 2024
Make `Source::set_position` a safe function.

This addresses a shortcoming of #2288.

Instead of requiring the caller of `Source::set_position` to guarantee that the `SourcePosition` was created from this `Source`, the preceding PRs enforce this guarantee at the type level.

`Source::set_position` is going to be a central API for transitioning the lexer to processing the source as bytes, rather than `char`s (and the anticipated speed-ups that will produce). So making this method safe will remove the need for a *lot* of unsafe code blocks, and boilerplate comments promising "SAFETY: There's only one `Source`", when to the developer, this is blindingly obvious anyway.
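One way such a type-level guarantee can be encoded is the classic invariant-lifetime "branding" pattern: each `Source` is minted with a fresh, unnameable lifetime, so a `SourcePosition` carrying that brand can only be fed back to the `Source` it came from. This is a hedged sketch of the general technique with hypothetical names; the actual PRs may use a different mechanism.

```rust
use std::marker::PhantomData;

// `*mut &'id ()` makes `'id` invariant, so brands from two different
// `with_source` calls can never be unified by the compiler.
#[derive(Clone, Copy)]
struct Brand<'id>(PhantomData<*mut &'id ()>);

struct Source<'id, 't> {
    text: &'t str,
    pos: usize,
    brand: Brand<'id>,
}

#[derive(Clone, Copy)]
struct SourcePosition<'id> {
    offset: usize,
    brand: Brand<'id>,
}

impl<'id, 't> Source<'id, 't> {
    fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
    fn position(&self) -> SourcePosition<'id> {
        SourcePosition { offset: self.pos, brand: self.brand }
    }
    // Safe: the brand proves `p` was created by *this* source.
    fn set_position(&mut self, p: SourcePosition<'id>) {
        self.pos = p.offset;
    }
}

// The higher-ranked closure bound mints a fresh `'id` per call, so a
// `SourcePosition` cannot escape to, or arrive from, another `Source`.
fn with_source<R>(text: &str, f: impl for<'id> FnOnce(Source<'id, '_>) -> R) -> R {
    f(Source { text, pos: 0, brand: Brand(PhantomData) })
}

fn main() {
    let n = with_source("ab", |mut s| {
        let start = s.position();
        s.next_char();
        s.set_position(start); // safe rewind, no `unsafe` block needed
        let mut count = 0;
        while s.next_char().is_some() {
            count += 1;
        }
        count
    });
    assert_eq!(n, 2);
    println!("branded rewind ok");
}
```

Passing a `SourcePosition` from one `with_source` closure into another fails to compile, which is exactly the misuse `unsafe` previously guarded against.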

So, while splitting the parser into `Parser` and `ParserImpl` (#2339) is an annoying change to have to make, I believe the benefit of this PR justifies it.
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
(The commit message duplicates the PR description above.)
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
(The commit message duplicates the follow-up commit described above.)