
perf(parser): lexer replace Chars with Source #2288

Merged · 2 commits · Feb 5, 2024

Conversation

@overlookmotel (Collaborator) commented Feb 4, 2024

This PR replaces the Chars iterator in the lexer with a new structure Source.

What it does

Source holds the source text, and allows:

  • Iterating through source text char-by-char (same as Chars did).
  • Iterating byte-by-byte.
  • Getting a SourcePosition for the current position, which can be used later to rewind to that position, without having to clone the entire Source struct.

Source has the same invariant as Chars: the cursor must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode character).

However, unsafe APIs are provided to allow a caller to temporarily break that invariant, as long as they satisfy it again before they pass control back to safe code. This will be useful for processing batches of bytes.
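Conceptually, the API shape is something like the following index-based sketch. This is a hedged illustration only: the real `Source` uses raw pointers, and the names and details here are simplified.

```rust
// Index-based sketch of a `Source`-like cursor (illustrative; the real
// oxc `Source` is pointer-based). Invariant: `pos` is always on a UTF-8
// char boundary.
struct Source<'a> {
    text: &'a str,
    pos: usize,
}

#[derive(Clone, Copy, PartialEq, Debug)]
struct SourcePosition(usize);

impl<'a> Source<'a> {
    fn new(text: &'a str) -> Self {
        Self { text, pos: 0 }
    }

    /// Char-by-char iteration, equivalent to what `Chars` provided.
    fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }

    /// Byte-by-byte iteration. A caller advancing through non-ASCII
    /// bytes must restore the char-boundary invariant before handing
    /// control back to char-based code.
    fn next_byte(&mut self) -> Option<u8> {
        let b = *self.text.as_bytes().get(self.pos)?;
        self.pos += 1;
        Some(b)
    }

    /// Capture the current position: a small `Copy` value, so
    /// checkpoints need not clone the whole struct.
    fn position(&self) -> SourcePosition {
        SourcePosition(self.pos)
    }

    /// Rewind to a previously captured position.
    fn set_position(&mut self, p: SourcePosition) {
        self.pos = p.0;
    }
}

fn main() {
    let mut s = Source::new("let x");
    assert_eq!(s.next_char(), Some('l'));
    let checkpoint = s.position();
    assert_eq!(s.next_byte(), Some(b'e'));
    s.set_position(checkpoint); // cheap rewind, no clone
    assert_eq!(s.next_char(), Some('e'));
    println!("rewind ok");
}
```

In this safe index-based form `set_position` needs no `unsafe`; the complications described below arise from the pointer-based version.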

Why

I envisage most of the Lexer migrating to byte-by-byte iteration, and I believe it'll make a significant impact on performance.

It will allow efficiently processing batches of bytes (e.g. consuming identifiers or whitespace) without the overhead of calculating code points for every character. It should also make all the many peek(), next_char() and next_eq() calls faster.
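For example, consuming an identifier tail can compare plain bytes and only defer to char decoding at the first non-ASCII byte. A sketch (`eat_identifier_tail` is a hypothetical name, not oxc's actual API):

```rust
// Batch byte processing for an identifier tail: ASCII identifier bytes
// are consumed with plain byte compares — no code-point decoding — and
// anything else stops the loop.
fn eat_identifier_tail(bytes: &[u8], mut pos: usize) -> usize {
    while let Some(&b) = bytes.get(pos) {
        match b {
            b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b'_' | b'$' => pos += 1,
            // Non-ASCII byte or terminator: defer to the char-based slow path.
            _ => break,
        }
    }
    pos
}

fn main() {
    assert_eq!(eat_identifier_tail(b"foo_1 = 2", 0), 5);
    println!("identifier ends at byte 5");
}
```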

Source is also more performant than Chars in itself. This wasn't my intent, but seems to be a pleasant side-effect of it being less opaque to the compiler than Chars, so it can apply more optimizations.

In addition, because checkpoints don't need to store the entire Source struct, but only a SourcePosition (8 bytes), I was able to reduce the size of LexerCheckpoint and ParserCheckpoint, and make them both Copy.

Notes on implementation

Source is heavily based on Rust's std::str::Chars and std::slice::Iter iterators and I've copied the code/concepts from them as much as possible.

As it's a low-level primitive, it uses raw pointers and contains a lot of unsafe code. I think I've dotted the i's and crossed the t's, and I've commented the code extensively, but I'd appreciate a close review if anyone has time.

I've split it into 2 commits.

  • First commit is all the substantive changes.
  • 2nd commit just does away with lexer.current which is no longer needed, and replaces lexer.current.token with lexer.token everywhere.

Hopefully looking just at the 1st commit will reduce the noise and make it easier to review.

SourcePosition

There is one annoyance with the API which I haven't been able to solve:

SourcePosition is a wrapper around a pointer, which can only be created from the current position of Source. Due to the invariant mentioned above, a SourcePosition is therefore always in bounds of the source text, and points to a UTF-8 character boundary. So Source can be rewound to a SourcePosition cheaply, without any checks. I had originally envisaged Source::set_position being a safe function, as SourcePosition enforces the necessary invariants itself.

The fly in the ointment is that a SourcePosition could theoretically have been created from another Source. If that was the case, it would be out of bounds, and it would be instant UB. Consequently, Source::set_position has to be an unsafe function.

This feels rather ridiculous. Of course the parser won't create 2 Lexers at the same time. But still it's possible, so I think it's better to take the strict approach and make it unsafe until I can find a way to statically prove the safety by some other means. Any ideas?
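In sketch form, the pointer-flavored version forces a contract like this (illustrative names and fields, not the actual oxc code):

```rust
// Why `set_position` must be `unsafe` in a pointer-based cursor: a
// `SourcePosition` from a *different* `Source` would move `ptr` out of
// bounds, which is UB on the next read. Illustrative sketch only.
struct Source {
    start: *const u8,
    end: *const u8,
    ptr: *const u8,
}

#[derive(Clone, Copy)]
struct SourcePosition(*const u8);

impl Source {
    /// SAFETY: `pos` must have been created from *this* `Source`.
    unsafe fn set_position(&mut self, pos: SourcePosition) {
        // Catches misuse in debug builds only; release builds do no checks.
        debug_assert!(pos.0 >= self.start && pos.0 <= self.end);
        self.ptr = pos.0;
    }
}

fn main() {
    let text = "abc";
    let range = text.as_bytes().as_ptr_range();
    let mut src = Source { start: range.start, end: range.end, ptr: range.start };
    let saved = SourcePosition(src.ptr);
    unsafe {
        src.ptr = src.ptr.add(2); // advance 2 bytes
        src.set_position(saved); // rewind; caller vouches `saved` is ours
    }
    assert_eq!(src.ptr, src.start);
    println!("rewound to start");
}
```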

Oddity in the benchmarks

There's something really odd going on with the semantic benchmark for pdf.mjs.

While I was developing this, small and seemingly irrelevant changes would flip that benchmark from +0.5% or so to -4%, and then another small change would flip it back.

What I don't understand is that parsing happens outside of the measurement loop in the semantic benchmark, so the parser shouldn't have any effect either way on semantic's benchmarks.

If CodSpeed's flame graph is to be believed, most of the negative effect appears to be a large Vec reallocation happening somewhere in semantic.

I've ruled out a few things: The AST produced by the parser for pdf.mjs after this PR is identical to what it was before. And semantic's nodes and scopes Vecs are the same length as they were before. Nothing seems to have changed!

I really am at a loss to explain it. Have you seen anything like this before?

One possibility is a fault in my unsafe code which is manifesting only with pdf.mjs, and it's triggering UB, which I guess could explain the weird effects. I'm running the parser on pdf.mjs in Miri now and will see if it finds anything (Miri doesn't find any problem running the tests). It's been running for over an hour now. Hopefully it'll be done by morning!

I feel like this shouldn't be merged until that question is resolved, so I'm marking it as draft in the meantime.


@github-actions github-actions bot added the A-parser Area - Parser label Feb 4, 2024
@overlookmotel overlookmotel marked this pull request as draft February 4, 2024 02:14
codspeed-hq bot commented Feb 4, 2024

CodSpeed Performance Report

Merging #2288 will improve performance by 9.41%

⚠️ No base runs were found

Falling back to comparing overlookmotel:lexer-source4 (51da40b) with main (b27079c)

Summary

⚡ 8 improvements
✅ 19 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | overlookmotel:lexer-source4 | Change |
| --- | --- | --- | --- |
| lexer[pdf.mjs] | 37.5 ms | 34.7 ms | +7.81% |
| lexer[antd.js] | 246 ms | 227.1 ms | +8.28% |
| lexer[RadixUIAdoptionSection.jsx] | 203.4 µs | 193.1 µs | +5.31% |
| parser[antd.js] | 792.9 ms | 768.7 ms | +3.15% |
| parser[checker.ts] | 434 ms | 417.3 ms | +4% |
| lexer[cal.com.tsx] | 63.2 ms | 58.9 ms | +7.21% |
| parser[cal.com.tsx] | 182.4 ms | 175.6 ms | +3.83% |
| lexer[checker.ts] | 157.7 ms | 144.2 ms | +9.41% |

@Boshen (Member) commented Feb 4, 2024

[screenshot: CodSpeed flame graph]

Interesting to see most of the speed ups come from identifier_tail, which is the only function that shows up from linter performance profiles.

@Boshen (Member) commented Feb 4, 2024

The semantic performance change is flaky, we can safely ignore it.

@Boshen (Member) commented Feb 4, 2024

Don't worry about the codspeed report, feel free to merge once you're satisfied with code since this is still marked as a draft.

@overlookmotel (Collaborator, Author) commented:

> Interesting to see most of the speed ups come from identifier_tail, which is the only function that shows up from linter performance profiles.

I'm a bit dubious about CodSpeed's flame graph diffs. I think the BYTE_HANDLERS jump table makes it difficult for it to "match up" the functions before and after, perhaps because they're all called core::ops::function::FnOnce::call_once. In this case, it shows one byte handler function as being 6x faster (the one in green) but another one as 99.76% slower (the one in red). I don't think either of those results is plausible, so likely it's comparing one byte handler before to a different one after in both cases.

> The semantic performance change is flaky, we can safely ignore it.

I'm still a bit concerned it may possibly indicate UB. The Miri run completed last night but the output was unreadable because it found 1150 memory leaks (the ones we already know about). I'm running it again with memory leak detection disabled. Will report back.

Can I ask: Have you ever seen flaky results with such a large margin (4%) before?

Also, now that I've pushed the extra commit which added a comment, it's got much worse! (though that may be in part due to unrelated changes on main since last night)

@Boshen (Member) commented Feb 4, 2024

I rebased and the report is correct now.

CodSpeed sometimes gets flaky when there is a large memory allocation; don't worry about it.

> with such a large margin (4%)

Don't forget to look at the absolute numbers.

Remember, it's only a regression when all of the benchmarks come back with a bad number.

@overlookmotel (Collaborator, Author) commented Feb 4, 2024

Have rebased on top of #2301 to see if the lexer flame graph with named functions sheds any more light on the weird semantic benchmark.

I'm also running Miri again (4 hours and counting!). If it doesn't find anything, I think we're all good. But I'm nervous of merging this until Miri gives the all clear. I'm reassured to hear that the semantic benchmarks are known to be flaky, but there's still so much unsafe raw pointer action in this PR that it'd be easy for me to have made a mistake which causes UB. So I think it's best to err on the cautious side.

@overlookmotel overlookmotel marked this pull request as ready for review February 4, 2024 21:16
@overlookmotel overlookmotel marked this pull request as draft February 4, 2024 21:16
@overlookmotel (Collaborator, Author) commented Feb 4, 2024

Miri gives it a pass. Seems I was worrying too much.

So I think now ready to merge. But @Boshen could you please merge #2301 and #2304 first? (if you're happy with them)

And if you didn't already, would really appreciate a fairly close review as there's so much that could go wrong here.

@Boshen (Member) commented Feb 4, 2024

I'll run all the fuzzers when I wake up.

@Boshen (Member) commented Feb 5, 2024

> I'll run all the fuzzers when I wake up.

I don't have the time today.

Here's what you can try:

@Boshen (Member) commented Feb 5, 2024

Random bytes fuzzer: the parser timed out on `$g< g<g<(b<b<(r=b<(r=g<(i<b<(r=b=g<(i<b<(r=b<(gb=g<(i<b<(r=b<(g<(g<g<(i<b<fh?.<[h<(/(/`, which is a known issue around rewind.

AST fuzzer ran for 30k iterations and found no issues.

I think we are good to go after the merge conflict is resolved.

@overlookmotel overlookmotel marked this pull request as ready for review February 5, 2024 13:29
@overlookmotel (Collaborator, Author) commented Feb 5, 2024

Thanks very much for running the fuzzer. It feels like we can be pretty confident now that there's no problem with the changes introduced in this PR.

Have rebased on main, and now ready for merge once CI has passed.

Weirdly, the semantic[pdf.mjs] benchmark is now +2%!

@overlookmotel overlookmotel merged commit cdef41d into oxc-project:main Feb 5, 2024
21 checks passed
@overlookmotel (Collaborator, Author) commented:

CI passes. Have merged. Hooray! Thanks loads for all your help on this @Boshen.

@overlookmotel overlookmotel deleted the lexer-source4 branch February 5, 2024 13:53
@Boshen Boshen changed the title refactor(parser): lexer replace Chars with Source perf(parser): lexer replace Chars with Source Feb 5, 2024
Boshen pushed a commit that referenced this pull request Feb 8, 2024
Make `Source::set_position` a safe function.

This addresses a shortcoming of #2288.

Instead of requiring the caller of `Source::set_position` to guarantee that the `SourcePosition` was created from this `Source`, the preceding PRs enforce this guarantee at the type level.

`Source::set_position` is going to be a central API for transitioning the lexer to processing the source as bytes, rather than `char`s (and the anticipated speed-ups that will produce). So making this method safe will remove the need for a *lot* of unsafe code blocks, and boilerplate comments promising "SAFETY: There's only one `Source`", when to the developer, this is blindingly obvious anyway.
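One way such a type-level guarantee can be encoded is the classic invariant-lifetime "branding" pattern: each `Source` is minted with a fresh, unnameable lifetime, so a `SourcePosition` carrying that brand can only be fed back to the `Source` it came from. This is a hedged sketch of the general technique with hypothetical names; the actual PRs may use a different mechanism.

```rust
use std::marker::PhantomData;

// `*mut &'id ()` makes `'id` invariant, so brands from two different
// `with_source` calls can never be unified by the compiler.
#[derive(Clone, Copy)]
struct Brand<'id>(PhantomData<*mut &'id ()>);

struct Source<'id, 't> {
    text: &'t str,
    pos: usize,
    brand: Brand<'id>,
}

#[derive(Clone, Copy)]
struct SourcePosition<'id> {
    offset: usize,
    brand: Brand<'id>,
}

impl<'id, 't> Source<'id, 't> {
    fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
    fn position(&self) -> SourcePosition<'id> {
        SourcePosition { offset: self.pos, brand: self.brand }
    }
    // Safe: the brand proves `p` was created by *this* source.
    fn set_position(&mut self, p: SourcePosition<'id>) {
        self.pos = p.offset;
    }
}

// The higher-ranked closure bound mints a fresh `'id` per call, so a
// `SourcePosition` cannot escape to, or arrive from, another `Source`.
fn with_source<R>(text: &str, f: impl for<'id> FnOnce(Source<'id, '_>) -> R) -> R {
    f(Source { text, pos: 0, brand: Brand(PhantomData) })
}

fn main() {
    let n = with_source("ab", |mut s| {
        let start = s.position();
        s.next_char();
        s.set_position(start); // safe rewind, no `unsafe` block needed
        let mut count = 0;
        while s.next_char().is_some() {
            count += 1;
        }
        count
    });
    assert_eq!(n, 2);
    println!("branded rewind ok");
}
```

Passing a `SourcePosition` from one `with_source` closure into another fails to compile, which is exactly the misuse `unsafe` previously guarded against.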

So, while splitting the parser into `Parser` and `ParserImpl` (#2339) is an annoying change to have to make, I believe the benefit of this PR justifies it.
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
(The commit message duplicates the PR description above.)
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
(The commit message duplicates the follow-up commit described above.)