Remove bounds checks in Lexer by padding source

### Discussed in https://github.com/oxc-project/backlog/discussions/19

<div type='discussions-op-text'>

<sup>Originally posted by **overlookmotel** February  9, 2024</sup>
This came up on thread about SIMD in the Lexer https://github.com/oxc-project/oxc/issues/2296#issuecomment-1929343366, but opening a separate issue for it here, as it could benefit the Lexer's performance, even without SIMD.

Currently bounds checks (and the resulting branches) occur on every single call to `Lexer::peek`, `Lexer::next_eq` etc.

The idea is to allow removing all bounds checks by padding the source text with sentinel bytes indicating end of file. Instead of bounds checks, the lexer knows it's reached EOF when the EOF sentinel is found.

As the lexer is constantly checking byte values anyway, this *should* be more performant than checking byte values *and* doing bounds checks.

[ratel](https://github.com/ratel-rust/ratel-core) uses this technique (code [here](https://github.com/ratel-rust/ratel-core/blob/master/ratel/src/lexer/mod.rs#L630-L653) and [here](https://github.com/ratel-rust/toolshed/blob/master/src/arena.rs#L266-L280)).

In addition, to make SIMD both faster and easier to implement, could extend this approach and pad the source with `SIMD_LANES` (16/32/64) bytes. Then a SIMD block read is always valid, so no need to implement scalar fallbacks for towards end of the file.

This has been tried before in OXC and was removed:

* https://github.com/oxc-project/oxc/issues/43
* https://github.com/oxc-project/oxc/pull/90

It is unclear to me at present if the disadvantages which lead to its removal still apply or not. The new `Source` API may make it more ergonomic than it was previously. In any case, if it enables SIMD, that may change the cost/benefit calculus.

My questions about this approach are:

* Does the cost of copying the whole source text at the start outweigh the gain of removing bounds checks everywhere else?
* Could we add an entry point to the parser which takes a mutable `String` as source text? If that String already has excess capacity, could apply this approach without the cost of copying the source text.
* For immutable `&str` input, could we use page table tricks like [slice_deque](https://github.com/gnzlbg/slice_deque) does to create an extended buffer, while only copying the last chunk of the source?
* Is there a way to build a safe abstraction which statically forces every part of the lexer to check for and handle the EOF sentinel? (rather than making the entire lexer a pit of unsafe)

The last is my greatest concern. It would be really nice to use the type system to enforce safety.</div>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove bounds checks in Lexer by padding source #20

Discussed in https://github.com/oxc-project/backlog/discussions/19

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Remove bounds checks in Lexer by padding source #20

Description

Discussed in https://github.com/oxc-project/backlog/discussions/19

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions