Skip to content

Remove bounds checks in Lexer by padding source #20

@Boshen

Description

@Boshen

Discussed in https://github.com/oxc-project/backlog/discussions/19

Originally posted by overlookmotel February 9, 2024
This came up on thread about SIMD in the Lexer oxc-project/oxc#2296 (comment), but opening a separate issue for it here, as it could benefit the Lexer's performance, even without SIMD.

Currently bounds checks (and the resulting branches) occur on every single call to Lexer::peek, Lexer::next_eq etc.

The idea is to allow removing all bounds checks by padding the source text with sentinel bytes indicating end of file. Instead of bounds checks, the lexer knows it's reached EOF when the EOF sentinel is found.

As the lexer is constantly checking byte values anyway, this should be more performant than checking byte values and doing bounds checks.

ratel uses this technique (code here and here).

In addition, to make SIMD both faster and easier to implement, could extend this approach and pad the source with SIMD_LANES (16/32/64) bytes. Then a SIMD block read is always valid, so no need to implement scalar fallbacks for towards end of the file.

This has been tried before in OXC and was removed:

It is unclear to me at present if the disadvantages which lead to its removal still apply or not. The new Source API may make it more ergonomic than it was previously. In any case, if it enables SIMD, that may change the cost/benefit calculus.

My questions about this approach are:

  • Does the cost of copying the whole source text at the start outweigh the gain of removing bounds checks everywhere else?
  • Could we add an entry point to the parser which takes a mutable String as source text? If that String already has excess capacity, could apply this approach without the cost of copying the source text.
  • For immutable &str input, could we use page table tricks like slice_deque does to create an extended buffer, while only copying the last chunk of the source?
  • Is there a way to build a safe abstraction which statically forces every part of the lexer to check for and handle the EOF sentinel? (rather than making the entire lexer a pit of unsafe)

The last is my greatest concern. It would be really nice to use the type system to enforce safety.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions