-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Discussed in https://github.com/oxc-project/backlog/discussions/19
Originally posted by overlookmotel February 9, 2024
This came up on thread about SIMD in the Lexer oxc-project/oxc#2296 (comment), but opening a separate issue for it here, as it could benefit the Lexer's performance, even without SIMD.
Currently bounds checks (and the resulting branches) occur on every single call to Lexer::peek, Lexer::next_eq etc.
The idea is to allow removing all bounds checks by padding the source text with sentinel bytes indicating end of file. Instead of bounds checks, the lexer knows it's reached EOF when the EOF sentinel is found.
As the lexer is constantly checking byte values anyway, this should be more performant than checking byte values and doing bounds checks.
ratel uses this technique (code here and here).
In addition, to make SIMD both faster and easier to implement, could extend this approach and pad the source with SIMD_LANES (16/32/64) bytes. Then a SIMD block read is always valid, so no need to implement scalar fallbacks for towards end of the file.
This has been tried before in OXC and was removed:
- Refactor: try aligned string for simd in lexer oxc#43
- refactor: pad raw content to parse with 16 empty bytes oxc#90
It is unclear to me at present if the disadvantages which lead to its removal still apply or not. The new Source API may make it more ergonomic than it was previously. In any case, if it enables SIMD, that may change the cost/benefit calculus.
My questions about this approach are:
- Does the cost of copying the whole source text at the start outweigh the gain of removing bounds checks everywhere else?
- Could we add an entry point to the parser which takes a mutable
Stringas source text? If that String already has excess capacity, could apply this approach without the cost of copying the source text. - For immutable
&strinput, could we use page table tricks like slice_deque does to create an extended buffer, while only copying the last chunk of the source? - Is there a way to build a safe abstraction which statically forces every part of the lexer to check for and handle the EOF sentinel? (rather than making the entire lexer a pit of unsafe)
The last is my greatest concern. It would be really nice to use the type system to enforce safety.