Optimize the "temporary buffer" #112

Open
untitaker opened this issue Nov 6, 2024 · 0 comments
untitaker commented Nov 6, 2024

I tried to get rid of the temporary buffer today. It's an allocation that is done per tokenizer, and that bugs me because it's the only memory allocation that is not controllable by (or reusable for) the user. The attempt didn't work out, but there is some opportunity to make the buffer faster and/or use fewer allocations.

The naive approach would be to add a bunch of operations to the Emitter trait so that the temporary buffer moves as-is from the tokenizer/machine_helper into the emitter. However, this trait is already too complicated and has too many mandatory methods. I would also like to avoid adding more things to Emitter whose correctness can influence the correctness of the tokenizer itself.

Sometimes the temporary buffer is only used to compare its value against either a static string ("script") or against the most recently emitted start tag (the "appropriate end tag token" check). Those situations can be individually optimized:

  • For the purpose of pushing repeatedly into it, then comparing its value to "script", a dedicated state machine can be built instead of a buffer.
  • For the purpose of comparing to the most recently emitted start tag, I believe there's even potential to move state from the emitter back into the tokenizer, simplifying the trait.
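The first bullet can be sketched as follows. This is a minimal illustration, not html5gum's actual code: the ScriptMatcher name and its push/is_match API are invented here. The idea is that matching against a known static string only needs an index into that string plus a "failed" flag, so no buffer is allocated at all.

```rust
// Hypothetical sketch: match incoming characters against the static string
// "script" without buffering them. State is just an index plus a flag.
struct ScriptMatcher {
    pos: usize,
    failed: bool,
}

impl ScriptMatcher {
    const TARGET: &'static [u8] = b"script";

    fn new() -> Self {
        ScriptMatcher { pos: 0, failed: false }
    }

    /// Feed one character (assumed already lowercased by the tokenizer).
    fn push(&mut self, c: u8) {
        if self.failed || self.pos >= Self::TARGET.len() || Self::TARGET[self.pos] != c {
            self.failed = true;
        } else {
            self.pos += 1;
        }
    }

    /// True iff exactly "script" was fed, no more and no less.
    fn is_match(&self) -> bool {
        !self.failed && self.pos == Self::TARGET.len()
    }
}

fn main() {
    let mut m = ScriptMatcher::new();
    for &c in b"script" {
        m.push(c);
    }
    assert!(m.is_match());

    // A prefix alone is not a match, and neither is a longer string.
    let mut m2 = ScriptMatcher::new();
    for &c in b"scr" {
        m2.push(c);
    }
    assert!(!m2.is_match());

    println!("ok");
}
```

The state machine here is equivalent to pushing into the buffer and comparing at the end, but it uses constant memory and can report failure as soon as the first non-matching character arrives.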

In a few other cases, I believe the temporary buffer is unavoidable and must be allocated (character references).

There is also the open question of whether the temporary buffer should be able to grow unbounded. Right now it can, like this:

echo '<textarea></asdadasdasdasdasdasdasdasd' | cargo run --example tokenize_with_state_switches

(to observe this, patch all uses of temporary_buffer.extend/push to print the buffer's length)

This situation could be prevented with a buffer size limit. However, I don't think it's necessary to push into the buffer in this situation at all: the state machine should instead bail out early as soon as it sees that the end tag name does not start with "t" for "textarea".
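The early bail-out could look roughly like the following. Again a hypothetical sketch with invented names (EndTagMatcher, push, is_appropriate), not html5gum's API: the matcher compares each incoming character of a candidate end tag name against the name of the most recent start tag and gives up on the first mismatch, so pathological input like the `</asdad...` above never grows a buffer.

```rust
// Hypothetical sketch: incrementally match a candidate end tag name against
// the most recently emitted start tag name, bailing out on first mismatch.
struct EndTagMatcher<'a> {
    expected: &'a [u8], // e.g. b"textarea"; assumed already lowercased
    pos: usize,
    failed: bool,
}

impl<'a> EndTagMatcher<'a> {
    fn new(expected: &'a [u8]) -> Self {
        EndTagMatcher { expected, pos: 0, failed: false }
    }

    /// Feed one character of the candidate end tag name. Returns false as
    /// soon as the name can no longer match, letting the state machine
    /// abandon the comparison without buffering anything.
    fn push(&mut self, c: u8) -> bool {
        if self.failed || self.pos >= self.expected.len() || self.expected[self.pos] != c {
            self.failed = true;
        } else {
            self.pos += 1;
        }
        !self.failed
    }

    /// True iff the full expected name was fed.
    fn is_appropriate(&self) -> bool {
        !self.failed && self.pos == self.expected.len()
    }
}

fn main() {
    // "</asdadasd..." after <textarea> mismatches on the very first char:
    let mut m = EndTagMatcher::new(b"textarea");
    assert!(!m.push(b'a'));

    // "</textarea>" matches in full:
    let mut m2 = EndTagMatcher::new(b"textarea");
    for &c in b"textarea" {
        assert!(m2.push(c));
    }
    assert!(m2.is_appropriate());

    println!("ok");
}
```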

(Aside: many other buffers in the emitter can also grow unbounded given pathological input. This seems to be the case in html5ever and many other parsers too. Go's HTML tokenizer seems to expose an opt-in mitigation.)
