Optimize the "temporary buffer" #112

Open
untitaker opened this issue Nov 6, 2024 · 0 comments
untitaker commented Nov 6, 2024

I tried to get rid of the temporary buffer today. It's an allocation that is done per tokenizer, and that bugs me because it's the only memory allocation that is not controllable by (or reusable for) the user. The attempt didn't work out, but there is some opportunity to make the buffer faster and/or use fewer allocations.

The naive approach would be to add a bunch of operations to the Emitter trait so that the temporary buffer moves as-is from the tokenizer/machine_helper into the emitter. However, this trait is already too complicated and has too many mandatory methods. I would also like to avoid adding more things to Emitter whose correctness can influence the correctness of the tokenizer itself.

Sometimes the temporary buffer is only used to compare its value against either a static string ("script") or against the most recently emitted start tag (the "appropriate end tag token" check). Those situations can be individually optimized:

  • For the purpose of pushing repeatedly into it, then comparing its value to "script", a dedicated state machine can be built instead of a buffer.
  • For the purpose of comparing to the most recently emitted start tag, I believe there's even potential to move state from the emitter back into the tokenizer, simplifying the trait.
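The first bullet can be sketched as follows. This is a minimal illustration, not html5gum's actual code: the ScriptMatcher name and its push/is_match API are invented here. The idea is that matching against a known static string only needs an index into that string plus a "failed" flag, so no buffer is allocated at all.

```rust
// Hypothetical sketch: match incoming characters against the static string
// "script" without buffering them. State is just an index plus a flag.
struct ScriptMatcher {
    pos: usize,
    failed: bool,
}

impl ScriptMatcher {
    const TARGET: &'static [u8] = b"script";

    fn new() -> Self {
        ScriptMatcher { pos: 0, failed: false }
    }

    /// Feed one character (assumed already lowercased by the tokenizer).
    fn push(&mut self, c: u8) {
        if self.failed || self.pos >= Self::TARGET.len() || Self::TARGET[self.pos] != c {
            self.failed = true;
        } else {
            self.pos += 1;
        }
    }

    /// True iff exactly "script" was fed, no more and no less.
    fn is_match(&self) -> bool {
        !self.failed && self.pos == Self::TARGET.len()
    }
}

fn main() {
    let mut m = ScriptMatcher::new();
    for &c in b"script" {
        m.push(c);
    }
    assert!(m.is_match());

    // A prefix alone is not a match, and neither is a longer string.
    let mut m2 = ScriptMatcher::new();
    for &c in b"scr" {
        m2.push(c);
    }
    assert!(!m2.is_match());

    println!("ok");
}
```

The state machine here is equivalent to pushing into the buffer and comparing at the end, but it uses constant memory and can report failure as soon as the first non-matching character arrives.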

In a few other cases, I believe the temporary buffer is unavoidable and must be allocated (character references).

There is also the open question of whether the temporary buffer should be able to grow unbounded. Right now it can, like this:

echo '<textarea></asdadasdasdasdasdasdasdasd' | cargo run --example tokenize_with_state_switches

(to observe this, patch all uses of temporary_buffer.extend/push to print the buffer's length)

This situation could be prevented with a buffer size limit. However, I don't think it's necessary to push into the buffer in this situation at all: the state machine should instead bail out early as soon as it sees that the end tag name does not start with "t" for "textarea".
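The early bail-out could look roughly like the following. Again a hypothetical sketch with invented names (EndTagMatcher, push, is_appropriate), not html5gum's API: the matcher compares each incoming character of a candidate end tag name against the name of the most recent start tag and gives up on the first mismatch, so pathological input like the `</asdad...` above never grows a buffer.

```rust
// Hypothetical sketch: incrementally match a candidate end tag name against
// the most recently emitted start tag name, bailing out on first mismatch.
struct EndTagMatcher<'a> {
    expected: &'a [u8], // e.g. b"textarea"; assumed already lowercased
    pos: usize,
    failed: bool,
}

impl<'a> EndTagMatcher<'a> {
    fn new(expected: &'a [u8]) -> Self {
        EndTagMatcher { expected, pos: 0, failed: false }
    }

    /// Feed one character of the candidate end tag name. Returns false as
    /// soon as the name can no longer match, letting the state machine
    /// abandon the comparison without buffering anything.
    fn push(&mut self, c: u8) -> bool {
        if self.failed || self.pos >= self.expected.len() || self.expected[self.pos] != c {
            self.failed = true;
        } else {
            self.pos += 1;
        }
        !self.failed
    }

    /// True iff the full expected name was fed.
    fn is_appropriate(&self) -> bool {
        !self.failed && self.pos == self.expected.len()
    }
}

fn main() {
    // "</asdadasd..." after <textarea> mismatches on the very first char:
    let mut m = EndTagMatcher::new(b"textarea");
    assert!(!m.push(b'a'));

    // "</textarea>" matches in full:
    let mut m2 = EndTagMatcher::new(b"textarea");
    for &c in b"textarea" {
        assert!(m2.push(c));
    }
    assert!(m2.is_appropriate());

    println!("ok");
}
```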

(Aside: many other buffers in the emitter can also grow unbounded given pathological input. This seems to be the case in html5ever and many other parsers too. Go's HTML tokenizer seems to expose an opt-in mitigation.)
