HTML API: avoid processing incomplete tokens #5725

dmsnell · 2023-12-01T20:34:35Z

Depends on #5721

Currently the Tag Processor assumes that an input document is a full HTML document. Because of this, if there's lingering content after the last tag match it will treat that content as plaintext and skip over it. This is fine for the Tag Processor because if there is lingering content that isn't a valid tag then there's nothing for next_tag() to match.

However, in order to support a number of feature expansions it is important to recognize that the remaining content may involve partial syntax elements, such as incomplete tags, attributes, or comments.

In this patch we're adding a mode inside the Tag Processor which will flip when we start parsing HTML syntax but the document finishes before the token does. This will provide the ability to:

extend the input document
avoid misinterpreting syntax as text
guess if we have a complete document, know if we have an incomplete document

In the process of building this patch a few fixes were identified and fixed in the Tag Processor, namely in the handling of incomplete syntax elements.

The HTML Tag Processor is able to know if it starts parsing a syntax element and reaches the end of the document before it reaches the end of the element. In these cases, after this patch, the processor will indicate this condition. For example, when processing `<div><input type="te` there is an incomplete INPUT element. The processor will fail to find the INPUT, it will pause right after the DIV, and `paused_at_incomplete_token()` will return `true`. This patch doesn't change any existing behaviors, but it adds the new method to report on the final failure condition. It provides a mechanism for later use to add chunked parsing to the class, wherein it will be possible to process a document without having the entire document loaded in memory, for example when processing unbuffered output. This is also a necessary change for adding the ability to scan every token in the document. Currently the Tag Processor only exposes tags as tokens, but it will need to process `#text` nodes, HTML comments, and other markup in order to enable behaviors in the HTML Processor and in refactors of existing HTML processing in Core.

Rebuild of 8905faf after WordPress#5725. Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com> Co-authored-by: Bernie Reiter <bernhard-reiter@git.wordpress.org>

sirreal · 2023-12-20T13:52:47Z

I ran the tests from #5794 against this branch and it skips fewer tests, but all of those tests seem to fail. This may be a limitation with #5794 and not this PR. All of the new failures seem to be with test input that add <script> or <title> to the <head>. Were those tags unsupported before this PR? That would explain the problem.

I pulled the test data and the test suite from that branch (no fixes) and ran the tests with these results:

./vendor/bin/phpunit -c tests/phpunit/tests/html-api/phpunit.xml --filter Html5lib

Trunk:

Tests: 1693, Assertions: 306, Errors: 30, Failures: 2, Skipped: 1357.

This branch:

Tests: 1693, Assertions: 423, Errors: 30, Failures: 117, Skipped: 1240.

sirreal · 2023-12-20T14:27:59Z

I think the problem is that the tag processor always assumes we're in body, but there are tests that add tags to head. So it's a limitation of the way we're using the tests now.

It's not clear to me why those tags were unsupported before this PR but are supported now… was that intentional?

dmsnell · 2023-12-20T15:44:18Z

@sirreal those tests need to be setup differently, because the HTML Processor currently only supports the <body> context element inside a fragment parser. that is, if it encounters <html> it's not the document HTML tag but a parse error and we're free to bail.

it's not that it assumes we're IN BODY but rather the default parameter to create_fragment() is <body> and any other context node would cause it to abort. we could setup those tests to handle that.

Rebuild of 8905faf after WordPress#5725. Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com> Co-authored-by: Bernie Reiter <bernhard-reiter@git.wordpress.org>

github-actions · 2023-12-20T16:09:26Z

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

dmsnell · 2023-12-20T16:58:06Z

With the merge of 1270194 everything is now passing or skipped in the html5lib test suite, thanks again @sirreal.

There were a couple of fixes I incorporated from your work there, and then I updated that branch with some specific test exclusions based on either a design question or a test that tests behaviors we don't currently support (but will probably do so once we support scanning all tokens).

The necessary change to the html5lib tests was to check for $p->paused_at_incomplete_token()

Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>

ockham

LGTM

Thank you for taking the time to walk me through these changes!
As always, the test coverage also really helps with confidence in these changes 🙂

ockham · 2023-12-20T17:51:12Z

Committed to Core in https://core.trac.wordpress.org/changeset/57211.

Updates from WordPress/wordpress-develop at 54a09a7ec99c59b8e640bd5aacebfdbf03bb02cc - WordPress/wordpress-develop#5535 Adds support for H1 - H6 elements in the HTML Processor. - WordPress/wordpress-develop#5725 Pause the Tag Processor when reaching the end of the document inside an incomplete syntax element. - Incorporates linting changes from "TODO:" to "todo" This patch adds the blanket exclusion from the HTML API compatability layer after they were removed and started blocking updates from Core. The PHP files in the compatability layer are merged and maintained in the Core repo and all changes or updates need to happen first in Core and then be brought over to Gutenberg as built files. From Gutenberg's perspective they are no different than NPM packages. Co-authored-by: Anton Vlasenko <43744263+anton-vlasenko@users.noreply.github.com> Co-authored-by: Bernie Reiter <96308+ockham@users.noreply.github.com>

Updates from WordPress/wordpress-develop at 54a09a7ec99c59b8e640bd5aacebfdbf03bb02cc - WordPress/wordpress-develop#5535 Adds support for H1 - H6 elements in the HTML Processor. - WordPress/wordpress-develop#5725 Pause the Tag Processor when reaching the end of the document inside an incomplete syntax element. - Incorporates linting changes from "TODO:" to "todo". - WordPress/wordpress-develop#5762 Handles the "all other tags" rules in IN BODY insertion mode. - WordPress/wordpress-develop#5539 Adds support for list elements in the HTML Processor. This patch adds the blanket exclusion from the HTML API compatability layer after they were removed and started blocking updates from Core. The PHP files in the compatability layer are merged and maintained in the Core repo and all changes or updates need to happen first in Core and then be brought over to Gutenberg as built files. From Gutenberg's perspective they are no different than NPM packages. Co-authored-by: Anton Vlasenko <43744263+anton-vlasenko@users.noreply.github.com> Co-authored-by: Bernie Reiter <96308+ockham@users.noreply.github.com>

dmsnell marked this pull request as ready for review December 7, 2023 12:05

dmsnell mentioned this pull request Dec 8, 2023

HTML API: Provide mechanism to scan all tokens in an HTML document, not only the tags. #5683

Closed

15 tasks

dmsnell force-pushed the html-api/avoid-processing-incomplete-tokens branch 2 times, most recently from 603db00 to 2393b92 Compare December 14, 2023 18:28

dmsnell force-pushed the html-api/avoid-processing-incomplete-tokens branch from 8056376 to 35502ad Compare December 16, 2023 19:31

dmsnell changed the title ~~WIP: HTML API: avoid processing incomplete tokens~~ HTML API: avoid processing incomplete tokens Dec 18, 2023

Add ticket number

c3166be

dmsnell mentioned this pull request Dec 20, 2023

HTML Api: Fix out of bounds string access #5793

Closed

HTML API: Prevent out-of-bounds string access.

9e7167a

Rebuild of 8905faf after WordPress#5725. Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com> Co-authored-by: Bernie Reiter <bernhard-reiter@git.wordpress.org>

dmsnell force-pushed the html-api/avoid-processing-incomplete-tokens branch from da5f9d9 to 9e7167a Compare December 20, 2023 16:46

Fix bug found from HTML5Tests for incomplete SCRIPT tag closer

8488317

Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>

dmsnell force-pushed the html-api/avoid-processing-incomplete-tokens branch from 1270194 to 8488317 Compare December 20, 2023 17:03

dmsnell mentioned this pull request Dec 20, 2023

HTML API: Backport updates from Core WordPress/gutenberg#57022

Merged

ockham approved these changes Dec 20, 2023

View reviewed changes

ockham closed this Dec 20, 2023

dmsnell deleted the html-api/avoid-processing-incomplete-tokens branch December 20, 2023 19:49

dmsnell mentioned this pull request Feb 2, 2024

HTML API: Reset parser state after seeking to bookmark. #6021

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HTML API: avoid processing incomplete tokens #5725

HTML API: avoid processing incomplete tokens #5725

Uh oh!

dmsnell commented Dec 1, 2023 •

edited

Loading

Uh oh!

sirreal commented Dec 20, 2023 •

edited

Loading

Uh oh!

sirreal commented Dec 20, 2023

Uh oh!

dmsnell commented Dec 20, 2023

Uh oh!

github-actions bot commented Dec 20, 2023

Uh oh!

dmsnell commented Dec 20, 2023

Uh oh!

ockham left a comment

Uh oh!

ockham commented Dec 20, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HTML API: avoid processing incomplete tokens #5725

HTML API: avoid processing incomplete tokens #5725

Uh oh!

Conversation

dmsnell commented Dec 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sirreal commented Dec 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sirreal commented Dec 20, 2023

Uh oh!

dmsnell commented Dec 20, 2023

Uh oh!

github-actions bot commented Dec 20, 2023

Test using WordPress Playground

Some things to be aware of

Uh oh!

dmsnell commented Dec 20, 2023

Uh oh!

ockham left a comment

Choose a reason for hiding this comment

Uh oh!

ockham commented Dec 20, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dmsnell commented Dec 1, 2023 •

edited

Loading

sirreal commented Dec 20, 2023 •

edited

Loading