HTML API: Avoid processing incomplete syntax elements.
The HTML Tag Processor is able to know if it starts parsing a syntax element
and reaches the end of the document before it reaches the end of the element.
In these cases, after this patch, the processor will indicate this condition.

For example, when processing `<div><input type="te` there is an incomplete INPUT
element. The processor will fail to find the INPUT: it will pause right after
the DIV, and `paused_at_incomplete_token()` will return `true`.
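
The behavior described above can be illustrated with a toy scanner. This is only a sketch of the idea, not the Tag Processor's implementation: `toy_scan_tags()` and its return shape are hypothetical names invented for this example, and the scanner assumes that no `>` appears inside attribute values, comments, or script content.

```php
<?php
/**
 * Toy scanner (hypothetical, not WordPress code).
 *
 * Collects the names of complete opening tags and reports whether
 * the input ended in the middle of a tag.
 */
function toy_scan_tags( string $html ): array {
	$tags   = array();
	$at     = 0;
	$length = strlen( $html );

	while ( $at < $length ) {
		$open = strpos( $html, '<', $at );
		if ( false === $open ) {
			break; // Only text remains; nothing is incomplete.
		}

		$close = strpos( $html, '>', $open );
		if ( false === $close ) {
			// The document ended before the tag did: stop after the
			// last complete tag and flag the condition.
			return array( 'tags' => $tags, 'incomplete' => true );
		}

		// Record opening tags only; closers and other markup are skipped.
		$token = substr( $html, $open, $close - $open + 1 );
		if ( preg_match( '~^<([a-zA-Z][^\s/>]*)~', $token, $match ) ) {
			$tags[] = strtoupper( $match[1] );
		}

		$at = $close + 1;
	}

	return array( 'tags' => $tags, 'incomplete' => false );
}

$result = toy_scan_tags( '<div><input type="te' );
// $result['tags'] holds only the DIV, and $result['incomplete'] is true.
```

With the real class, the equivalent check would be to call `paused_at_incomplete_token()` after `next_tag()` returns `false`.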

This patch doesn't change any existing behaviors, but it adds the new method
to report on the final failure condition. It provides a mechanism for later
use to add chunked parsing to the class, wherein it will be possible to process
a document without having the entire document loaded in memory, for example
when processing unbuffered output.

This is also a necessary change for adding the ability to scan every token in
the document. Currently the Tag Processor only exposes tags as tokens, but it
will need to process `#text` nodes, HTML comments, and other markup in order
to enable behaviors in the HTML Processor and in refactors of existing HTML
processing in Core.

Co-authored-by: David Herrera <mail@dlh01.info>
Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>
3 people committed Jan 24, 2024
1 parent 6daf853 commit faf9cef
Showing 5 changed files with 1,578 additions and 145 deletions.
39 changes: 27 additions & 12 deletions src/wp-includes/html-api/class-wp-html-processor.php
@@ -149,17 +149,6 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
*/
const MAX_BOOKMARKS = 100;

- /**
- * Static query for instructing the Tag Processor to visit every token.
- *
- * @access private
- *
- * @since 6.4.0
- *
- * @var array
- */
- const VISIT_EVERYTHING = array( 'tag_closers' => 'visit' );
-
/**
* Holds the working state of the parser, including the stack of
* open elements and the stack of active formatting elements.
@@ -424,6 +413,30 @@ public function next_tag( $query = null ) {
return false;
}

+ /**
+ * Ensures internal accounting is maintained for HTML semantic rules while
+ * the underlying Tag Processor class is seeking to a bookmark.
+ *
+ * This doesn't currently have a way to represent non-tags and doesn't process
+ * semantic rules for text nodes. For access to the raw tokens consider using
+ * WP_HTML_Tag_Processor instead.
+ *
+ * @since 6.5.0 Added for internal support; do not use.
+ *
+ * @access private
+ *
+ * @return bool
+ */
+ public function next_token() {
+ $found_a_token = parent::next_token();
+
+ if ( '#tag' === $this->get_token_type() ) {
+ $this->step( self::REPROCESS_CURRENT_NODE );
+ }
+
+ return $found_a_token;
+ }
+
/**
* Indicates if the currently-matched tag matches the given breadcrumbs.
*
@@ -520,7 +533,9 @@ public function step( $node_to_process = self::PROCESS_NEXT_NODE ) {
$this->state->stack_of_open_elements->pop();
}

- parent::next_tag( self::VISIT_EVERYTHING );
+ while ( parent::next_token() && '#tag' !== $this->get_token_type() ) {
+ continue;
+ }
}

// Finish stepping when there are no more tokens in the document.