Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Html api/add html processor #6

Draft
wants to merge 55 commits into
base: trunk
Choose a base branch
from
Draft

Conversation

dmsnell
Copy link
Owner

@dmsnell dmsnell commented Mar 10, 2023

Trac ticket:


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
$tag_name = $p->get_tag();

if ( ! $p->is_tag_closer() ) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are void elements closers?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No they're not, that's why this code branch works

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as you probably realized, tag closers only exist when the tag name is preceded by a /, which is distinct from the self-closing flag which appears right before the > closing the tag opener

$element = WP_HTML_Spec::element_info( end( $this->open_elements ) );
// @TODO: Handle self-closing HTML foreign elements: must convey self-closing flag on stack.
if ( $element::is_void ) {
array_pop( $this->open_elements );

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it say if ( ! element::is_void )?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or not? I don't get why void elements have any interaction with the open elements stack

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, why are void elements landing on the open elements stack, though?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is something to document later and it appeared towards the end of the day. I think I've seen this in the HTML spec for tree-building.

the problem is that we have to push the void element to the open stack in order to get proper depth information and climb out of the nesting we're in. this looks funny here because we aren't currently tracking if the last element on the stack requires no closer (is void or self-closing foreign element). that means whenever we approach the next tag we have to immediately remove any void elements because they cannot enclose over other tags.

it's definitely unexpected until you reason through it, or as I did, get reminded by it with failing tests.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the spec, popping should happen in the same step as the insertion:

A start tag whose tag name is one of: "area", "br", "embed", "img", "keygen", "wbr"
Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

Set the frameset-ok flag to "not ok".


$starting_depth = count( $this->open_elements );

while ( $this->next_tag() ) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should bale out if between the $starting_depth and the child we see anything with $current_depth <= $starting_depth as that would imply the initial tag was closed

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the ensure_support() function that runs I don't think it's possible for this to happen, but I think I can defer some of the support function. I've added a test to make sure, though I haven't yet added any non-HTML elements in the tests.

$current_depth = count( $this->open_elements );

if ( $this->is_tag_closer() && $current_depth < $starting_depth ) {
return true;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here for the depth check I mentioned above

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is all garbage that will go away


$stack = array();

$p = new WP_HTML_Tag_Processor( $this->html );

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart! I kind of don't like that it throws away the progress we have made so far, though

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean? my vision for this is that it can grow with our added support.

as for running a first-pass through the document I'm not claiming this is certainly needed. I'm sure it's possible to work it in alongside the normal traversal.

for now though it helped quite a bit, and I figured the cost is meager given that it's a read-only scan.

@adamziel
Copy link

@luisherranz just pointed out that we may not need Bookmark "end" value. We don't seem to use it anywhere, and it generates problems with outer HTML logic. I like that. What do you think @dmsnell


$self_closes = $element::is_void || ( ! $element::is_html && $this->has_self_closing_flag() );
if ( $self_closes ) {
$this->open_elements[] = $tag_name;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this part, it's not really an open element. Even in the parsing spec these are only added to the stack for brevity of the document – they're always immediately popped in the same step and before ingesting the next token

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's essential: see above. may not be able to articulate it well yet, but I think one case we might examine is finding the next_sibling() of an <img>. if we don't push void elements onto the stack we'll be looking for a tag that's a parent of the img instead of a sibling.

I think this will matter also for CSS selection. we will need to know the full "tree path" (a term I'm making up here without intending to convey specific information about it) in order to match our current position against the selector.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw it's not what the spec says, is it? Depending on the insertion mode and element name, we may or may not have to push it to the stack of open elements. There may be other operations to do before pushing, too, and the order matters.

return $this->fully_supported_input;
}

public function next_tag( $query = null ) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we could make it private in the descendant class. It doesn't even make sense to have it public, it can't support the same set of queries as the parent, and it's a pain to deal with PHP's idea of polymorphism where the parent class will call the child implementation of this method anyway.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

funny enough I added _doing_it_wrong() here before realizing that I call it from within this tag processor.

is it the tag processor that's calling this?

the one thing I wish we could do is remove $query but it wouldn't let me. still some thinking about this one to do.

}
$this->set_bookmark( 'end' );

$start = $this->bookmarks['start']->end + 1;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's one usage of the end bookmark property, but we could do without it by just storing this in a variable a few lines above. (I'm referring to the idea of ditching the end property)

@adamziel
Copy link

adamziel commented Mar 10, 2023

Oh this is great @dmsnell! A few more notes:

seek now needs to rewind the open_elements stack as well as the parsing cursor. Worse, you need to do it before seek calls next_tag – at one point I ended up just copying and pasting the entire seek body from the parent class.

next_tag is public, which I don't like – this class reasons about nodes and exposing a lower-level abstraction seems like a footgun. I don't know how to deal with it, though. We can't just return false and call parent::next_tag because that would break all $this->next_tag() calls in the Tag Processor class. There may be only one right now, but I don't like creating an implicit constraint on introducing more such calls.

$end = $this->bookmarks['end']->end + 1;
$this->lexical_updates[] = new WP_HTML_Text_Replacement( $start, $end, $new_html );
$this->get_updated_html();
$this->bookmarks['start']->start = $start;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done the same in one of my attempts! Why is this one not getting released in get_updated_html, though? I remember I had to create a new WP_HTML_Text_Span and re-insert it into $this->bookmarks

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not quite sure why it isn't released

@adamziel
Copy link

adamziel commented Mar 10, 2023

Oh, one other thing – be sure to test set_outer_content when there are some lexical updates already scheduled. I remember having to flush out everything at the beginning of that method, or else things got messy. It could be caused by my attempt to insert tag closers, though.

@dmsnell
Copy link
Owner Author

dmsnell commented Mar 21, 2023

@luisherranz just pointed out that we may not need Bookmark "end" value. We don't seem to use it anywhere, and it generates problems with outer HTML logic. I like that. What do you think @dmsnell

I would love to hear more about this, how it creates problems. it seems inherently essential for separating inside vs. outside of tag tokens, so I'm confused how it isn't necessary.

@dmsnell
Copy link
Owner Author

dmsnell commented Mar 22, 2023

seek now needs to rewind the open_elements stack as well as the parsing cursor. Worse, you need to do it before seek calls next_tag – at one point I ended up just copying and pasting the entire seek body from the parent class.

stewing on this I came up with an idea and then realized it's probably the same as your pinned bookmarks 🙃

private function enter_element( $element ) {
$this->depth++;

parent::set_bookmark( "{$this->depth}_{$element}" );
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, why set a bookmark for each element?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a stub function for the stack of open elements. I think I'm going to rip it out and use a proper stack, but we have to have a bookmark because if we make edits we have to know that the stack might have changed.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need a bookmark for the current element, though, right? Not for all the elements seen so far, nor for all parents.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the stack of open elements might actually shift and some may disappear. in any case, if we seek() we have to reset them, so we need a good way to reconstruct the stack when seeking.

in your PR I think you're doing that by storied copies of all the open elements for every bookmark, but also we have to make sure those are bookmarks because we expect changes to be applied to the document which could change things.

the main case that matters I think is wiping out a bookmark, but inserting nodes will shift them all.

<div><p>inside</p></div>
     ^^^ bookmark

unwrap

<p>inside</p>
^^^ stack of open elements here is no longer DIV > P but rather only P

@dmsnell dmsnell force-pushed the html-api/add-html-processor branch from d2791d0 to 0e3ada5 Compare May 17, 2023 17:03
default:
return self::NOT_IMPLEMENTED_YET;
}
} catch ( Exception $e ) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching all exceptions might hide legitimate errors, it would be safer to have a HTML_Exception class or so

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. at some point I wanted to go back and special-case known exceptions so we don't hide others, but I didn't do that yet because I wasn't entirely sure how I felt about the escaping mechanism as throwing Exceptions.

parent::__construct( $html );

$this->tag_openers = new WP_HTML_Element_Stack();
$this->tag_closers = new WP_HTML_Element_Stack();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would you push to this stack?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still working on an answer for this 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants