-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML API: Add matches_breadcrumbs()
to HTML Processor for better querying
#5243
Conversation
ab6952c
to
48b631d
Compare
* At some point this function _may_ support a `**` syntax for matching any number | ||
* of unspecified tags in the breadcrumb stack. This has been intentionally left | ||
* out, however, to keep this function simple and to avoid introducing backtracking, | ||
* which could open up surprising performance breakdowns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So let's say you did want to implement **
behavior but limited to 5 levels, where img
is, say, inside of main
. To do that, you currently would have to do:
for ( $asterisk_count = 0; $asterisk_count <= 5; $asterisk_count++ ) {
if (
$processor->matches_breadcrumbs(
array(
'main',
...array_fill( 0, $asterisk_count, '*' )
'img'
)
)
) {
return true;
}
}
return false;
Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ha yeah this would work and accurately conveys the backtracking problem, though I believe that there are some more performant backtracking mechanisms that don't have to re-process as much as this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's still possible that I'm way overthinking this, confusing it with a different kind of match, where it would matter which of the inner or outer elements are matching.
for a pattern like main, **, div, **, img
though I want to make sure this doesn't break, or maybe it's better to compare that against main, *, div, **, img
for the first one, with two **
groups, I read it as "walk up the stack of open elements; if there is a DIV somewhere and also a MAIN somewhere then this matches."
for the second one though we have a tighter constraint above the simpler one. if we have this breadcrumb HTML > BODY > MAIN > SECTION > DIV > DIV > IMG
then we have to make sure that the **
group only matches the last DIV, otherwise main, *, div
will fail to match because we were too greedy.
This is where I think we must start backtracking: after that higher DIV fails the rest of the match. That's a very recursive problem too. If we start by being un-greedy though, we stop at the lower DIV and it fails the rest of the match, so we have to track that **
group and backtrack on the pattern vs. on the content. We have to remember each double wildcard group and go back to reattempt matches after adjusting our position in the stack.
The probably probably starts if *
allowed for a zero match, but when I tossed it in there it didn't seem to have any connotation of that since this is like a filesystem wildcard where those expectations are that *
in a path is something and not nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking all this through pushes me to leave this for userspace for now. we expose get_breadcrumbs()
and so someone is implement this on their own until we find a good solution for it.
function preg_matches_breadcrumbs( string $pattern, array $breadcrumbs ) {
return false !== preg_match( $pattern, implode( ' > ', $breadcrumbs ) );
}
if ( preg_matches_breadcrumbs( '~ > MAIN > .* IMG$~', $processor->get_breadcrumbs() ) ) {
…
}
this format is surprisingly useful now that I hack it up
d2c276e
to
90a2787
Compare
|
||
if ( '*' !== $crumb && $this->get_tag() !== strtoupper( $crumb ) ) { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is bothering me because I don't currently understand why it's necessary. it seems like we should have the currently-matched tag on the stack of open elements, but we still need to check this. then, it seems like we'd be checking twice (in the first iteration of the loop below), but we aren't and all the tests are passing.
* May also contain the wildcard `*` which matches a single element, e.g. `array( 'SECTION', '*' )`. | ||
* @return bool Whether the currently-matched tag is found at the given nested structure. | ||
*/ | ||
public function matches_breadcrumbs( $breadcrumbs ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is out of scope for this PR, but could the $breadcrumbs
format be expanded beyond just a list of tag names to instead be a list of queries (or the like)? That is, it would be very useful to check not just that the tag name matches but more so that the element has a given ID or a class name. Otherwise, the issue is that tag name alone may not be specific enough to be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, let's say I have the HTML for the entire page and I need to add fetchpriority=high
to the featured image. In Chrome DevTools, the featured image has the selector #page > div.single-featured-image-header > img
. In tag name breadcrumbs, this is ['div', 'div', 'img']
which is very unspecific. It would be great to be able to pass in something like:
array(
array("id" => "page"),
array(
"tag" => "div",
"class" => "single-featured-image-header",
),
array("tag" => "img"),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's definitely out of scope but it's something I've been thinking about a lot recently. it extends to all sorts of challenges in querying and I've been exploring the use of bookmarks to accomplish these needs. one thing we probably don't want to do here is jump around in the document while jumping around, meaning either we track everything about every token (memory bloat) or find a way to only trap what we care about when.
the HTML Processor is using $current_token
to hold not only a tag name, but also a full reference to the location in the document for where something is. this is actually required to properly run the parsing, as certain operations imply closing tags of not only a given name, but a specific tag.
the idea would be something like this, where we still leave this up to userspace so that the performance of the processor remains more predictable, and additional processing/memory can be exposed explicitly in consuming code.
while ( $processor->next_tag() ) {
if (
'DIV' === $processor->get_tag() &&
$processor->has_class( 'single-featured-image-header' )
) {
$processor->set_bookmark( 'wrapper' );
continue;
}
if (
'IMG' === $processor->get_tag() &&
$processor->is_bookmark_open( 'wrapper' )
) {
$processor->set_attribute( 'fetchpriority', 'high' );
}
and I believe that we could create a variety of relationship functions like this: is_child_of( $bookmark_name )
, is_descendant_of( $bookmark_name )
, is_sibling_of( $bookmark_name )
.
internally there's a callback for when popping an element off of the stack of open elements, and I have been thinking of exposing that to userspace. I think it should be safe, but I have questions about how best to represent that. this is what led me to the is_
functions with bookmarks.
…erying Inside a `next_tag()` loop it can be challenging to use breadcrumbs because they are only exposed inside the call to `next_tag()` via the `$query` arg. In this patch a new method, `matches_breadcrumbs()` is exposed which allows for querying within the `next_tag()` loop for more complicated queries. This method exposes a wildcard `*` operator to allow matching _any HTML tag_ that the currently-matched tag is a child or descendant of.
- add wildcard to docblock description of breadcrumbs parameters - revert to scanning breadcrumb array with end/current/prev to handle sparse arrays Props: @westonruter
65759fc
to
cd79cdd
Compare
Rebased. |
Committed to Core in https://core.trac.wordpress.org/changeset/56702. |
Trac ticket: #59400
matches_breadcrumbs()
method gutenberg#54629Inside a
next_tag()
loop it can be challenging to use breadcrumbs because they are only exposed inside the call tonext_tag()
via the$query
arg.In this patch a new method,
matches_breadcrumbs()
is exposed which allows for querying within thenext_tag()
loop for more complicated queries.This method exposes a wildcard
*
operator to allow matching any HTML tag that the currently-matched tag is a child or descendant of.