-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tag Processor: Add bookmark system for tracking semantic locations in document #46018
Conversation
Open in CodeSandbox Web Editor | VS Code | VS Code Insiders |
@dmsnell I drafted the rewind() method in e3eddb6 – it's very rough but seems to do the job. This draft of We'll probably go a long way from here, but it's a peg in the ground. It's good enough for the following test case to pass: public function test_bookmark()
{
$p = new WP_HTML_Tag_Processor('<ul><li>One</li><li>Two</li><li>Three</li></ul>');
$p->next_tag('li');
$p->set_bookmark('first li');
$p->next_tag('li');
$p->set_bookmark('second li');
$p->set_attribute('foo-2', 'bar-2');
$p->rewind('first li');
$p->set_attribute('foo-1', 'bar-1');
$p->rewind('second li');
$p->next_tag('li');
$p->set_attribute('foo-3', 'bar-3');
$this->assertEquals(
'<ul><li foo-1="bar-1">One</li><li foo-2="bar-2">Two</li><li foo-3="bar-3">Three</li></ul>',
$p->get_updated_html()
);
} Oh and I apologize for the formatting mess – my VisualStudio was too eager to use different formatting strategy. Also, I renamed the test file only temporarily to get these tests to run without Docker. Let's rollback my dev changes before we merge this PR. |
8b73d46
to
b05e975
Compare
e3d7177
to
152556b
Compare
$this->max_seek_calls = $new_limit; | ||
} | ||
return $this->max_seek_calls; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given an optional value I would expect this function to reset the limit to the original default value.
return $this->max_seek_calls = is_int( $new_limit ) ? $new_limit : self::MAX_SEEK_OPERATIONS;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default argument seems to be confusing, how about we do set_max_seek_calls( $limit )
and get_max_seek_calls()
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a problem simply leaving it exposed as a public property? no need for get_
and set_
functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A public property sounds great
* @param int $new_limit Optional. The maximum allowed number of seek() calls. | ||
* @return number The allowed number of seek() calls. | ||
*/ | ||
public function seek_limit( $new_limit = null ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name here reads to me like this is another seek
operation. maybe we can think of something that more clearly sets a limit, such as set_max_seek_count()
, or skip the function and let people modify the public $max_seek_calls
directly.
initially my thought was to add something more systematic, like a define
, to prevent people from willy-nilly setting this to some arbitrarily-large value. that is, say you have a plugin that wants to seek all over; you'd need to modify the system-wide behavior.
maybe that's not great either because once it's up you can't bring it back down, but then again I don't expect any normal circumstances where we hit the limit; this would be specifically to break out of unbounded loops or cycles.
thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not have a system-wide define – as you said, the plugin could change it, too, and then do not restore it later. +1 for a public property.
* | ||
* @since 6.2.0 | ||
*/ | ||
private function apply_html_updates_and_stay_on_current_tag() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather emphasise the "replace $this->html
with $this->updated_html
part" rather than the stay_on_current_tag
part. Any ideas for a better name?
* | ||
* @since 6.2.0 | ||
*/ | ||
private function apply_html_updates_and_stay_on_current_tag() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some questions about the need and use of this function.
can you share why we had to change it to what it is now?
I'm maybe having trouble seeing through the diff to what it's doing and why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My largest goal was to avoid calling get_updated_html()
in seek()
as that name does not communicate how we're applying the updates and then lifting the parser to operate on the updated HTML.
As for the logic itself – we can keep it as it was, I just looked at it and thought it could do the same thing in less steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that name does not communicate how we're applying the updates and then lifting the parser to operate on the updated HTML.
this doesn't seem problematic to me. that we have to update the internal state when seeking isn't something that's exposed to the calling code, and it's irrelevant. it's an implementation detail and we could change it.
to me seek()
only implies one thing: resetting the pointer to a different location, which is what it does. internally I'm okay with having this code acknowledge that in the particular approach we've taken to accomplish that goal we have to flush all the current known updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I meant the „get_updated_html” name, not the „seek” name. It quacks like a getter, but it isn’t one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not following. how is get_updated_html()
not like a getter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name suggests it simply returns a value, but in reality it mutates the object. That's typically a domain of functions with names like set_*
, do_*
, update_*
, mutate_*
etc. I've been battling this race condition for far too long and now I'd rather avoid using names that don't accurately describe what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really in agreement here; from the outside it doesn't matter because we don't expose that internal accounting. get_updated_html()
is idempotent and safe to call. If it isn't then that's a bug, and we haven't exposed a way to extract those internals so as to work around the system.
Do you think it's possible this is a reactive response to a different issue? Seems like making a database write via HTTP call is way different than simply flushing out queued updates on an internal structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I guess you’re right, let’s just keep it as it was then.
c5f3cc0
to
bc7221f
Compare
Rebased (to fix unit test errors, see #46093). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I don't claim to have the same level of expertise w.r.t. WP_HTML_Tag_Processor
, but the new code in this PR makes sense, and the unit test coverage illustrates it nicely (and gives me some more confidence approving it).
I had a question/suggestion w.r.t. the overall interface, but I'll ask that in a separate comment -- it's certainly non-blocking 😄
So my overall understanding of This PR will then add a collection of bookmarks and related methods to set, release, and seek them. Internally, they happen to be stored similarly to the changesets (using In terms of interface: Would it be possible and make sense to consider using separate instances of $p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li><li>Three</li></ul>' );
$p->next_tag( 'li' );
$p->set_bookmark( 'first li' );
$p->next_tag( 'li' );
$p->set_attribute( 'foo-2', 'bar-2' );
$p->seek( 'first li' );
$p->set_attribute( 'foo-1', 'bar-1' );
$this->assertEquals(
'<ul><li foo-1="bar-1">One</li><li foo-2="bar-2">Two</li><li>Three</li></ul>',
$p->get_updated_html()
); something like $p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li><li>Three</li></ul>' );
$first_li = $p->next_tag( 'li' );
$p->next_tag( 'li' )->set_attribute( 'foo-2', 'bar-2' );
$first_li->set_attribute( 'foo-1', 'bar-1' );
$this->assertEquals(
'<ul><li foo-1="bar-1">One</li><li foo-2="bar-2">Two</li><li>Three</li></ul>',
$p->get_updated_html()
); The main rationale being that I think the mental model is a bit easier if each We'd probably have to be a bit clever about referencing the same Or we might consider decoupling the "iterators" a bit more from the blob and changesets? $p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li><li>Three</li></ul>' );
$first_li = $p->first_tag( 'li' ); // first_tag() Returns a WP_HTML_Bookmark object.
$p->next_tag( $first_li, 'li' )->set_attribute( 'foo-2', 'bar-2' ); // next_tag() now takes two arguments; it also returns a WP_HTML_Bookmark object.
$p->set_attribute( $first_li, 'foo-1', 'bar-1' );
$this->assertEquals(
'<ul><li foo-1="bar-1">One</li><li foo-2="bar-2">Two</li><li>Three</li></ul>',
$p->get_updated_html()
); |
Thanks @ockham for this question. I don't think this is going to make sense here, for a couple reasons. As I think you were referencing, if we automatically assign a bookmark on We've already seen one bug where someone tried to
The potential proliferation of bookmarks is a major issue I think because it's so easy to accidentally fall into, and that's especially true for large documents, which could be something as benign as a long paragraph block. Do you have any ideas for reference counting that we could implement? So far I've not been able to think of any way it's possible given that we don't have access to usage information in application code. Obviously when the Tag Processor is done or when bookmarks invalidate we know we can release them, but we don't know if someone is holding on to a bookmark or has forgotten about it. I can definitely see where When we were in person we also discussed a bookmark class. I think at the time we were discussing holding index information in that class and moved away from it. However, I wonder if we could hold the resource id for the bookmark instead and rely on a class destructor to release the bookmark. This would make it more possible to do something like you're suggesting, though I'm still leery of it. Also the named parameter in the bookmark is something I added because I thought it would be more capable of pushing people to be lean in how many of these they create. if ( $p->next_tag( 'li' ) ) {
// tracks a copy of `$p` internally
$first_li = $p->bookmark();
}
…
// may fail if the bookmark has been deleted
$first_li->set_attribute( … );
…
// reassign to a new location?
$first_li->update(); though at this point I'm still questioning if the complexity is worth it. exposing these object links vs. string names with manual creation and releasing gives us more ways to confuse object lifetime and instance. rambling here so please feel welcome to engage with what I'm sharing. |
+1 to what @dmsnell says. While it would be technically possible to introduce an API like you proposed @ockham, it would also add a lot of complexity to the framework and, potentially, make some use-cases more complex for the user. Like for example what should happen when I update |
6026b5d
to
5fcf00d
Compare
Granted, there's some degree of subjectivity, but I think I find it easier to reason about bookmarks/pointers if they're represented by a class of their own, vs. by methods of another class (that already has another "pointer"-like thing). Separation of concerns as elaborated further below in my comment -- my rule-of-thumb would be something like: If I add a new method that does something different, or if I discover that the implementation of some internal state is flawed -- can I contain the changes I need to make to a subset of existing methods, or will I need to modify pretty much everything? E.g. encapsulating the read-only logic in a basic class, and adding the change tracking/"pointer arithmetics" in another layer (i.e. derived class).
Ah, thank you for pointing that out 👍
👍
Ah, thanks for enlightening -- I was still assuming we'd go with the "unsafe" class, but if we can avoid that, all the better!
For context, my thinking here is shaped by C++-style iterators. If memory serves (and we're talking ~15 years ago), they can be "output" or "input" (i.e. a dereferenced At the risk of seeing nails everywhere: I thought that a model like this could carry over to what we're dealing with. We could have different Tag Processor classes that offer different features and guarantees, at the price of more internal complexity (and potentially less performance). As expressed earlier, I was largely seeing three "layers" for the time being:
My hunch was that the logic that'd be needed to implement each layer's additional features would be fairly self-contained.
I don't currently have any features that I want but aren't there yet! If anything, I wanted to bring this up in case we end up running into any issues with the current implementation as we add more features. Again, I think it's perfectly fine to proceed with the current direction. I'll soon start experimenting with the Tag Processor for the Block Interactivity stuff -- this might inform my opinion much better than my admittedly academic thinking right now 😄
Fair enough 😅 I'll rephrase: We wouldn't even need the code that would do the bookkeeping in a class like that 😬 |
I'm looking for objective measures:
to this end I'm very skeptical of a bookmark class, particularly because I'm still unclear after reading your response what benefit they offer.
right now a bookmark is a name and leaves no residue in the calling code. you are suggesting we create a new class to couple into the tag processor that we have to track in our class, and on which we want to call methods that affect the internal state of the tag processor. what is easier about this? of note: the only viable mechanism for managing bookmarks IMO is within the tag processor, because that's exactly where all the logic for manipulating the input document and its string indexes occur. if we export this to another class we're going to need to add a lot of weak coupling and remember in all the places to account for those external bookmarks, plus we introduce the bigger question of what happens when bookmarks live on beyond the tag processor. today none of the hassle or accounting of bookmarks is an issue because you can't have bookmarks outlive a tag processor or get out of sync with one.
I'm still exploring the unsafe class, but given that I think we can eventually support spec-compliant HTML parsing I think Bookmarks themselves are fully safe from a syntax perspective, even if their use can lead to broken semantics (e.g. removing a section of the input HTML that removes a tag opener but not its closer).
If we don't have a compelling use-case to lead us, and we don't know what code we want to write, then I'm going to strongly advise that we lean on what we already know and understand before adding new facets to a currently-tiny API surface area.
It could be possible you're under-estimating how separable these operations are from each other, or over-estimating how unifiable they are. Early on we discovered that if we collapse the semantic operations into a low-level list of text replacements that a lot of complicated logic actually just falls apart into a relatively simple flow (queue text span replacements, flush them out, continue processing). Doing semantic HTML operations fits into the same bucket by queuing safe document changes within the tag processor. As long as the method we expose maintains the HTML semantics, there's no more risk in removing a node or replacing inner content than there is replacing an attribute, as long as we keep the logic inside the tag processor. |
A defect introduced in #46018 led to the tag processor backing up one index too far after flushing its queued changes on a document. For most operations this didn't cause any harm because when immediately moving forward after an update, the `next_tag()` returned to the same spot: it was backing up to one position before the current tag instead of at the start of the current tag. Unfortunately, when the current tag was the first in the document this would lead the processor to rewind to position `-1`, right before the start of the document, and lead to errors with `strpos()` when it received out-of-bounds indices. In this fix we're correcting the adjustment for the HTML tag's `<` and documenting the math in the file so that it's clearer why it's there and providing guidance should another fix be necessary. Props to @anton-vlasenko for finding this bug.
A defect introduced in #46018 led to the tag processor backing up one index too far after flushing its queued changes on a document. For most operations this didn't cause any harm because when immediately moving forward after an update, the `next_tag()` returned to the same spot: it was backing up to one position before the current tag instead of at the start of the current tag. Unfortunately, when the current tag was the first in the document this would lead the processor to rewind to position `-1`, right before the start of the document, and lead to errors with `strpos()` when it received out-of-bounds indices. In this fix we're correcting the adjustment for the HTML tag's `<` and documenting the math in the file so that it's clearer why it's there and providing guidance should another fix be necessary. Props to @anton-vlasenko for finding this bug.
A defect introduced in #46018 led to the tag processor backing up one index too far after flushing its queued changes on a document. For most operations this didn't cause any harm because when immediately moving forward after an update, the `next_tag()` returned to the same spot: it was backing up to one position before the current tag instead of at the start of the current tag. Unfortunately, when the current tag was the first in the document this would lead the processor to rewind to position `-1`, right before the start of the document, and lead to errors with `strpos()` when it received out-of-bounds indices. In this fix we're correcting the adjustment for the HTML tag's `<` and documenting the math in the file so that it's clearer why it's there and providing guidance should another fix be necessary. Props to @anton-vlasenko for finding this bug.
…46598) A defect introduced in #46018 led to the tag processor backing up one index too far after flushing its queued changes on a document. For most operations this didn't cause any harm because when immediately moving forward after an update, the `next_tag()` returned to the same spot: it was backing up to one position before the current tag instead of at the start of the current tag. Unfortunately, when the current tag was the first in the document this would lead the processor to rewind to position `-1`, right before the start of the document, and lead to errors with `strpos()` when it received out-of-bounds indices. In this fix we're correcting the adjustment for the HTML tag's `<` and documenting the math in the file so that it's clearer why it's there and providing guidance should another fix be necessary. As supporting work to this patch we're making the text replacement sort stable, inside the tag processor, for when determining the order in which to apply text replacements. This isn't necessary for the runtime but is a nuissance for testing because different PHP versions produce different unstable sort orderings and this prevents that from causing the unit tests to fail in one version but pass in another. Props to @anton-vlasenko for finding this bug. Enforce sort stability when flushing out text replacements
…46598) A defect introduced in #46018 led to the tag processor backing up one index too far after flushing its queued changes on a document. For most operations this didn't cause any harm because when immediately moving forward after an update, the `next_tag()` returned to the same spot: it was backing up to one position before the current tag instead of at the start of the current tag. Unfortunately, when the current tag was the first in the document this would lead the processor to rewind to position `-1`, right before the start of the document, and lead to errors with `strpos()` when it received out-of-bounds indices. In this fix we're correcting the adjustment for the HTML tag's `<` and documenting the math in the file so that it's clearer why it's there and providing guidance should another fix be necessary. As supporting work to this patch we're making the text replacement sort stable, inside the tag processor, for when determining the order in which to apply text replacements. This isn't necessary for the runtime but is a nuissance for testing because different PHP versions produce different unstable sort orderings and this prevents that from causing the unit tests to fail in one version but pass in another. Props to @anton-vlasenko for finding this bug. Enforce sort stability when flushing out text replacements
What?
Introduces "bookmarks" to the
WP_HTML_Tag_Processor
which allow seeking to previously-scanned parts of the document, if they still exist, while maintaining the integrity of the HTML syntax.Why?
It can be helpful to track a location in an HTML document while updates are being made to it such that we can instruct the Tag Processor to seek to the location of one of the bookmarks.
In this patch we're introducing a bookmarks system to do just that. Bookmarks are referenced by name and handled internally by a tracking object which will follow all updates made to the document. It will be possible to rewind or jump around a document by setting a bookmark and then calling
seek( $bookmark_name )
to move there.How?
The bookmarks are tracked internally to the tag processor. Every time we update the document through
set_attribute()
or related functions, the trackers are updated to follow those edits.Bookmarks are two-sided, tracking the start and end of an HTML token. We do this so that we can never accidentally reset a string offset into the middle of a tag or attribute.