Tag Processor: Add bookmark system for tracking semantic locations in document #46018

Merged
dmsnell merged 1 commit into trunk from tag-processor/add-bookmarks
Dec 2, 2022

Conversation

@dmsnell (Member) commented Nov 23, 2022

What?

Introduces "bookmarks" to the WP_HTML_Tag_Processor which allow seeking to previously-scanned parts of the document, if they still exist, while maintaining the integrity of the HTML syntax.

Why?

It can be helpful to track a location in an HTML document while updates are being made to it such that we can instruct the Tag Processor to seek to the location of one of the bookmarks.

In this patch we're introducing a bookmarks system to do just that. Bookmarks are referenced by name and handled internally by a tracking object which will follow all updates made to the document. It will be possible to rewind or jump around a document by setting a bookmark and then calling seek( $bookmark_name ) to move there.

How?

The bookmarks are tracked internally to the tag processor. Every time we update the document through set_attribute() or related functions, the trackers are updated to follow those edits.

Bookmarks are two-sided, tracking the start and end of an HTML token. We do this so that we can never accidentally reset a string offset into the middle of a tag or attribute.
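
For readers following along, here is a minimal sketch of the two-sided idea, assuming a bookmark is stored as a pair of start/end byte offsets. The `shift_bookmark()` helper is hypothetical and is not the code in this PR; it only illustrates how both sides of a bookmark could follow one text replacement, and why an edit that cuts across a token boundary leaves nothing safe to seek to.

```php
<?php
/**
 * Hypothetical helper, not part of WP_HTML_Tag_Processor: shifts a two-sided
 * bookmark after one text replacement has been applied to the document.
 *
 * @param array  $bookmark    [ 'start' => int, 'end' => int ] byte offsets of a token.
 * @param int    $start       Start offset of the replaced span.
 * @param int    $end         End offset (exclusive) of the replaced span.
 * @param string $replacement Text written in place of the span.
 * @return array|null Updated bookmark, or null if the edit cut across the token.
 */
function shift_bookmark( array $bookmark, int $start, int $end, string $replacement ) {
	$delta = strlen( $replacement ) - ( $end - $start );

	// Edit lies entirely after the token: nothing moves.
	if ( $start >= $bookmark['end'] ) {
		return $bookmark;
	}

	// Edit lies entirely before the token: both sides move together.
	if ( $end <= $bookmark['start'] ) {
		return array(
			'start' => $bookmark['start'] + $delta,
			'end'   => $bookmark['end'] + $delta,
		);
	}

	// Edit lies within the token's bounds (e.g. a new attribute): only the end moves.
	if ( $start >= $bookmark['start'] && $end <= $bookmark['end'] ) {
		return array(
			'start' => $bookmark['start'],
			'end'   => $bookmark['end'] + $delta,
		);
	}

	// Edit straddles a token boundary: the bookmarked token no longer exists.
	return null;
}
```

Because the bookmark always covers a whole token, a seek can only ever land on the `<` of a tag, never in the middle of one.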

@ockham ockham assigned ockham and unassigned ockham Nov 24, 2022
@ockham ockham added [Type] Experimental Experimental feature or API. [Type] New API New API to be used by plugin developers or package users. labels Nov 24, 2022
@luisherranz luisherranz added [Feature] Patterns A collection of blocks that can be synced (previously reusable blocks) or unsynced [Type] New API New API to be used by plugin developers or package users. and removed [Type] New API New API to be used by plugin developers or package users. [Feature] Patterns A collection of blocks that can be synced (previously reusable blocks) or unsynced labels Nov 24, 2022
@DAreRodz DAreRodz added the [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f label Nov 24, 2022
@adamziel (Contributor) commented Nov 24, 2022

@dmsnell I drafted the rewind() method in e3eddb6 – it's very rough but seems to do the job.

This draft of rewind() applies all the updates using the same reconciliation logic as in get_updated_html(). At this point, the html has been replaced with updated_html. The only other thing we need to do is update the internal pointers: parsed_bytes, updated_bytes, and $this->updated_html to match the bookmark starting position.

We'll probably go a long way from here, but it's a peg in the ground. It's good enough for the following test case to pass:

	public function test_bookmark()
	{
		$p = new WP_HTML_Tag_Processor('<ul><li>One</li><li>Two</li><li>Three</li></ul>');
		$p->next_tag('li');
		$p->set_bookmark('first li');

		$p->next_tag('li');
		$p->set_bookmark('second li');
		$p->set_attribute('foo-2', 'bar-2');

		$p->rewind('first li');
		$p->set_attribute('foo-1', 'bar-1');

		$p->rewind('second li');
		$p->next_tag('li');
		$p->set_attribute('foo-3', 'bar-3');

		$this->assertEquals(
			'<ul><li foo-1="bar-1">One</li><li foo-2="bar-2">Two</li><li foo-3="bar-3">Three</li></ul>',
			$p->get_updated_html()
		);
	}

Oh, and I apologize for the formatting mess – my VisualStudio was too eager to use a different formatting strategy. Also, I renamed the test file only temporarily to get these tests to run without Docker. Let's roll back my dev changes before we merge this PR.

@dmsnell dmsnell force-pushed the tag-processor/add-bookmarks branch from 8b73d46 to b05e975 Compare November 25, 2022 03:34
@adamziel adamziel force-pushed the tag-processor/add-bookmarks branch from e3d7177 to 152556b Compare November 26, 2022 20:35
$this->max_seek_calls = $new_limit;
}
return $this->max_seek_calls;
}
Member Author

given an optional value I would expect this function to reset the limit to the original default value.

return $this->max_seek_calls = is_int( $new_limit ) ? $new_limit : self::MAX_SEEK_OPERATIONS;

@adamziel (Contributor), Nov 27, 2022

The default argument seems confusing. How about we do set_max_seek_calls( $limit ) and get_max_seek_calls() instead?

Member Author

is there a problem simply leaving it exposed as a public property? no need for get_ and set_ functions

Contributor

A public property sounds great

* @param int $new_limit Optional. The maximum allowed number of seek() calls.
* @return number The allowed number of seek() calls.
*/
public function seek_limit( $new_limit = null ) {
Member Author

the name here reads to me like this is another seek operation. maybe we can think of something that more clearly sets a limit, such as set_max_seek_count(), or skip the function and let people modify the public $max_seek_calls directly.

initially my thought was to add something more systematic, like a define, to prevent people from willy-nilly setting this to some arbitrarily-large value. that is, say you have a plugin that wants to seek all over; you'd need to modify the system-wide behavior.

maybe that's not great either because once it's up you can't bring it back down, but then again I don't expect any normal circumstances where we hit the limit; this would be specifically to break out of unbounded loops or cycles.

thoughts?

@adamziel (Contributor), Nov 27, 2022

I'd rather not have a system-wide define – as you said, a plugin could change it and then not restore it later. +1 for a public property.
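
A rough sketch of where this thread landed, assuming a plain public property guarding a counter. `Seek_Limited_Cursor` is a toy class invented for illustration, and the default of 1000 is an assumed value, not necessarily the one in the PR.

```php
<?php
/**
 * Toy cursor, not the real tag processor: counts seek() calls and refuses
 * to continue once a publicly adjustable limit is reached.
 */
class Seek_Limited_Cursor {
	const MAX_SEEK_OPERATIONS = 1000; // Assumed default, chosen for illustration.

	/** @var int Callers may raise or lower this directly. */
	public $max_seek_calls = self::MAX_SEEK_OPERATIONS;

	/** @var int */
	private $seek_count = 0;

	/** @var int */
	private $position = 0;

	/**
	 * Moves the cursor unless the seek budget is exhausted.
	 *
	 * @param int $position Target offset.
	 * @return bool Whether the cursor moved.
	 */
	public function seek( int $position ): bool {
		if ( ++$this->seek_count > $this->max_seek_calls ) {
			return false; // Breaks out of unbounded loops or cycles.
		}
		$this->position = $position;
		return true;
	}

	public function get_position(): int {
		return $this->position;
	}
}
```

The limit lives on the instance, so one caller raising it does not change behavior for any other processor, which was the concern with a system-wide define.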

*
* @since 6.2.0
*/
private function apply_html_updates_and_stay_on_current_tag() {
Contributor

I'd rather emphasise the "replace $this->html with $this->updated_html" part than the stay_on_current_tag part. Any ideas for a better name?

*
* @since 6.2.0
*/
private function apply_html_updates_and_stay_on_current_tag() {
Member Author

I have some questions about the need and use of this function.

can you share why we had to change it to what it is now?
I'm maybe having trouble seeing through the diff to what it's doing and why

Contributor

My largest goal was to avoid calling get_updated_html() in seek() as that name does not communicate how we're applying the updates and then lifting the parser to operate on the updated HTML.

As for the logic itself – we can keep it as it was, I just looked at it and thought it could do the same thing in fewer steps.

@dmsnell (Member Author), Nov 28, 2022

> that name does not communicate how we're applying the updates and then lifting the parser to operate on the updated HTML.

this doesn't seem problematic to me. that we have to update the internal state when seeking isn't something that's exposed to the calling code, and it's irrelevant. it's an implementation detail and we could change it.

to me seek() only implies one thing: resetting the pointer to a different location, which is what it does. internally I'm okay with having this code acknowledge that in the particular approach we've taken to accomplish that goal we have to flush all the current known updates.

Contributor

Oh I meant the „get_updated_html” name, not the „seek” name. It quacks like a getter, but it isn’t one.

Member Author

I'm still not following. how is get_updated_html() not like a getter?

@adamziel (Contributor), Nov 28, 2022

The name suggests it simply returns a value, but in reality it mutates the object. That's typically a domain of functions with names like set_*, do_*, update_*, mutate_* etc. I've been battling this race condition for far too long and now I'd rather avoid using names that don't accurately describe what's going on.

@dmsnell (Member Author), Nov 28, 2022

I'm not really in agreement here; from the outside it doesn't matter because we don't expose that internal accounting. get_updated_html() is idempotent and safe to call. If it isn't then that's a bug, and we haven't exposed a way to extract those internals so as to work around the system.

Do you think it's possible this is a reactive response to a different issue? Seems like making a database write via HTTP call is way different than simply flushing out queued updates on an internal structure.

Contributor

🤔 I guess you’re right, let’s just keep it as it was then.
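
To make the disagreement above concrete, here is a small sketch of a "getter" that flushes queued edits internally yet stays idempotent from the caller's side. `Queued_Edit_Document` is a toy class written for this illustration and is not the tag processor's implementation; it assumes the queued spans never overlap.

```php
<?php
/**
 * Toy document, not the real tag processor: get_updated_html() flushes
 * queued edits as a side effect yet returns the same string on every call.
 */
class Queued_Edit_Document {
	private $html;
	private $queue = array();

	public function __construct( string $html ) {
		$this->html = $html;
	}

	/** Queues an edit instead of applying it immediately. */
	public function queue_replacement( int $start, int $end, string $text ) {
		$this->queue[] = array( $start, $end, $text );
	}

	public function get_updated_html(): string {
		// Apply from the highest offset down so earlier offsets stay valid.
		// Assumes the queued spans do not overlap.
		usort(
			$this->queue,
			function ( $a, $b ) {
				return $b[0] <=> $a[0];
			}
		);
		foreach ( $this->queue as list( $start, $end, $text ) ) {
			$this->html = substr( $this->html, 0, $start ) . $text . substr( $this->html, $end );
		}
		$this->queue = array(); // Internal mutation that callers cannot observe.
		return $this->html;
	}
}
```

The second call finds an empty queue and returns the stored string unchanged, which is the sense in which the method is safe to call repeatedly even though it mutates internal state.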

@ockham ockham force-pushed the tag-processor/add-bookmarks branch from c5f3cc0 to bc7221f Compare November 29, 2022 12:58
@ockham (Contributor) commented Nov 29, 2022

Rebased (to fix unit test errors, see #46093).

@ockham (Contributor) left a comment

LGTM. I don't claim to have the same level of expertise w.r.t. WP_HTML_Tag_Processor, but the new code in this PR makes sense, and the unit test coverage illustrates it nicely (and gives me some more confidence approving it).

I had a question/suggestion w.r.t. the overall interface, but I'll ask that in a separate comment -- it's certainly non-blocking 😄

@ockham (Contributor) commented Nov 29, 2022

> I had a question/suggestion w.r.t. the overall interface

So my overall understanding of WP_HTML_Tag_Processor is that it's basically an HTML blob plus a pointer to an (opening) HTML tag, plus a number of changesets (currently only for classnames and attributes IIUC).

This PR will then add a collection of bookmarks and related methods to set, release, and seek them. Internally, they happen to be stored similarly to the changesets (using WP_HTML_Text_Replacement), and also similarly to the WP_HTML_Tag_Processor object's own HTML tag pointer.

In terms of interface: Would it be possible and make sense to consider using separate instances of WP_HTML_Tag_Processor instead of the internal bookmarks array and methods? So instead of

		$p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li><li>Three</li></ul>' );
		$p->next_tag( 'li' );
		$p->set_bookmark( 'first li' );

		$p->next_tag( 'li' );
		$p->set_attribute( 'foo-2', 'bar-2' );

		$p->seek( 'first li' );
		$p->set_attribute( 'foo-1', 'bar-1' );

		$this->assertEquals(
			'<ul><li foo-1="bar-1">One</li><li foo-2="bar-2">Two</li><li>Three</li></ul>',
			$p->get_updated_html()
		);

something like

		$p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li><li>Three</li></ul>' );
		$first_li = $p->next_tag( 'li' );

		$p->next_tag( 'li' )->set_attribute( 'foo-2', 'bar-2' );

		$first_li->set_attribute( 'foo-1', 'bar-1' );

		$this->assertEquals(
			'<ul><li foo-1="bar-1">One</li><li foo-2="bar-2">Two</li><li>Three</li></ul>',
			$p->get_updated_html()
		);

The main rationale being that I think the mental model is a bit easier if each WP_HTML_Tag_Processor is thought of as just one pointer (rather than a number of pointers -- one of them somewhat superior to the others). The main inspiration being iterators.

We'd probably have to be a bit clever about referencing the same $html "blob" from different instances of WP_HTML_Tag_Processor, and how to deal with changesets; we'd probably also need a reference counter to replace the max bookmarks mechanism.

Or we might consider decoupling the "iterators" a bit more from the blob and changesets?

		$p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li><li>Three</li></ul>' );
		$first_li = $p->first_tag( 'li' ); // first_tag() Returns a WP_HTML_Bookmark object.

		$p->next_tag( $first_li, 'li' )->set_attribute( 'foo-2', 'bar-2' ); // next_tag() now takes two arguments; it also returns a WP_HTML_Bookmark object.

		$p->set_attribute( $first_li, 'foo-1', 'bar-1' );

		$this->assertEquals(
			'<ul><li foo-1="bar-1">One</li><li foo-2="bar-2">Two</li><li>Three</li></ul>',
			$p->get_updated_html()
		);

@dmsnell (Member Author) commented Nov 29, 2022

> Would it be possible and make sense to consider using separate instances of WP_HTML_Tag_Processor instead of the internal bookmarks array and methods?

Thanks @ockham for this question. I don't think this is going to make sense here, for a couple reasons.

As I think you were referencing, if we automatically assign a bookmark on next_tag() we will have to deal with the fact that those bookmarks might be invalidated. We'd be inviting concurrency issues where none existed. Granted, these exist as-is in the PR, except that seeking to a bookmark is a separate activity from updating a tag at the current pointer.

We've already seen one bug where someone tried to ->set_attribute() when there was no tag, but right now at least next_tag() returns a boolean to indicate if it found one. Early on we explored a fluent API, but it seemed to introduce more complication with marginal gains. For example, what is the value-add of writing $p->next_tag()->set_attribute() over $p->next_tag(); $p->set_attribute();? It's not much, and I suspect that we'd be giving up a lot here for minimal gain if we eliminated the call to seek().

> probably also need a reference counter to replace the max bookmarks mechanism.

The potential proliferation of bookmarks is a major issue I think because it's so easy to accidentally fall into, and that's especially true for large documents, which could be something as benign as a long paragraph block.

Do you have any ideas for reference counting that we could implement? So far I've not been able to think of any way it's possible given that we don't have access to usage information in application code. Obviously when the Tag Processor is done or when bookmarks invalidate we know we can release them, but we don't know if someone is holding on to a bookmark or has forgotten about it.


I can definitely see where $first_li->set_attribute( 'foo-1', 'bar-1' ); would be convenient, but it really scares me what implications this brings with it, of moving from a fairly straightforward sequential machine to that of a document with concurrent editors.

When we were in person we also discussed a bookmark class. I think at the time we were discussing holding index information in that class and moved away from it. However, I wonder if we could hold the resource id for the bookmark instead and rely on a class destructor to release the bookmark.

This would make it more possible to do something like you're suggesting, though I'm still leery of it. Also the named parameter in the bookmark is something I added because I thought it would be more capable of pushing people to be lean in how many of these they create.

if ( $p->next_tag( 'li' ) ) {
	// tracks a copy of `$p` internally
	$first_li = $p->bookmark();
}

…

// may fail if the bookmark has been deleted
$first_li->set_attribute( … );

…

// reassign to a new location?
$first_li->update();

though at this point I'm still questioning if the complexity is worth it. exposing these object links vs. string names with manual creation and releasing gives us more ways to confuse object lifetime and instance.
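
The destructor idea floated above could look roughly like the following. Both classes are hypothetical stand-ins written for this illustration; the registry plays the role of the tag processor's internal bookmark table, and the handle carries only a resource id, as suggested in the comment.

```php
<?php
/**
 * Toy registry standing in for the tag processor's internal bookmark table.
 */
class Bookmark_Registry {
	/** @var array<int, int> Resource id => byte offset. */
	private $bookmarks = array();
	private $next_id   = 1;

	public function create( int $offset ): Bookmark_Handle {
		$id                     = $this->next_id++;
		$this->bookmarks[ $id ] = $offset;
		return new Bookmark_Handle( $this, $id );
	}

	public function release( int $id ) {
		unset( $this->bookmarks[ $id ] );
	}

	public function count(): int {
		return count( $this->bookmarks );
	}
}

/**
 * Holds only the resource id; the offset itself stays inside the registry.
 */
class Bookmark_Handle {
	private $registry;
	private $id;

	public function __construct( Bookmark_Registry $registry, int $id ) {
		$this->registry = $registry;
		$this->id       = $id;
	}

	public function __destruct() {
		// Runs when the last reference to the handle goes away.
		$this->registry->release( $this->id );
	}
}
```

Note that the handle keeps the registry alive for as long as the handle exists, which is one form of the lifetime coupling the comment is wary of.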

rambling here so please feel welcome to engage with what I'm sharing.

@adamziel (Contributor)

+1 to what @dmsnell says. While it would be technically possible to introduce an API like you proposed @ockham, it would also add a lot of complexity to the framework and, potentially, make some use-cases more complex for the user. For example, what should happen when I update $first_li in a way that changes or destroys $second_li? At the same time, the main value is saving a few lines of code at best. I'd avoid adding that to the framework until it's absolutely necessary.

@dmsnell dmsnell force-pushed the tag-processor/add-bookmarks branch from 6026b5d to 5fcf00d Compare December 2, 2022 03:03
@dmsnell dmsnell merged commit 5fcf00d into trunk Dec 2, 2022
@dmsnell dmsnell deleted the tag-processor/add-bookmarks branch December 2, 2022 03:46
@ockham (Contributor) commented Dec 12, 2022

> > it's simplifying the mental model…The overall motivation being separation of concerns
>
> Can you help by describing how you find it simpler? And what concerns are you trying to separate?

Granted, there's some degree of subjectivity, but I think I find it easier to reason about bookmarks/pointers if they're represented by a class of their own, vs. by methods of another class (that already has another "pointer"-like thing).

Separation of concerns as elaborated further below in my comment -- my rule-of-thumb would be something like: If I add a new method that does something different, or if I discover that the implementation of some internal state is flawed -- can I contain the changes I need to make to a subset of existing methods, or will I need to modify pretty much everything? E.g. encapsulating the read-only logic in a basic class, and adding the change tracking/"pointer arithmetics" in another layer (i.e. derived class).

> > removing the need for manually calling release_bookmark() would be cool
>
> It's probably worth noting that it isn't currently required in any sense. Once the tag processor is collected so will the bookmarks also be collected.

Ah, thank you for pointing that out 👍

> The introduction of a release inside a destructor is a possible way to contain the additional complexity introduced by creating a class to expose the internals of the tag processor as a bookmark option.

👍

> > Any method for which we cannot guarantee the integrity of pointers (e.g. remove tag?) would thus have to go into the "unsafe" tag processor.
>
> My goal, and baked into the current design, is that we don't have to concern ourselves with the safety of the pointers. Everything in WP_HTML_Tag_Processor should guarantee the safety of the HTML syntax. In the case of removing a tag or removing inner HTML, the pointers will remain safe and an attempt to seek to a pointer whose underlying content is gone will return false - the processor won't be able to seek to that spot and will stay where it is.

Ah, thanks for enlightening -- I was still assuming we'd go with the "unsafe" class, but if we can avoid that, all the better!
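
The behavior described in the quoted reply (a seek to a bookmark whose content is gone returns false and the processor stays put) can be sketched with a toy model. `Bookmarked_Cursor` and its `remove_span()` method are invented for this example and only simulate offsets; no HTML is parsed.

```php
<?php
/**
 * Toy cursor over named offsets, not the real tag processor. A bookmark whose
 * underlying content was removed is dropped, and seeking to it fails safely.
 */
class Bookmarked_Cursor {
	private $bookmarks = array();
	private $position  = 0;

	public function set_bookmark( string $name, int $start, int $end ) {
		$this->bookmarks[ $name ] = array( $start, $end );
	}

	/** Simulates removing the span [ $start, $end ) from the document. */
	public function remove_span( int $start, int $end ) {
		foreach ( $this->bookmarks as $name => list( $b_start, $b_end ) ) {
			if ( $b_start < $end && $b_end > $start ) {
				unset( $this->bookmarks[ $name ] ); // Content is gone.
			} elseif ( $b_start >= $end ) {
				$shift                    = $end - $start;
				$this->bookmarks[ $name ] = array( $b_start - $shift, $b_end - $shift );
			}
		}
	}

	public function seek( string $name ): bool {
		if ( ! isset( $this->bookmarks[ $name ] ) ) {
			return false; // Stay where we are.
		}
		$this->position = $this->bookmarks[ $name ][0];
		return true;
	}

	public function get_position(): int {
		return $this->position;
	}
}
```

Callers never hold raw offsets, so a stale name can at worst produce a `false` return, never a pointer into the middle of a tag.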

> > I'd like to float the idea of a different approach…The read-only tag processor could focus on tokenization and wouldn't need any change tracking and resulting "pointer arithmetics"
>
> This is something where maybe I just need to hear you explain more clearly what the benefits would be, or what concerns you are trying to separate. Maintaining the pointers internally isn't hard, and while there are obvious use cases for read-only querying, I would predict that it'd be hard to try and treat that as a separate thing from the write-ability.

For context, my thinking here is shaped by C++-style iterators. If memory serves (and we're talking ~15 years ago), they can be "output" or "input" (i.e. a dereferenced *iterator can be used on the LHS or RHS of an assignment operator only, respectively), or bidir; and, orthogonal to this, they can be random access or... not? (It's been a while 😅) Algorithms are then defined to be available only for (pairs of) iterators that have the necessary traits -- e.g. a lot of "sort" algorithms will require random access. This means you will typically make a tradeoff between versatility and performance (or memory footprint) when selecting a type of iterator for the job.

At the risk of seeing nails everywhere: I thought that a model like this could carry over to what we're dealing with. We could have different Tag Processor classes that offer different features and guarantees, at the price of more internal complexity (and potentially less performance). As expressed earlier, I was largely seeing three "layers" for the time being:

  1. Read-only
  2. Write access to attributes
  3. Tag removal

My hunch was that the logic that'd be needed to implement each layer's additional features would be fairly self-contained.

> It might be helpful if you could write out some example code for your idea to show what you want but don't have yet. Show me the API you wish you had and how that separation of concerns leads to a simpler model.

I don't currently have any features that I want but aren't there yet! If anything, I wanted to bring this up in case we end up running into any issues with the current implementation as we add more features. Again, I think it's perfectly fine to proceed with the current direction. I'll soon start experimenting with the Tag Processor for the Block Interactivity stuff -- this might inform my opinion much better than my admittedly academic thinking right now 😄

> > users would only pay for what they need? I.e. if they only need read access, there wouldn't be any of the overhead involved that results from adjusting bookmarks
>
> This is how the current system works, so I'm not sure what we would want to change to get what we already have 😉

Fair enough 😅 I'll rephrase: We wouldn't even need the code that would do the bookkeeping in a class like that 😬

@dmsnell (Member Author) commented Dec 12, 2022

> some degree of subjectivity

I'm looking for objective measures:

  • how many new concepts are we introducing? how many are required to learn in order to use the system?
  • how many new areas of coupling are we introducing between modules/classes/files?
  • how many alternate ways of accomplishing the same goal are we introducing?
  • how many abstraction leaks are introduced, by means of exporting internal details between modules/classes/files?

to this end I'm very skeptical of a bookmark class, particularly because I'm still unclear after reading your response what benefit they offer.

> I find it easier to reason about bookmarks/pointers if they're represented by a class of their own

right now a bookmark is a name and leaves no residue in the calling code. you are suggesting we create a new class to couple into the tag processor that we have to track in our class, and on which we want to call methods that affect the internal state of the tag processor.

what is easier about this?

of note: the only viable mechanism for managing bookmarks IMO is within the tag processor, because that's exactly where all the logic for manipulating the input document and its string indexes occurs. if we export this to another class we're going to need to add a lot of weak coupling and remember in all the places to account for those external bookmarks, plus we introduce the bigger question of what happens when bookmarks live on beyond the tag processor.

today none of the hassle or accounting of bookmarks is an issue because you can't have bookmarks outlive a tag processor or get out of sync with one.

> assuming we'd go with the "unsafe" class

I'm still exploring the unsafe class, but given that I think we can eventually support spec-compliant HTML parsing I think WP_HTML_Processor might be adequate, if we give appropriate names to what we're doing, e.g. content_between_balanced_tags().

Bookmarks themselves are fully safe from a syntax perspective, even if their use can lead to broken semantics (e.g. removing a section of the input HTML that removes a tag opener but not its closer).

> I don't currently have any features that I want but aren't there yet!

If we don't have a compelling use-case to lead us, and we don't know what code we want to write, then I'm going to strongly advise that we lean on what we already know and understand before adding new facets to a currently-tiny API surface area.

> My hunch was that the logic that'd be needed to implement each layer's additional features would be fairly self-contained.

It could be possible you're over-estimating how separable these operations are from each other, or under-estimating how unifiable they are.

Early on we discovered that if we collapse the semantic operations into a low-level list of text replacements, a lot of complicated logic actually just falls apart into a relatively simple flow (queue text span replacements, flush them out, continue processing). Doing semantic HTML operations fits into the same bucket by queuing safe document changes within the tag processor. As long as the method we expose maintains the HTML semantics, there's no more risk in removing a node or replacing inner content than there is replacing an attribute, as long as we keep the logic inside the tag processor.
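
As a sketch of the "queue text span replacements, flush them out" flow described above: the three functions below are hypothetical helpers, not the tag processor's internals. They assume non-overlapping spans and use PHP's `htmlspecialchars()` as a stand-in for whatever attribute escaping the real code performs.

```php
<?php
/**
 * Hypothetical helpers, not the tag processor's internals: semantic edits
 * become plain [ start, end, text ] spans that one pass writes out.
 */

/** Describes "add an attribute" as an insertion just before the tag's `>`. */
function attribute_insertion( int $tag_close_offset, string $name, string $value ): array {
	$text = ' ' . $name . '="' . htmlspecialchars( $value, ENT_QUOTES ) . '"';
	return array( $tag_close_offset, $tag_close_offset, $text );
}

/** Describes "remove this node" as replacing its span with nothing. */
function span_removal( int $start, int $end ): array {
	return array( $start, $end, '' );
}

/** Flushes non-overlapping spans in one forward pass over the document. */
function flush_replacements( string $html, array $replacements ): string {
	usort(
		$replacements,
		function ( $a, $b ) {
			return $a[0] <=> $b[0];
		}
	);

	$output = '';
	$cursor = 0;
	foreach ( $replacements as list( $start, $end, $text ) ) {
		// Copy the untouched text up to the span, then the replacement.
		$output .= substr( $html, $cursor, $start - $cursor ) . $text;
		$cursor  = $end;
	}
	return $output . substr( $html, $cursor );
}
```

Once both operations are reduced to the same span shape, the flush step does not need to know whether a span came from an attribute update or a node removal.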

dmsnell added a commit that referenced this pull request Dec 15, 2022
A defect introduced in #46018 led to the tag processor backing up
one index too far after flushing its queued changes on a document.

For most operations this didn't cause any harm because when immediately
moving forward after an update, the `next_tag()` returned to the same
spot: it was backing up to one position before the current tag instead
of at the start of the current tag.

Unfortunately, when the current tag was the first in the document this
would lead the processor to rewind to position `-1`, right before the
start of the document, and lead to errors with `strpos()` when it
received out-of-bounds indices.

In this fix we're correcting the adjustment for the HTML tag's `<` and
documenting the math in the file so that it's clearer why it's there
and providing guidance should another fix be necessary.

Props to @anton-vlasenko for finding this bug.
dmsnell added a commit that referenced this pull request Dec 15, 2022
anton-vlasenko pushed a commit that referenced this pull request Dec 16, 2022
dmsnell added a commit that referenced this pull request Dec 16, 2022
…46598)

A defect introduced in #46018 led to the tag processor backing up
one index too far after flushing its queued changes on a document.

For most operations this didn't cause any harm because when immediately
moving forward after an update, the `next_tag()` returned to the same
spot: it was backing up to one position before the current tag instead
of at the start of the current tag.

Unfortunately, when the current tag was the first in the document this
would lead the processor to rewind to position `-1`, right before the
start of the document, and lead to errors with `strpos()` when it
received out-of-bounds indices.

In this fix we're correcting the adjustment for the HTML tag's `<` and
documenting the math in the file so that it's clearer why it's there
and providing guidance should another fix be necessary.

As supporting work to this patch we're making the text replacement sort
stable, inside the tag processor, when determining the order in which
to apply text replacements. This isn't necessary for the runtime, but
unstable sorting is a nuisance for testing because different PHP versions
produce different unstable sort orderings; enforcing stability prevents
that from causing the unit tests to fail in one version but pass in another.

Props to @anton-vlasenko for finding this bug.

Enforce sort stability when flushing out text replacements
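
One common way to get a stable ordering on every PHP version is to break ties on the original queue position, since PHP's built-in sorts are only guaranteed stable from PHP 8.0 onward. The sketch below shows that technique; `sort_replacements_stably()` is a hypothetical helper, and the actual commit may enforce stability differently.

```php
<?php
/**
 * Sorts [ start, end, text ] replacements by start offset, breaking ties by
 * original queue position so every PHP version yields the same order.
 * Hypothetical helper; the actual patch may enforce stability differently.
 */
function sort_replacements_stably( array $replacements ): array {
	$indexed = array();
	foreach ( array_values( $replacements ) as $index => $replacement ) {
		$indexed[] = array( $index, $replacement );
	}

	usort(
		$indexed,
		function ( $a, $b ) {
			// Compare start offsets first, then fall back to queue order.
			return array( $a[1][0], $a[0] ) <=> array( $b[1][0], $b[0] );
		}
	);

	return array_column( $indexed, 1 );
}
```

Because no two entries ever compare as equal, the result no longer depends on how the underlying sort algorithm treats ties.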
dmsnell added a commit that referenced this pull request Dec 20, 2022
…46598)

@adamziel adamziel mentioned this pull request Feb 9, 2023