Solving rewriting site URLs in WordPress using the HTML API and URL parser #74

adamziel · 2024-07-10T15:47:48Z

Every time we want to migrate content to and from WordPress, we need to replace the original site URLs with the target site URL.

Traditional methods like wp search-replace just don't cut it. I've recently explored a solution based on HTML API that, despite being an early prototype, may already be the most comprehensive and correct URL rewriting tool out there.

In this discussion, I'd like to:

Gather feedback on the overall approach and ideas, learn what's exciting and what's missing.
Propose this approach for Data Liberation, WordPress Playground, and eventually for WordPress core.

Also, a lot of credit for these ideas goes to @dmsnell who spent countless hours building block parsers, HTML parsers, fixing unicode issues, and just being awesome.

The Problem with Traditional Methods

Traditional methods of URL replacement in WordPress, such as using the wp search-replace CLI command, come with several limitations that can lead to various issues. These problems stem from the simplistic nature of these methods, which treat the content as plain text without understanding the context or structure of the document. The primary pitfalls include:

Inconsistent Replacements

Traditional URL replacement methods rely on straightforward string matching and replacement techniques. While this approach can be effective for simple cases, it often leads to inconsistent replacements in more complex scenarios. For example:

Substring Matching: If you need to replace https://science.com with a new URL, the tool might inadvertently replace instances where science.com appears as part of a larger URL like https://science.comcast.net, leading to incorrect and broken URLs.
Case Sensitivity Issues: These methods might not handle different cases (e.g., Science.com vs. science.com) consistently, resulting in partial or missed replacements.
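
Both pitfalls are easy to reproduce. Here is a minimal Python sketch that contrasts plain substring replacement with comparing parsed hostnames. The sample text and the `points_to` helper are made up for illustration and are not taken from any existing tool:

```python
from urllib.parse import urlsplit

old, new = "https://science.com", "https://newsite.com"
content = (
    "https://science.com/about "
    "https://science.comcast.net/plans "
    "https://Science.com/contact"
)

# Plain substring replacement has no notion of where a hostname ends.
# It corrupts the comcast.net URL and misses the capitalized variant.
naive = content.replace(old, new)

def points_to(url: str, host: str) -> bool:
    """Hypothetical helper: compare the parsed, lowercased hostname, not raw text."""
    return urlsplit(url).hostname == host

matches = [u for u in content.split() if points_to(u, "science.com")]
print(naive)
print(matches)
```

Comparing hostnames after parsing sidesteps both problems: `science.comcast.net` is a different host, and `Science.com` is the same one.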

Lack of Context

The traditional methods treat the entire content as raw text and lack an understanding of the document’s structure. This can cause several issues:

HTML attributes: The search-replace operation does not distinguish between URLs in HTML tag attributes (like href or src) and URLs that may appear in plain text, comments, or scripts. For instance, altering <div id="https://science.com"> to <div id="https://newsite.com"> might affect JavaScript or CSS, leading to unintended behaviors.
Structured Data: Data serialized in formats like JSON, where URLs are part of a more complex structure, might go unreplaced at best or get malformed at worst. A URL found in text needs to be escaped differently than one found in a <a href> attribute or inside block markup.

Here are a few examples:

<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->
&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
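
Each of those forms needs a different decoder before the URL can even be recognized. A rough Python sketch using only standard-library decoders, as a stand-in for what an HTML and block parser would do (this is not the prototype's code):

```python
import html
import json
from urllib.parse import unquote

# The block attributes and the <img> attribute from the example above.
block_attrs = r'{"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"}'
img_src = "&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png"

# Layer 1: JSON escapes (\/ and \uXXXX, including the surrogate pair for the emoji).
from_json = json.loads(block_attrs)["src"]
# Layer 2: percent-encoding inside the path.
from_json_decoded = unquote(from_json)
# HTML attribute values use character references instead.
from_html = html.unescape(img_src)

print(from_json_decoded)
print(from_html)
```

A plain text search for `science.com` matches neither raw input; it only becomes visible after decoding.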

Punycode, URL Encoding

The URL syntax described in the WHATWG URL standard isn't trivial. There are special rules for encoding unicode characters, and they're different in paths and query strings. Here are just two:

Punycode: Internationalized domain names (IDNs) often use Punycode encoding (e.g., https://xn---science-7f85g.com for https://🚀-science.com). Simple search-and-replace methods do not account for these encoded values, potentially missing them or corrupting the URL.
Encoded URLs: URLs often contain encoded characters like %20 for spaces, making direct matching tricky. A naive replacement might fail to recognize or properly handle these encodings, leading to incomplete or erroneous replacements.

The same URL may be expressed in a lot of different ways, for example:

🚀-science.com/science
🚀-science.com/%73%63ience
https://xn---science-7f85g.com/science
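
To show that these really are the same URL, here is a small Python sketch that reduces each form to a comparable (host, path) pair. It is only an approximation: Python's `urllib.parse` is not a WHATWG parser, the `normalize` helper is hypothetical, and the host handling uses the raw `punycode` codec without the IDNA mapping steps a browser applies:

```python
from urllib.parse import unquote, urlsplit

def normalize(url: str) -> tuple:
    """Hypothetical helper: return (ascii_host, decoded_path) for comparison."""
    if "://" not in url:
        url = "https://" + url  # assumption: treat scheme-less input as https
    parts = urlsplit(url)
    labels = []
    for label in (parts.hostname or "").split("."):
        if label.isascii():
            labels.append(label)
        else:
            # Non-ASCII labels become "xn--" plus their Punycode form.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels), unquote(parts.path)

variants = [
    "🚀-science.com/science",
    "🚀-science.com/%73%63ience",
    "https://xn---science-7f85g.com/science",
]
normalized = {normalize(v) for v in variants}
print(normalized)
```

All three variants collapse to a single entry in the set.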

Other Edge Cases

In real-world use cases, URLs can take various forms and structures that challenge traditional search-replace methods:

Variants and Subdomains: URLs can have different subdomains, paths, or query parameters. A method targeting https://science.com might miss https://blog.science.com or https://science.com/path?query=1. The person doing the migration might want to either preserve or replace the latter two.
Even more contextual awareness: A URL found inside a <script> tag might need to be migrated or might need to be left alone. Ditto for URLs found in HTML attributes such as class.
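
Whether subdomains should follow the main domain is a policy decision, so it helps to make it an explicit switch. A minimal Python sketch, where `host_matches` and its `include_subdomains` flag are hypothetical names used only for illustration:

```python
def host_matches(host: str, target: str, include_subdomains: bool = False) -> bool:
    """Return True if `host` is `target`, or (optionally) one of its subdomains."""
    host = host.lower().rstrip(".")
    target = target.lower().rstrip(".")
    if host == target:
        return True
    # Require a dot before the target so "notscience.com" never matches.
    return include_subdomains and host.endswith("." + target)

print(host_matches("blog.science.com", "science.com"))
print(host_matches("blog.science.com", "science.com", include_subdomains=True))
```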

The Solution Using HTML API

The HTML API-based prototype I’ve been developing addresses these traditional pitfalls by leveraging a more sophisticated approach to URL replacement that includes:

Contextual Awareness: By parsing the HTML content, we can correctly decode the data, parse it as a URL, choose to replace specific parts like the domain and the path, and then encode it back as, e.g., JSON inside an HTML comment (for block markup), an HTML attribute, or just text within the document.
WHATWG-compliant URL Parser: By parsing URLs in the same way as a browser, we're able to correctly recognize 🚀-science.com/science and https://xn---science-7f85g.com/%73%63ience as the same URL.
Stream-rewriting: Instead of writing to the database and then issuing a ton of UPDATE queries, we do all the rewriting before the data ever makes it into the database. This enables tracking progress, short-circuiting on error, retrying, and frontloading media files. We can always be sure that every post in the database was correctly migrated and doesn't have to be processed again.
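
The "encode it back" step is where context matters most: the same rewritten URL has to be escaped differently depending on where it lands. A rough Python sketch under stated assumptions: both helper names are hypothetical, and the slash escaping simply mirrors the block markup examples in this post rather than any official serializer:

```python
import html
import json

def encode_for_block_comment(attrs: dict) -> str:
    """Serialize block attributes as compact JSON, with "/" escaped as "\\/"."""
    # "/" can only occur inside JSON strings, so a global replace is safe here.
    return json.dumps(attrs, separators=(",", ":")).replace("/", "\\/")

def encode_for_html_attribute(url: str) -> str:
    """Escape characters that are special inside a quoted HTML attribute."""
    return html.escape(url, quote=True)

url = "https://science.wordpress.com/wp-content/image.png"
print(encode_for_block_comment({"src": url}))
print(encode_for_html_attribute(url + "?w=100&h=50"))
```

A real serializer would also need to escape anything that could terminate the surrounding HTML comment, such as `--`, which this sketch leaves out.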

Technical details

Here are a few highlights from the https://github.com/adamziel/site-transfer-protocol/ repository where the prototype lives:

WP_Block_Markup_Processor extends WP_HTML_Tag_Processor with the ability to parse and rewrite block attributes.
WP_Block_Markup_Url_Processor also ships the next_url() method capable of semantically finding the next URL in text nodes, HTML attributes, and block markup. It also provides a set_url() method that performs a context-aware substitution, escaping, and encoding.
public_suffix_list.php is used to reduce false-positive URL parsing, e.g. we don't want to even consider index.html in the index.html file as a URL, but we do want to consider wordpress.org in the wordpress.org site as one.
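
To illustrate the public suffix idea, here is a toy Python sketch. The `KNOWN_SUFFIXES` set is a tiny hypothetical stand-in for the real Public Suffix List, and `looks_like_bare_domain` is an invented name; the prototype's actual logic may differ:

```python
# Tiny stand-in for the Public Suffix List; the real list has thousands of
# entries, including multi-label suffixes like "co.uk" that this ignores.
KNOWN_SUFFIXES = {"com", "org", "net", "blog"}

def looks_like_bare_domain(token: str) -> bool:
    """Treat a scheme-less token as a URL candidate only if it ends in a known suffix."""
    host = token.split("/", 1)[0].lower()
    labels = host.split(".")
    return len(labels) >= 2 and all(labels) and labels[-1] in KNOWN_SUFFIXES

print(looks_like_bare_domain("wordpress.org"))
print(looks_like_bare_domain("index.html"))
```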

Examples

Here's a sample of what the URL rewriting prototype can already do today. We're migrating https://🚀-science.com/science to https://science.wordpress.com:

Inline text

<!-- wp:paragraph --><p>🚀-science.com/science</p><!-- /wp:paragraph -->

Gets rewritten as:

<!-- wp:paragraph --><p>science.wordpress.com</p><!-- /wp:paragraph -->

Punycode and HTML entities in text

<!-- wp:paragraph --><p>&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/</p><!-- /wp:paragraph -->

Gets rewritten as:

<!-- wp:paragraph --><p>https://science.wordpress.com/</p><!-- /wp:paragraph -->

Similar-looking domains

	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science

Gets rewritten as:

	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science

Block attributes

<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
    <img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

Gets rewritten as:

<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->

Non-URL attributes

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>

Gets rewritten as:

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>

dmsnell · Jul 10, 2024

Because many existing solutions lack the ability to efficiently parse all of the URLs in a document, they often resort to full-URL string replace. This works pretty well because it's not likely to have https://fluffyandflakey.blog/wp-content/uploads/2024/06/rickroll.mov be a substring of any other URL in the post. However, this also leads to the iterative nature of replacing them, as we often replace one URL at a time in a document and then come back later to replace the next one. This can lead to heavy database pressure on large imports.

With what you have, a way to efficiently and effectively determine "is this a link and does it point to the old domain, does it need rewriting to the new one?" we can now collapse all of these multiple passes into a single transformation of the document, rewriting only the domains that need changing, and possibly the whole base URL if the new site isn't at the / path.
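
As a rough illustration of collapsing those passes, here is a Python sketch that visits every URL candidate once and decides per match whether to rewrite it. The regular expression is only a crude stand-in for a real URL finder like the next_url() method described above, and `rewrite_all` plus the example domains are hypothetical:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Crude URL finder for the sketch; it can over-match trailing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def rewrite_all(text: str, old_host: str, new_base: str) -> str:
    """Rewrite every URL on `old_host` in one pass, prefixing the new base path."""
    new = urlsplit(new_base)

    def rewrite(match):
        url = urlsplit(match.group(0))
        if url.hostname != old_host:
            return match.group(0)  # points elsewhere, leave untouched
        path = new.path.rstrip("/") + url.path
        return urlunsplit((new.scheme, new.netloc, path, url.query, url.fragment))

    return URL_PATTERN.sub(rewrite, text)

text = (
    "See https://old.example/wp-content/a.png and https://other.example/x "
    "and https://old.example/wp-content/b.mov"
)
rewritten = rewrite_all(text, "old.example", "https://new.example/blog")
print(rewritten)
```

Both media URLs are rewritten in the same traversal, including the `/blog` base path of the new site, while the unrelated host is left alone.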

0 replies

rene-hermenau · 2024-07-10T18:46:09Z

I haven't tried the HTML API yet, but from a glance at the code I just want to throw in here that "parsing" a page the way the HTML API does probably takes a lot of time compared to the "traditional" method of running a search and replace. When running a migration for a site that has hundreds of thousands of posts of different post types, parsing each one will take a long time.

For migrating specific block content from one site to another the HTML API approach will work fine for sure, but for full site migrations I forecast a potential problem when it comes to speed. There are databases that have millions of rows and we need to make sure that all rows are going through a replacement for a full site migration.

A few special S/R rules that need to be taken into account are:

Objects in the database including "Incomplete" ones
Arrays
Serialized data
Subdomains
Sites in a subfolder
base64 encoded data
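
Serialized data is a good example of why these rules exist: PHP's serialize() format stores each string with its byte length, so changing a URL without updating the length makes the value unreadable. A minimal Python sketch of the fix-up follows. It is not WP Staging's implementation, `replace_in_serialized` is a hypothetical name, and nested serialized strings are not handled:

```python
import re

# Start of a serialized PHP string token: s:<byte length>:"
STRING_TOKEN = re.compile(rb's:(\d+):"')

def replace_in_serialized(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace `old` inside serialized string values and recompute their lengths."""
    out = bytearray()
    pos = 0
    while True:
        m = STRING_TOKEN.search(data, pos)
        if m is None:
            out += data[pos:]
            return bytes(out)
        length = int(m.group(1))
        value = data[m.end():m.end() + length].replace(old, new)
        out += data[pos:m.start()]
        out += b's:%d:"%s"' % (len(value), value)
        pos = m.end() + length + 1  # skip past the closing quote

data = b'a:1:{s:3:"url";s:19:"https://science.com";}'
old, new = b"https://science.com", b"https://newsite.wordpress.com"
naive = data.replace(old, new)               # keeps the stale s:19 prefix
fixed = replace_in_serialized(data, old, new)
print(naive)
print(fixed)
```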

You can have a look at the search & replace class that we created for WP Staging, which contains a few more special cases. It's robust and handles all kinds of special cases that we collected over the years. It's fully unit tested (although our tests are not in our public GitHub repo).
Entry point is here: https://github.com/wp-staging/wp-staging/blob/master/Framework/Database/SearchReplace.php

7 replies

adamziel Jul 11, 2024
Author

base64 encoded data

This one's interesting – would you mind elaborating? How did this come up first?

adamziel Jul 11, 2024
Author

Just to add to @dmsnell's point:

There are databases that have millions of rows and we need to make sure that all rows are going through a replacement for a full site migration.

With this "rewrite before inserting" approach, we can process in batches of data and pause or resume as needed. A crash, a PHP timeout, or a power outage halfway through the migration wouldn't trash the progress. We'd just pick it up where we left off, potentially surfacing any ambiguous cases, inaccessible media attachments, and any other issues to the user to make a decision, e.g. "retry", "ignore", "upload from a local directory" etc.
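
A minimal Python sketch of that pause-and-resume idea, assuming a simple JSON checkpoint file. The `migrate` function, the checkpoint format, and the simulated failure are all made up for illustration:

```python
import json
import os
import tempfile

def migrate(posts, rewrite, checkpoint_path, batch_size=2):
    """Rewrite posts in batches; persist the index of the next unprocessed post."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next"]
    done = []
    for i in range(start, len(posts), batch_size):
        batch = posts[i:i + batch_size]
        done.extend(rewrite(post) for post in batch)
        # Only a fully processed batch advances the checkpoint.
        with open(checkpoint_path, "w") as f:
            json.dump({"next": i + len(batch)}, f)
    return done

# Simulate a crash on post "d" during the first run, then resume.
crashed = []
def flaky_rewrite(post):
    if post == "d" and not crashed:
        crashed.append(True)
        raise RuntimeError("simulated PHP timeout")
    return post.upper()

checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
posts = ["a", "b", "c", "d", "e"]
try:
    migrate(posts, flaky_rewrite, checkpoint)
except RuntimeError:
    pass
resumed = migrate(posts, flaky_rewrite, checkpoint)
print(resumed)
```

The second run restarts at the interrupted batch, so post "c" is rewritten twice. In a real importer the batch insert and the checkpoint update would need to be idempotent or committed together.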

rene-hermenau Jul 11, 2024

This one's interesting – would you mind elaborating? How did this come up first?

Use WPBakery 6.6.0 (the version our client used was 6.6.0) and then create a new page with "Raw HTML" module. It will encode the data like below:

We had a few clients who used that RAW module so the contained links in that block of encoded data could not be replaced.

Not sure if this is still a relevant issue. It was 2021 when we had the first occurrence of that happening. As WPBakery is popular, we decided to create a fix for that.
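
The general shape of such a fix can be sketched in Python: decode the payload, run the replacement on the decoded text, and encode it again. This assumes a plain base64 payload and a hypothetical `rewrite_base64` helper; the exact encoding WPBakery uses may involve additional layers:

```python
import base64

def rewrite_base64(payload: str, old: str, new: str) -> str:
    """Decode a base64 payload, replace the URL in the decoded text, re-encode."""
    decoded = base64.b64decode(payload).decode("utf-8")
    return base64.b64encode(decoded.replace(old, new).encode("utf-8")).decode("ascii")

raw_html = '<a href="https://science.com/">Science</a>'
payload = base64.b64encode(raw_html.encode("utf-8")).decode("ascii")

# The URL is invisible to a text search on the stored value.
print("science.com" in payload)

rewritten = rewrite_base64(payload, "https://science.com", "https://newsite.com")
print(base64.b64decode(rewritten).decode("utf-8"))
```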

adamziel Jul 11, 2024
Author

Oh wow, that's super useful – thank you for sharing!

rene-hermenau Jul 11, 2024

A crash, a PHP timeout, or a power outage halfway through the migration wouldn't trash the progress.

WP Staging does something similar for the same reason. We run the search & replace operation on the data first, and only when everything has succeeded is it inserted into the db. The data is also written into temp tables first. Once all the data is migrated, the tmp tables are switched over. Extra caution is needed when a database has many tables: the tmp > live table switch then needs batch processing, or there will be timeouts.

A few more challenges to solve when using tmp tables:

Make sure table names never exceed 64 characters, or the operation will fail. This requires a dynamic tmp table prefix, or sometimes even renaming the table to something else entirely. This happens if the server has a table without any prefix whose name already consumes 64 characters.
Take into account that tables can have mixed lower- and upper-case prefixes depending on the OS. Importing a table from Windows to Linux can fail because of that. Issue: https://core.trac.wordpress.org/ticket/44440 (If you can push someone at Automattic to fix this, it would make migration handling less complex and error-prone.)
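
One way to stay under the 64-character limit is to shorten the original name and append a short hash whenever the prefixed name would be too long. A Python sketch of that idea, where the `tmp_` prefix and the `temp_table_name` helper are hypothetical and not what WP Staging actually does:

```python
import hashlib

MAX_IDENTIFIER_LENGTH = 64  # MySQL's limit for table names

def temp_table_name(table: str, prefix: str = "tmp_") -> str:
    """Build a temp table name that never exceeds the identifier limit."""
    candidate = prefix + table
    if len(candidate) <= MAX_IDENTIFIER_LENGTH:
        return candidate
    # Too long: truncate the original name and add a short hash to keep it unique.
    digest = hashlib.sha1(table.encode("utf-8")).hexdigest()[:8]
    keep = MAX_IDENTIFIER_LENGTH - len(prefix) - len(digest) - 1
    return f"{prefix}{table[:keep]}_{digest}"

print(temp_table_name("wp_posts"))
print(len(temp_table_name("x" * 64)))
```

Because the shortened name can no longer be mapped back mechanically, the tmp-to-live mapping would have to be stored somewhere for the final switch.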

If there is a way a user or plugin can do something outside the standards, they will do it for sure :)

~~Are you even considering using temp tables, or are you planning to grab some piece of content, do the replacement, and then immediately insert it into the database?~~
~~The latter would at least require using TRANSACTIONS, to not insert partially replaced data during runtime.~~
(not relevant for this discussion - ignore these questions please.)
