HTML API: Optimize low-level parsing details. #6890

dmsnell · 2024-06-24T19:35:10Z

Status

Passes all tests; code audit looks right.
Performs more efficiently in almost all cases and worse in only a few minor ones. The expected speedup is around 3.5% to 7.5% in the cost of parsing the tags in realistic HTML documents.

Summary

Measured change in parsing time to count tags in HTML5 spec:

from 1050ms to 930ms
roughly 13% faster in the HTML spec single-page.html
much less speedup in small documents with a higher text-to-syntax ratio
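
A figure like "13% faster" can be read two ways, so here is a small sketch of the arithmetic behind the numbers above. The helper name `timing_change()` is made up for illustration and is not part of the patch.

```php
<?php
/**
 * Hypothetical helper: express a before/after timing as two percentages.
 */
function timing_change( float $before_ms, float $after_ms ): array {
	return array(
		// Share of the original run time that was eliminated.
		'time_saved_pct'      => ( $before_ms - $after_ms ) / $before_ms * 100,
		// Increase in work done per unit of time.
		'throughput_gain_pct' => ( $before_ms / $after_ms - 1 ) * 100,
	);
}

$change = timing_change( 1050, 930 );
printf( "time saved: %.1f%%\n", $change['time_saved_pct'] );           // time saved: 11.4%
printf( "throughput gain: %.1f%%\n", $change['throughput_gain_pct'] ); // throughput gain: 12.9%
```

Read as throughput, the 1050ms to 930ms change matches the "roughly 13%" figure; read as run time removed, it is closer to 11%.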

Benchmarking

Based on the top100 set of URLs from https://github.com/ada-url/url-various-datasets, I ran a script to count the number of tags in each document, having previously downloaded all URLs.

Of these, not all downloaded successfully and not all were HTML files.

82,406 HTML files were analyzed, representing pages from the top 100 most popular websites.

When counting all tags, trunk took between 310 seconds and 313 seconds across multiple test runs, measured from microtime() within the process parsing the HTML, and only measuring around the next_token() loop.

On this branch, the counting took between 293 seconds and 300 seconds, representing around a 5% real-world improvement in token parsing speed.
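
The measurement described above, timing only the token loop with microtime(), could look roughly like the sketch below. This is not the actual benchmark script. `time_loop()` is a hypothetical helper, and the closure is a stand-in tokenizer (one "token" per `<`) so the example runs without WordPress loaded.

```php
<?php
/**
 * Hypothetical helper: call $step until it returns false and report how
 * many steps ran and how long the loop alone took. Setup is excluded.
 */
function time_loop( callable $step ): array {
	$count = 0;
	$start = microtime( true );
	while ( $step() ) {
		++$count;
	}
	$elapsed = microtime( true ) - $start;

	return array(
		'count'   => $count,
		'seconds' => $elapsed,
	);
}

// Stand-in for the real tokenizer: advance to each `<` in the input.
$html   = '<p>Hello, <b>world</b>!</p>';
$at     = 0;
$result = time_loop(
	function () use ( $html, &$at ) {
		$next = strpos( $html, '<', $at );
		if ( false === $next ) {
			return false;
		}
		$at = $next + 1;
		return true;
	}
);

printf( "%d steps in %.6f seconds\n", $result['count'], $result['seconds'] );
```

With WordPress loaded, the closure would presumably wrap `$processor->next_token()` on a `WP_HTML_Tag_Processor` instance, with the processor constructed before the timer starts so that only the loop is measured.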

For the top100 dataset, this histogram represents the relative parsing speed in MB/s for the branch against trunk.


dmsnell · 2024-06-24T19:37:52Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

-						++$at;
-						continue;
-					}
+				if ( 1 !== strspn( $html, '!/?abcdefghijklmnopqrstuvwxyzABCEFGHIJKLMNOPQRSTUVWXYZ', $at + 1, 1 ) ) {


thanks @adamziel for pointing out to me that strspn() and strcspn() have the $length parameter!
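
For readers unfamiliar with the trick: the fourth argument to strspn() limits how many bytes are examined, so a call with an offset and a length of 1 becomes a single-byte membership test that needs no substring or loop. A minimal sketch follows; `starts_syntax_at()` and `TAG_START_CHARS` are names invented for illustration. The uppercase run in the quoted line skips `D`, and this sketch assumes the full A-Z range was intended.

```php
<?php
// Bytes that may follow `<` to begin a tag, closing tag, comment, or similar syntax.
// Assumption: the full A-Z range is intended (the quoted line omits `D`).
const TAG_START_CHARS = '!/?abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

/**
 * Hypothetical helper: does the byte after the `<` at $at start HTML syntax?
 */
function starts_syntax_at( string $html, int $at ): bool {
	// Offset $at + 1 and length 1: inspect only the byte right after `<`.
	return 1 === strspn( $html, TAG_START_CHARS, $at + 1, 1 );
}

var_dump( starts_syntax_at( '<div>', 0 ) );  // bool(true)
var_dump( starts_syntax_at( '</p>', 0 ) );   // bool(true)
var_dump( starts_syntax_at( 'a < b', 2 ) );  // bool(false)
```

strcspn() accepts the same offset and length arguments, so the complementary "is not one of these bytes" check can be written the same way.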

github-actions · 2024-06-24T19:47:30Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

github-actions · 2024-07-01T18:58:02Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Introduces a number of micro-level optimizations in the Tag Processor to improve token-scanning performance. Should contain no functional changes.

Based on benchmarking against a list of the 100 most-visited websites, these changes result in an average improvement in performance of the Tag Processor for scanning tags from between 3.5% and 7.5%.

Developed in #6890
Discussed in https://core.trac.wordpress.org/ticket/61545

Follow-up to [55203].
See #61545.

git-svn-id: https://develop.svn.wordpress.org/trunk@58613 602fd350-edb4-49c9-b593-d223f7449a82

dmsnell · 2024-07-01T23:35:51Z

Merged in [58613]
cf064ef

HTML API: Optimize low-level parsing details.

4dd166c

Measured change in parsing time to count tags in HTML5 spec: - from 1050ms to 930ms - roughly 13% faster in the worst-case document

dmsnell marked this pull request as ready for review July 1, 2024 18:57

Merge branch 'trunk' into html-api/optimize-parsing

7b3bce1

dmsnell closed this Jul 1, 2024

dmsnell deleted the html-api/optimize-parsing branch July 1, 2024 23:36

dmsnell mentioned this pull request Jul 1, 2024

HTML API: Plans for 6.7 WordPress/gutenberg#60396

Closed

19 tasks

dmsnell mentioned this pull request Aug 13, 2024

HTML API: Only stop on full matches for requested tag name. #7189

Closed
