Token Map: Introduce an efficient lookup and translation class for string mappings. #5373

dmsnell · 2023-10-03T18:00:45Z

Trac ticket: Core-60698
(previously Core-60841)

Motivated by the need to properly transform HTML named character references (like &) I found the need for a new semantic class which can efficiently perform search and replacement of a set of static tokens. Existing patterns in the codebase are not sufficient for the HTML need, and I suspect there are other use-cases where this class would help.

In #6387 I have built a spec-compliant HTML5 text decoder utilizing this token map. The performance of the new decoder is approximately 20% slower than calling html_entity_decode() directly, except that it properly decodes what PHP can't. In fact, the performance bottleneck in that PR comes from converting into UTF-8 the sequence of digits in numeric character references, not in looking up named character references.

This proposal is adding a new class WP_Token_Map providing at least two methods for normal use:

contains( $token ) returns whether the passed string is in the set.
read_token( $text, $offset = 0, $skip_bytes ) indicates if the character sequence starting at the given offset in the passed string forms a token in the set, and if so, returns the replacement for that token. It also sets &$skip_bytes to the length of the token so that calling code .

It also provides utility functions for pre-computing these classes, as they are designed for relatively-static cases where the actual code is intended to be generated dynamically, but stay static over time. For example, HTML5 defines the set of named character references and indicates that the list shall not change or be expanded. HTML5 spec. Precomputing can save on the startup-up cost of building the optimized lookup tables.

WP_Token_Map::from_array( array $mappings ) generates a new token map from the given array of whose keys are tokens and whose values are the replacements.
to_array() dumps the set of mapping into an array suitable for passing back into from_array().
WP_Token_Map::from_precomputed_table( ...$table ) instantiates a token set from a precomputed table, skipping the computation for building the table and sorting the tokens.
precomputed_php_source_table() generates PHP source code which can be loaded with the previous static method for maintenance of the core static token sets.

Other potential uses

Converting smilies like :)
Converting emoji sequences like :happy:
Determining if a given verb/action is allowed in an API call.

github-actions · 2024-03-25T16:16:53Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, gziolo, jorbin.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

github-actions · 2024-03-25T17:47:57Z

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

dmsnell · 2024-05-08T23:17:30Z

Rebased from e98508f10d17b606cf3f8761c82bb0f8750b263d to 502e4f5

This patch introduces a new class: `WP_Token_Map`, which is designed for efficient lookup and translation of static mappings between string keys or tokens, and string replacements (for example, HTML character references). The token map incorporates certain limitations on the string length of the tokens, but takes advantage of multiple optimizations to efficiently

gziolo

I did some high level pass on the PR and left my feedback. I was looking at the code from the HTML API encoding/decoding angle. I don't have better insights to share than @aaronjorbin in the Trac ticket.

Overall, the code is very complex and very specific to the requirements that HTML API has so I'm not the best person to draw any conclusions on whether this class is general enough. However, based on the explanation provided by @dmsnell, it definitely is important part of the implementation in the terms of performance for the work on HTML entities encoding and decoding.

tests/phpunit/tests/wp-token-map/wpTokenMap.php

tests/phpunit/tests/html-api/generate-html5-named-character-references.php

tests/phpunit/tests/wp-token-map/wpTokenMap.php

src/wp-includes/class-wp-token-map.php

tests/phpunit/tests/wp-token-map/wpTokenMap.php

gziolo · 2024-05-16T16:57:46Z

Some CI jobs report that some static test helpers are private and it can’t access them.

tests/phpunit/tests/html-api/generate-html5-named-character-references.php

sirreal · 2024-05-16T17:11:19Z

tests/phpunit/tests/html-api/generate-html5-named-character-references.php

+
+// phpcs:disable
+
+global \$html5_named_character_references;


Note for other reviewers:

It took me a while to see that \$var is escaping the $ in $var. I kept seeing this as the namespace global space. In the generated file is just global $html5_named_character_references;

Added a note in a comment above the template. Hope that helps

sirreal · 2024-05-16T17:16:12Z

tests/phpunit/tests/wp-token-map/wpTokenMap.php

+		$this->assertSame(
+			self::KNOWN_COUNT_OF_ALL_HTML5_NAMED_CHARACTER_REFERENCES,
+			count( $html5 ),
+			'Found the wrong number of HTML5 named character references: confirm the entities.json file."'
+		);


It seems out of place to have an assertion in the data provider. Maybe another simple test could not annotate the data provider, but call it make an assertion about its length?

🤷‍♂️ maybe, thought then the test isn't testing the unit under test but the test runner itself. it's weird in the data provider, but does it warrant its own separate test?

I thought about throwing in here too but then figured an assertion could be clearer.

I don't have strong feelings about this; the assertion inside the data provider seemed a fitting tool but I'm happy to reconsider.

tests/phpunit/tests/wp-token-map/wpTokenMap.php

tests/phpunit/data/html5-entities/generate-html5-named-character-references.php

tests/phpunit/data/html5-entities/README.md

Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>

sirreal

This is great, but there's a lot in here. I'm digging into the implementation more now and I'll finish a complete review tomorrow.

sirreal · 2024-05-16T17:53:11Z

src/wp-includes/class-wp-token-map.php

+	private $key_length = 2;
+
+	/**
+	 * Stores an optimized form of the word set, where words are grouped


I'd like to see the explicit rule for large vs. small words in their docs:

…where large words are strings whose strlen greater than is key_length

(that's almost certainly wrong, just an example)

added class-level description in 31c57b0

sirreal · 2024-05-16T17:53:54Z

src/wp-includes/class-wp-token-map.php

+	private $groups = '';
+
+	/**
+	 * Stores an optimized row of small words, where every entry is


I'd like to see the explicit rule for large vs. small words in their docs:

…where small words are strings whose strlen is less than or equal to key_length

(that's almost certainly wrong, just an example)

added a class-level description in 31c57b0

dmsnell · 2024-05-17T18:29:29Z

It seems fine to me if we don't support case insensitive non-ascii token matching as long as we're clear about it.

Ha! I went to add these and I already noted this in the function docblock.

'case-insensitive' to ignore ASCII case or default of 'case-sensitive'

Text issues are apparently always on my mind. Let me know if you still think this is unclear. I'm worried that if we try and add more documentation when it's already there right at the cursor when calling the function, then it won't be any more helpful, just more noise.

sirreal · 2024-05-20T10:27:32Z

I already noted this in the function docblock … Let me know if you still think this is unclear.

I think this is fine. I was searching for words like "mb" or "multibyte", but it seems like ASCII is correct 👍

I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?

8.2.0 Case conversion no longer depends on the locale set with setlocale(). Only ASCII characters will be converted.

That seems like exactly what we want, but I've looked around PHP source, the changelog, etc. and I haven't managed to find exactly what changed or examples of the different.

sirreal

I've been over this in detail and left a lot of feedback. Thanks for the patience @dmsnell. I've left what I believe are my final questions, comments, and suggestions, but I feel good about this change in general.

src/wp-includes/class-wp-token-map.php

sirreal · 2024-05-20T12:34:28Z

src/wp-includes/class-wp-token-map.php

+		if ( $ignore_case ) {
+			$search_text = strtoupper( $search_text );
+		}


I was comparing the performance of strtoupper and strcasecmp here. I thought that using strcasecmp would outperform strtoupper, but the results weren't conclusive. I suspect it's likely highly dependent on inputs.

EDIT: To be perfectly clear this is not a change request. I'm documenting that I explored alternatives in the implementation.

patch

diff --git before/src/wp-includes/class-wp-token-map.php after/src/wp-includes/class-wp-token-map.php index 40c0520659..75beabe074 100644 --- before/src/wp-includes/class-wp-token-map.php +++ after/src/wp-includes/class-wp-token-map.php @@ -575,19 +575,16 @@ class WP_Token_Map { * @return string|false Mapped value of lookup key if found, otherwise `false`. */ private function read_small_token( $text, $offset, &$skip_bytes, $case_sensitivity = 'case-sensitive' ) { - $ignore_case = 'case-insensitive' === $case_sensitivity; - $small_length = strlen( $this->small_words ); - $search_text = substr( $text, $offset, $this->key_length ); - if ( $ignore_case ) { - $search_text = strtoupper( $search_text ); - } + $ignore_case = 'case-insensitive' === $case_sensitivity; + $small_length = strlen( $this->small_words ); + $search_text = substr( $text, $offset, $this->key_length ); $starting_char = $search_text[0]; $at = 0; while ( $at < $small_length ) { if ( $starting_char !== $this->small_words[ $at ] && - ( ! $ignore_case || strtoupper( $this->small_words[ $at ] ) !== $starting_char ) + ( ! $ignore_case || 0 !== strcasecmp( $this->small_words[ $at ], $starting_char ) ) ) { $at += $this->key_length + 1; continue; @@ -601,7 +598,7 @@ class WP_Token_Map { if ( $search_text[ $adjust ] !== $this->small_words[ $at + $adjust ] && - ( ! $ignore_case || strtoupper( $this->small_words[ $at + $adjust ] !== $search_text[ $adjust ] ) ) + ( ! $ignore_case || 0 !== strcasecmp( $this->small_words[ $at + $adjust ], $search_text[ $adjust ] ) ) ) { $at += $this->key_length + 1; continue 2;

thanks for the alternative exploration. I am not sure it would be possible to use strcasecmp like this here because we aren't looking for full matches, necessarily, in the small_words array. we're really looking for a match in the $text of a string in the small words array, but the words in the small words array are zero-extended.

for example, if we have ab in $text at the $offset, and "a\x00\x00" in the small words array, we will not match 0 === strcasecmp(), but there is indeed a match.

so we could add a length to the comparison, but then if it didn't match, we'd need to backtrack and try matching with length - 1 and so on until length === 1 and there are no matches.

by crawling forward only in the document we can ensure that any prefix of the input text will match as expected on a word shorter than the longest small word.

The idea is still to do character by character comparison, not full string comparison, but instead of calling strtoupper on the needle and on the characters, strcasecmp does the work of ASCII case-insensitive matching.

sirreal · 2024-05-20T14:51:30Z

src/wp-includes/class-wp-token-map.php

+	 * @param string  $text              String in which to search for a lookup key.
+	 * @param ?int    $offset            How many bytes into the string where the lookup key ought to start.
+	 * @param ?int    &$skip_bytes       Holds byte-length of found lookup key if matched, otherwise not set.
+	 * @param ?string $case_sensitivity 'case-insensitive' to ignore ASCII case or default of 'case-sensitive'.
+	 * @return string|false Mapped value of lookup key if found, otherwise `false`.
+	 */
+	public function read_token( $text, $offset = 0, &$skip_bytes = null, $case_sensitivity = 'case-sensitive' ) {


(this is a reply to other comments - GitHub doesn't really thread well when it's part of a review)

For the record, it seems you can pass an uninitialized variable name to the function and it's not even a warning in this case:

// first place $skip_bytes appears $map->read_token( 'x', 0, $skip_bytes, 'case-insensitive' ); var_dump($skip_bytes); // int(0)

sirreal · 2024-05-20T14:52:36Z

src/wp-includes/class-wp-token-map.php

+	 *
+	 * @param string  $text              String in which to search for a lookup key.
+	 * @param ?int    $offset            How many bytes into the string where the lookup key ought to start.
+	 * @param ?int    &$skip_bytes       Holds byte-length of found lookup key if matched, otherwise not set.


This parameter is very interesting. It seems very important, it's a bit like $matches in preg_match in that it's an output of the function and not an input.

I think the name skip_bytes can be improved. That's very focused on what calling code would likelye do with the value and less what the value is. Maybe something like $matched_token_bytelength or would be a descriptive name of what it is?

Yes, please. It was nearly impossible for me to understand the exact purpose of this $skip_bytes param by looking at code examples. I'm still not sure why it would be useful for developers.

The idea is that you're working with an input like this and using a token map to detect (and perform) replacements:

true ¬ false

We'd like to decode this by replacing entities, so we start by checking matches (this would be in a loop):

$m = WP_Token_Map::from_array([ '¬' => '¬' ]); $s = 'true ¬ false'; // check here // | // v // "true ¬ false" $replacement = $m->read_token( $s, 0, $skip_bytes ); // $replacement = false // $skip_bytes = null // check here // | // v // "true ¬ false" $replacement = $m->read_token( $s, 1, $skip_bytes ); // $replacement = false // $skip_bytes = null // … // check here // | // v // "true ¬ false" $replacement = $m->read_token( $s, 5, $skip_bytes ); // $replacement = "¬" // $skip_bytes = 5

At this point, we've hit a match and would perform replacement, but the function just returned "¬" which means "at index 5 there's a token matching ¬" but we don't know what the token is so it's unclear how we should perform replacement. We need to know where it starts (we know the offset, we passed it in as an argument) what it matches (¬ was returned) but also where the token ends. That's where $skip_bytes is involved, it lets us know the byte length of the matched token so we can replace (or resume looking after the end of the token).

// 5 here is the index in the string where we found the token, we know that from our // original iteration substr( $s, 0, 5 ) . $replacement . substr( $s, 5 + $skip_bytes ); // "true ¬ false"

Ok, so it’s the length of the token matched that we will ignore while constructing the replaced version. The coincidence that the offset and skip bytes were both equal to 5 made this puzzle more challenging, but I hope I got it right. I was mostly making the case that it was the only difficult part to reason in the PHPDoc.

sounds like I should pull in some of the wording from the dev note and move it into the docblocks…

https://make.wordpress.org/core/?p=113042&preview=1&_ppp=452181011b

it’s the length of the token matched

Yep! In this case it's strlen( "¬" ): 5.

the coincidence that the offset and skip bytes were both equal to 5 made this puzzle more challenging.

🤦 yes, poor execution in my example 😓

I've renamed the argument to $matched_token_byte_length. It's verbose, but more explanatory.

sirreal · 2024-05-20T14:53:18Z

src/wp-includes/class-wp-token-map.php

+	 *     false === $smilies->read_token( 'Not sure :?.', 0, $bytes_skipped );
+	 *     '😕'  === $smilies->read_token( 'Not sure :?.', 9, $bytes_skipped );
+	 *     2     === $bytes_skipped;
+	 *
+	 * Example:
+	 *
+	 *     while ( $at < strlen( $input ) ) {
+	 *         $next_at = strpos( $input, ':', $at );
+	 *         if ( false === $next_at ) {
+	 *             break;
+	 *         }
+	 *
+	 *         $smily = $smilies->read_token( $input, $next_at, $bytes_skipped );
+	 *         if ( false === $next_at ) {
+	 *             ++$at;
+	 *             continue;
+	 *         }
+	 *
+	 *         $prefix  = substr( $input, $at, $next_at - $at );
+	 *         $at     += $bytes_skipped;
+	 *         $output .= "{$prefix}{$smily}";
+	 *     }


Note that this example uses bytes_skipped instead of skip_bytes, which is slightly confusing. It would be good to use the parameter name.

they were different because of perspective. including matched_ in the name I think helps the wording on the internal side, but I think that calling code still has a valid reason to use a different name. maybe I can further clarify with something like $handle_length

dmsnell · 2024-05-20T21:59:03Z

I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?

I've struggled to test this with the HTML API. in effect, some system locales could cause, I believe, a few characters to be case-folded differently, but this is such a broken system I'm not of the opinion that it's a big risk here. I'm willing to let this sit as a hypothetical bug until we have a clearer understanding of it.

aaronjorbin

This is looking good. Thanks!

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

aaronjorbin · 2024-05-20T16:31:13Z

src/wp-includes/class-wp-token-map.php

+			) {
+				_doing_it_wrong(
+					__METHOD__,
+					__( 'Token Map tokens and substitutions must all be shorter than 256 bytes.' ),


Minor: Rather than hardcoding 256, I think this should use self::MAX_LENGTH on the chance it changes in the future.

Suggested change

__( 'Token Map tokens and substitutions must all be shorter than 256 bytes.' ),

/* translators: maximum byte length */

sprintf( __( 'Token Map tokens and substitutions must all be shorter than %i bytes.' ), self::MAX_LENGTH ),

good catch! incorporated in 0e7ab4c

aaronjorbin · 2024-05-21T17:22:29Z

tests/phpunit/tests/wp-token-map/wpTokenMap.php

+ * @package WordPress
+ *
+ * @since 6.6.0
+ *


I think we can add a new token-map group. It also might be worth putting in the html-api group since this is where the bigger effort lies

aaronjorbin · 2024-05-21T17:24:18Z

tests/phpunit/tests/wp-token-map/wpTokenMap.php

+	 * be downloaded on every test run. By specification, it cannot change
+	 * and will not be updated.
+	 *
+	 * @ticket 60698


As this is a helper rather than a test, the ticket number isn't necessary

removed in 0e7ab4c

aaronjorbin · 2024-05-21T17:28:03Z

tests/phpunit/tests/wp-token-map/wpTokenMap.php

+			$this->assertSame(
+				$replacement,
+				$response,
+				'Returned the wrong replacement value'


Minor: Make the error a bit more descriptive

Suggested change

'Returned the wrong replacement value'

'Returned the wrong replacement value for ' . $token

included the token in the message in 0e7ab4c

dmsnell · 2024-05-21T23:06:45Z

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).

my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.

do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.

I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?

dmsnell · 2024-05-21T23:18:41Z

I think we can add a new token-map group. It also might be worth putting in the html-api group since this is where the bigger effort lies

@aaronjorbin following the html5lib test suite, I added the wpTokenMap test module to the @group html-api-token-map group. happy to change if this isn't idea; I was looking for more examples and didn't see. it can also be set as a @package and @subpackage but I was looking for @subgroup and didn't find any

aaronjorbin · 2024-05-22T20:39:09Z

@aaronjorbin following the html5lib test suite, I added the wpTokenMap test module to the @group html-api-token-map group. happy to change if this isn't idea; I was looking for more examples and didn't see. it can also be set as a @package and @subpackage but I was looking for @subgroup and didn't find any

@subgroup isn't an annotation that is a part of PHPUnit. I think this group makes sense, thanks!

aaronjorbin · 2024-05-22T20:41:01Z

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).

my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.

do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.

I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?

Upon reading https://github.com/whatwg/html/blob/main/FAQ.md#html-should-add-more-named-character-references I withdrawal this suggestion. I do think it might make sense to link that and mention that the entities json should never change in the readme for others that come across this with a similar thinking as me.

dmsnell · 2024-05-22T21:52:36Z

I do think it might make sense to link that and mention that the entities json should never change in the readme for others that come across this with a similar thinking as me.

Thanks again @aaronjorbin!

It's already mentioned at the top of the README, in the character reference class (and as well in the autogenerator script where it comes from), the PR description, and the Trac ticket description.

Is that good enough or was there somewhere else you wanted it? Maybe the way I wrote it wasn't as direct or clear as it should be?

aaronjorbin

This is looking good to me. Thanks for answering all my questions @dmsnell!

This patch introduces a new class: `WP_Token_Map`, designed for efficient lookup and translation of static mappings between string keys or tokens, and string replacements (for example, HTML character references). The Token Map imposes certain restrictions on the byte length of the lookup tokens and their replacements, but is a highly-optimized data structure for mappings with a very high number of tokens. Developed in #5373 Discussed in https://core.trac.wordpress.org/ticket/60698 Fixes #60698. Props: dmsnell, gziolo, jonsurrell, jorbin. git-svn-id: https://develop.svn.wordpress.org/trunk@58188 602fd350-edb4-49c9-b593-d223f7449a82

This patch introduces a new class: `WP_Token_Map`, designed for efficient lookup and translation of static mappings between string keys or tokens, and string replacements (for example, HTML character references). The Token Map imposes certain restrictions on the byte length of the lookup tokens and their replacements, but is a highly-optimized data structure for mappings with a very high number of tokens. Developed in WordPress/wordpress-develop#5373 Discussed in https://core.trac.wordpress.org/ticket/60698 Fixes #60698. Props: dmsnell, gziolo, jonsurrell, jorbin. Built from https://develop.svn.wordpress.org/trunk@58188 git-svn-id: http://core.svn.wordpress.org/trunk@57651 1a063a9b-81f0-0310-95a4-ce76da25c4cd

This patch introduces a new class: `WP_Token_Map`, designed for efficient lookup and translation of static mappings between string keys or tokens, and string replacements (for example, HTML character references). The Token Map imposes certain restrictions on the byte length of the lookup tokens and their replacements, but is a highly-optimized data structure for mappings with a very high number of tokens. Developed in WordPress/wordpress-develop#5373 Discussed in https://core.trac.wordpress.org/ticket/60698 Fixes #60698. Props: dmsnell, gziolo, jonsurrell, jorbin. Built from https://develop.svn.wordpress.org/trunk@58188 git-svn-id: https://core.svn.wordpress.org/trunk@57651 1a063a9b-81f0-0310-95a4-ce76da25c4cd

dmsnell · 2024-05-23T20:04:38Z

Merged in [58188]
bcd25b1

sirreal · 2024-05-24T20:38:04Z

src/wp-includes/class-wp-token-map.php

+
+		// Longer strings are less-than for comparison's sake.
+		if ( $length_a !== $length_b ) {
+			return $length_b - $length_a;


This is a perfect place for the spaceship operator comparison to return -1, 0, or 1 return $length_b <=> $length_a;

are we sure about this?

we need to make sure that longer strings sort before shorter ones, so for example, ba appears before a

-1 === longest_first_then_alphabetical( 'ba', 'a' ); 1 === 'ba' <=> 'a'; 1 === strcmp( 'ba', 'a' );

Yes, I think this is correct. I meant specifically the length comparison could be return $length_b <=> $length_a;. The result is basically the same, but it returns -1, 0, or 1 instead of <0, 0, or >0.

demo

<?php function longest_first_then_alphabetical( $a, $b ) { if ( $a === $b ) { return 0; } $length_a = strlen( $a ); $length_b = strlen( $b ); // Longer strings are less-than for comparison's sake. if ( $length_a !== $length_b ) { return $length_b - $length_a; } return strcmp( $a, $b ); } function longest_first_then_alphabetical_spaceship( $a, $b ) { if ( $a === $b ) { return 0; } $length_a = strlen( $a ); $length_b = strlen( $b ); // Longer strings are less-than for comparison's sake. if ( $length_a !== $length_b ) { // 👇👇👇 This is the only different line 👇👇👇 return $length_b <=> $length_a; } return strcmp( $a, $b ); } var_dump( longest_first_then_alphabetical( 'baa', 'a'), longest_first_then_alphabetical_spaceship( 'baa', 'a'), "\n", longest_first_then_alphabetical( 'abb', 'a'), longest_first_then_alphabetical_spaceship( 'abb', 'a'), "\n", longest_first_then_alphabetical( 'a', 'abb'), longest_first_then_alphabetical_spaceship( 'a', 'abb'), "\n", longest_first_then_alphabetical( 'a', 'abb'), longest_first_then_alphabetical_spaceship( 'a', 'abb'), "\n", longest_first_then_alphabetical( 'a', 'baa'), longest_first_then_alphabetical_spaceship( 'a', 'baa'), "\n", longest_first_then_alphabetical( 'a', 'baa'), longest_first_then_alphabetical_spaceship( 'a', 'baa'), "\n", longest_first_then_alphabetical( 'a', 'a'), longest_first_then_alphabetical_spaceship( 'a', 'a'), "\n", longest_first_then_alphabetical( 'b', 'a'), longest_first_then_alphabetical_spaceship( 'b', 'a'), "\n", longest_first_then_alphabetical( 'a', 'b'), longest_first_then_alphabetical_spaceship( 'a', 'b'), "\n", );

int(-2)
int(-1)
string(1) "
"
int(-2)
int(-1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(0)
int(0)
string(1) "
"
int(1)
int(1)
string(1) "
"
int(-1)
int(-1)
string(1) "
"

</details>

dmsnell force-pushed the experiment/add-wp-token-set branch from c82d002 to 77b3a5b Compare October 3, 2023 18:01

dmsnell marked this pull request as ready for review March 25, 2024 16:16

dmsnell force-pushed the experiment/add-wp-token-set branch from 45c9bd9 to c749413 Compare March 25, 2024 17:22

dmsnell force-pushed the experiment/add-wp-token-set branch from c749413 to 502e4f5 Compare May 8, 2024 23:17

dmsnell changed the title ~~Experiment: Add optimized set lookup class~~ Token Map: Introduce an efficient lookup and translation class for string mappings. May 13, 2024

dmsnell force-pushed the experiment/add-wp-token-set branch 3 times, most recently from 90ab28d to aa78d9b Compare May 14, 2024 17:25

dmsnell added 2 commits May 14, 2024 20:21

Add case-insensitivity and early-abort for small words when none exist.

96bebe4

gziolo reviewed May 16, 2024

View reviewed changes

PR Feedback

388c50d

dmsnell force-pushed the experiment/add-wp-token-set branch from aa78d9b to 388c50d Compare May 16, 2024 16:14

Verify count of named character references.

45a5236

Shallow abstractions diverging from their truth, tsk tsk

31695c6

sirreal reviewed May 16, 2024

View reviewed changes

tests/phpunit/tests/wp-token-map/wpTokenMap.php Outdated Show resolved Hide resolved

dmsnell added 3 commits May 16, 2024 10:31

Restructure generator script

6055bf3

fixup! Restructure generator script

09dd38b

PR Feedback

2cf613f

sirreal reviewed May 16, 2024

View reviewed changes

tests/phpunit/data/html5-entities/generate-html5-named-character-references.php Show resolved Hide resolved

tests/phpunit/data/html5-entities/README.md Outdated Show resolved Hide resolved

dmsnell and others added 3 commits May 16, 2024 10:42

Update tests/phpunit/data/html5-entities/README.md

f312669

Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>

Fix markdown link reference

d63da61

Add more links

adbf8b6

sirreal reviewed May 16, 2024

View reviewed changes

I'm no longer a fan of this shallow abstraction

b5cdac4

dmsnell added 3 commits May 17, 2024 11:30

Fix string interpolation issue.

9cd6459

Easier to be worse

c0f4973

Update the HTML5 reference table.

1a02856

sirreal approved these changes May 20, 2024

View reviewed changes

Clarify case insensitivity.

7d0cbbe

dmsnell added 3 commits May 20, 2024 15:00

Rename $skip_bytes to $matched_token_byte_length

71ccdf6

Fix compare result description.

42c0d08

Update docblock wording and examples

ff0e59f

dmsnell mentioned this pull request May 20, 2024

HTML API: Plans for WP 6.6 WordPress/gutenberg#60324

Closed

2 tasks

aaronjorbin reviewed May 21, 2024

View reviewed changes

PR Feedback

0e7ab4c

aaronjorbin approved these changes May 23, 2024

View reviewed changes

Merge branch 'trunk' into experiment/add-wp-token-set

225cd85

dmsnell closed this May 23, 2024

dmsnell deleted the experiment/add-wp-token-set branch May 23, 2024 20:04

sirreal reviewed May 24, 2024

View reviewed changes

This was referenced May 29, 2024

HTML API: Add custom text decoder #6387

Closed

Use strcasecmp over strtoupper dmsnell/wordpress-develop#15

Open

	__( 'Token Map tokens and substitutions must all be shorter than 256 bytes.' ),
	/* translators: maximum byte length */
	sprintf( __( 'Token Map tokens and substitutions must all be shorter than %i bytes.' ), self::MAX_LENGTH ),

	'Returned the wrong replacement value'
	'Returned the wrong replacement value for ' . $token

Token Map: Introduce an efficient lookup and translation class for string mappings. #5373

Token Map: Introduce an efficient lookup and translation class for string mappings. #5373

Conversation

dmsnell commented Oct 3, 2023 • edited Loading

Other potential uses

github-actions bot commented Mar 25, 2024 • edited Loading

github-actions bot commented Mar 25, 2024

Test using WordPress Playground

Some things to be aware of

dmsnell commented May 8, 2024

gziolo left a comment • edited Loading

Choose a reason for hiding this comment

gziolo commented May 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sirreal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmsnell commented May 17, 2024

sirreal commented May 20, 2024

sirreal left a comment

Choose a reason for hiding this comment

sirreal May 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sirreal May 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sirreal May 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sirreal May 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmsnell commented May 20, 2024

aaronjorbin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmsnell commented May 21, 2024

dmsnell commented May 21, 2024

aaronjorbin commented May 22, 2024

aaronjorbin commented May 22, 2024

dmsnell commented May 22, 2024

aaronjorbin left a comment

Choose a reason for hiding this comment

dmsnell commented May 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmsnell commented Oct 3, 2023 •

edited

Loading

github-actions bot commented Mar 25, 2024 •

edited

Loading

gziolo left a comment •

edited

Loading

sirreal May 20, 2024 •

edited

Loading

sirreal May 20, 2024 •

edited

Loading

sirreal May 20, 2024 •

edited

Loading

sirreal May 20, 2024 •

edited

Loading