Support Unicode 17.0 #23702

khwilliamson · 2025-09-12T04:13:19Z

This also fixes two bugs in the 16.0 support. They showed up in the 17.0 test files, but apply to 16.0 as well.

This set of changes requires a perldelta entry, and it is included.

It's too complicated to go into here, but some Asian languages have a different method than Western ones for determining grapheme clusters. I'll abbreviate, and not bother with the long names that don't mean anything to a non-specialist. The regular expression that is supposed to match is qr/ C [EL]* L [EL]* C /x. This is matched with a little DFA inside regexec.c. I had rewritten that as qr/ C [EL]* C /x and kept a count of the L's encountered; if at least one L occurred, it was a match. The problem is that I wrote a do { } while(), and it really needed a plain while(), so it was getting false positives in some circumstances. The code passed the extensive tests furnished by Unicode for 16.0. Unicode has provided a new test file for 17.0, which has new tests, and the code failed for a Balinese example. This commit also replaces the counter with a simple bool, which is all that is needed. This fix applies to 16.0 as well as 17.0.
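To make the shape of the fix concrete, here is a minimal sketch in Python rather than the C in regexec.c. It is not the actual code: the function names and the use of one-letter strings for the character classes are assumptions made for illustration. It matches C [EL]* C with a plain while loop and a single bool, and checks the result against the reference pattern.

```python
import re
from itertools import product

# Reference pattern from the description, written over one-letter class codes.
REFERENCE = re.compile(r"C[EL]*L[EL]*C")

def matches_with_flag(classes: str) -> bool:
    """Match C [EL]* C at the start of `classes`, requiring at least one L."""
    if not classes or classes[0] != "C":
        return False
    saw_l = False  # a bool is enough; the number of L's does not matter
    i = 1
    # Plain while loop: the condition is tested before the first iteration,
    # so a run of zero E/L characters is handled without touching the input.
    while i < len(classes) and classes[i] in "EL":
        if classes[i] == "L":
            saw_l = True
        i += 1
    return saw_l and i < len(classes) and classes[i] == "C"

def agrees_with_reference(max_len: int = 6) -> bool:
    """Compare both matchers on every string over C, E, L, X up to max_len."""
    for n in range(max_len + 1):
        for letters in product("CELX", repeat=n):
            s = "".join(letters)
            if matches_with_flag(s) != (REFERENCE.match(s) is not None):
                return False
    return True
```

X stands for any character outside the three classes. A do { } while() version would run the loop body once before testing the condition, which is the kind of difference the exhaustive comparison is meant to catch.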

A combining mark (or ZWJ) usually attaches to the preceding character. That makes sense: an 'a' followed by an acute accent is considered a unit. But marks do not attach to some classes of characters. If you have a space followed by an acute accent, the accent stands on its own and doesn't hang over the space. What Unicode says to do then is to pretend that the mark is actually an alphabetic. The implementation of \b{lb} includes a bunch of DFAs, and several of them didn't implement this properly. This commit fixes that. When parsing backwards in the input to examine the context, some DFAs are supposed to ignore intervening marks. But when the parse gets past the marks and the character there is one that marks don't attach to, it should return alphabetic instead of that character. This commit makes it do so. That required changing some calls to the backwards parse routine so that the callers handle the marks themselves. The code passed the extensive tests furnished by Unicode for 16.0. Unicode has provided a new test file for 17.0, which has new tests, and the code failed for one test. This fix applies to 16.0 as well as 17.0.
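The rule described above can be sketched as follows, again in Python and not as the real routine in regexec.c. The function name and the list-of-class-names input are hypothetical. The set of classes that marks do not attach to is taken from rule LB9 of UAX #14, and the fallback to AL from rule LB10.

```python
# Classes that a following CM or ZWJ does not attach to (UAX #14, rule LB9).
NO_ATTACH = {"BK", "CR", "LF", "NL", "SP", "ZW"}
MARKS = {"CM", "ZWJ"}

def effective_class_backwards(classes, pos):
    """Return the line break class that the element at `pos` acts as.

    `classes` is a list of line break class names, one per character.
    """
    if classes[pos] not in MARKS:
        return classes[pos]
    # Walk backwards over the run of marks to find the base character.
    i = pos
    while i >= 0 and classes[i] in MARKS:
        i -= 1
    # No base at all, or a base that marks do not attach to:
    # the mark is treated as an alphabetic (AL).
    if i < 0 or classes[i] in NO_ATTACH:
        return "AL"
    # Otherwise the marks take on the class of their base.
    return classes[i]
```

For example, with ["SP", "CM"] and pos 1 the function returns "AL", because the accent does not attach to the space, while with ["OP", "CM", "ZWJ"] and pos 2 it returns "OP".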

While developing support for Unicode 16.0, I had planned not to bother supporting these specialist hieroglyphic properties. It turned out not to be much work, so I ended up supporting them, but forgot to remove the code that allowed them to be empty.

Some of the properties in the Unikemet database are provisional, so Unicode doesn't furnish information about them, and we have to do it ourselves. I didn't get it right previously. This commit also changes one property to an enum, which it should have been all along, and adds the currently possible enum values.

Several Egyptian Hieroglyph properties are provisional. That fact was not previously noted.

indent statement properly

Unicode 16.0 created a subcategory of hyphens containing just U+2010 "HYPHEN". They did not name it, so I called it U2010. Unicode 17.0 does name it, as HH (and adds more code points to it). So this commit changes the name to HH, in preparation for 17.0.

This includes updates to a few perl files that need to know the current Unicode version, and regenerates the perl files that depend on the Unicode data.

This adds full support for this latest version of Unicode. What was essentially missing was updating the rules for the break properties, like \b{wb}. This is always a pain, but the changes made for 15.1 and 16.0 made it much easier.

khwilliamson and others added 9 commits September 11, 2025 18:50

Add new Unicode property type: 'provisional'

ad7ce79

Several Egyptian Hieroglyph properties are provisional. That fact was not previously noted.

mktables: White-space only

77decba

indent statement properly

Add Unicode 17.0

563a5a0

This includes updates to a few perl files that need to know the current Unicode version, and regenerates the perl files that depend on the Unicode data.

Support Unicode 17.0

2e07609

This adds full support for this latest version of Unicode. What was essentially missing was updating the rules for the break properties, like \b{wb}. This is always a pain, but the changes made for 15.1 and 16.0 made it much easier.
