khwilliamson
Contributor

This also fixes two bugs in the 16.0 support, which showed up in the 17.0 test files but apply to 16.0 as well.

  • This set of changes requires a perldelta entry, and it is included.

khwilliamson and others added 9 commits September 11, 2025 18:50
It's too complicated to go into here, but some Asian languages have a
different method than Western ones for determining grapheme clusters.
I'll abbreviate, and not bother with the long names that don't mean
anything to a non-specialist.  The regular expression that is supposed
to match is

    qr/ C [EL]* L [EL]* C /x

This is matched with a little DFA inside regexec.c.  I rewrote that to

    qr/ C [EL]* C /x

and kept a count of the L's encountered.  If at least one L occurred, it
was a match.

The problem is that I wrote a do { } while(), and it really needed a
plain while(), so it was getting false positives in some circumstances.

The code passed the extensive tests furnished by Unicode for 16.0.
They have provided a new test file for 17.0, which has new tests, and
the code failed on a Balinese example.

I also changed from using a counter to using a simple bool, which is
all that is needed.
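
As an illustration of the loop shape described above, here is a minimal
sketch in C.  It is not the code in regexec.c: the enum, the function
name, and the idea of passing the classes as an array are all
hypothetical, chosen only to show a plain while() that tests before it
consumes, plus a bool recording whether an L was seen.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical class codes, abbreviated as in the text above. */
typedef enum { CLS_C, CLS_E, CLS_L, CLS_OTHER } cls_t;

/* Returns true if the whole of seq[0..len) matches
 *     C [EL]* L [EL]* C
 * It is matched as C [EL]* C, with a flag noting whether at least one
 * L occurred in the middle. */
static bool
matches_c_l_c(const cls_t *seq, size_t len)
{
    size_t i = 0;
    bool seen_l = false;

    if (len < 2 || seq[0] != CLS_C) {
        return false;
    }
    i++;

    /* A plain while() checks the class before consuming it, so a
     * middle part with zero E's and L's is handled correctly. */
    while (i < len && (seq[i] == CLS_E || seq[i] == CLS_L)) {
        if (seq[i] == CLS_L) {
            seen_l = true;
        }
        i++;
    }

    /* Exactly one element must remain, and it must be the final C. */
    return seen_l && i == len - 1 && seq[i] == CLS_C;
}
```

In this sketch, a sequence such as C C or C E C is rejected because no
L was seen, while C L C and C E L E C are accepted.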

This fix applies to 16.0 as well as 17.0.

A combining mark (or ZWJ) usually attaches to the preceding character.
That makes sense: an 'a' followed by an acute accent is considered a
single unit.

But marks do not attach to some classes of characters.  If you have a
space followed by an acute accent, the accent stands on its own and
doesn't hang over the space.

What Unicode says to do, then, is to treat the mark as if it were an
alphabetic.

The implementation of \b{lb} includes a bunch of DFAs.  And in several,
it didn't implement this properly.

This commit fixes that.  When parsing backwards in the input to examine
the context, some DFAs are supposed to ignore intervening marks.  But
when the parse reaches the character before the marks, and that
character is one the marks don't attach to, it should return alphabetic
instead of that character.

It required some calls to the backwards parse routine to change to
handle the marks themselves.
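
The backwards parse described above can be sketched roughly as follows.
This is an illustration only, not the regexec.c implementation: the
class names, the helper marks_attach_to(), and the array input are
hypothetical, and the set of classes that marks don't attach to is
reduced here to just the space from the example.  The handling of marks
at the very start of the input is also an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical line break classes for this sketch. */
typedef enum { LB_AL, LB_MARK, LB_SP, LB_OTHER, LB_EDGE } lb_t;

/* Assumption: only a space refuses marks in this simplified model. */
static bool
marks_attach_to(lb_t cls)
{
    return cls != LB_SP;
}

/* Returns the effective class of what precedes position pos,
 * ignoring any intervening marks. */
static lb_t
effective_class_before(const lb_t *seq, size_t pos)
{
    bool skipped_mark = false;

    /* Back up over any run of marks. */
    while (pos > 0 && seq[pos - 1] == LB_MARK) {
        pos--;
        skipped_mark = true;
    }

    if (pos == 0) {
        /* Assumption: marks with nothing before them stand alone. */
        return skipped_mark ? LB_AL : LB_EDGE;
    }

    /* The marks could not attach to this character, so they act as
     * an alphabetic rather than taking on this character's class. */
    if (skipped_mark && !marks_attach_to(seq[pos - 1])) {
        return LB_AL;
    }

    return seq[pos - 1];
}
```

With this sketch, a space followed by a mark yields alphabetic, while a
space with no following mark still yields the space itself.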

The code passed the extensive tests furnished by Unicode for 16.0.
They have provided a new test file for 17.0, which has new tests, and
it failed for one test.

This fix applies to 16.0 as well as 17.0.

During the development of Unicode 16.0 support, I planned not to
bother supporting these specialist hieroglyphic properties.  It turned
out not to be much work, so I ended up supporting them, but I forgot to
remove this code, which allowed them to be empty.

Some of the properties in this Unikemet database are provisional, so
Unicode doesn't furnish information about them, and we have to supply
it ourselves.  I didn't get it right previously.

This also changes one property to an enum, which it should have been all
along, and adds the current possible enum values.

Several Egyptian Hieroglyph properties are provisional.  That fact was
not previously noted.

Indent statement properly.

Unicode 16.0 created a subcategory of hyphens containing just U+2010
"HYPHEN".  They did not name it, so I called it U2010.

Unicode 17.0 does name it, as HH (and adds more code points to it).  So
this commit changes the name to HH, in preparation for 17.0.

This includes updates to a few perl files that need to know the
current Unicode version, and regenerating perl files that depend on the
Unicode data.

This adds full support for this latest version of Unicode.  What was
essentially missing was updating the rules for the break properties,
like \b{wb}.  This is always a pain, but the changes made for 15.1 and
16.0 made it much easier.