Support Unicode 17.0 #23702
Open
khwilliamson wants to merge 9 commits into Perl:blead from khwilliamson:17.0
It's too complicated to go into here, but some Asian languages use a different method than Western ones for determining grapheme clusters. I'll abbreviate, and not bother with the long names, which mean nothing to a non-specialist. The regular expression that is supposed to match is qr/ C [EL]* L [EL]* C /x. This is matched with a little DFA inside regexec.c. I had rewritten that to qr/ C [EL]* C /x and kept a count of the L's encountered; if at least one L occurred, it was a match. The problem is that I wrote a do { } while() where a plain while() was needed, so it got false positives in some circumstances. The code passed the extensive tests furnished by Unicode for 16.0. They have provided a new test file for 17.0, which has new tests, and the code failed for a Balinese example. I also replaced the counter with a simple bool, which is all that is needed. This fix applies to 16.0 as well as 17.0.
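The backwards scan described above can be sketched roughly as follows. This is not the code in regexec.c; the type incb_class and the function incb_joins_previous are hypothetical names made up for illustration, and the input is assumed to be an array of already-classified characters. It shows only the shape of the fix: a plain while loop that tests before it consumes anything, and a bool in place of a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified classes: C = consonant, E = extend, L = linker. */
typedef enum { INCB_NONE, INCB_CONSONANT, INCB_EXTEND, INCB_LINKER } incb_class;

/* Returns true if cls[pos] is the final C of  C [EL]* L [EL]* C,
 * found by scanning backwards from pos. */
bool incb_joins_previous(const incb_class *cls, size_t len, size_t pos)
{
    assert(pos < len);

    if (cls[pos] != INCB_CONSONANT)
        return false;

    bool saw_linker = false;   /* a bool suffices; no count is needed */
    size_t i = pos;

    /* Plain while(): the condition is checked before each step back, so
     * a sequence with nothing between the consonants takes zero steps. */
    while (i > 0 && (cls[i - 1] == INCB_EXTEND || cls[i - 1] == INCB_LINKER)) {
        if (cls[i - 1] == INCB_LINKER)
            saw_linker = true;
        i--;
    }

    /* Match only if at least one L was seen and the run is preceded by a C. */
    return saw_linker && i > 0 && cls[i - 1] == INCB_CONSONANT;
}
```

Under these assumptions, the sequence C L C matches at its last element, while C E C and C C do not, because no linker is seen between the consonants.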
A combining mark (or ZWJ) usually attaches to the preceding character. That makes sense: an 'a' followed by an acute accent is considered a unit. But marks do not attach to some classes of characters. If you have a space followed by an acute accent, the accent stands on its own and doesn't hang over the space. What Unicode says to do in that case is to pretend that the mark is actually an alphabetic. The implementation of \b{lb} includes a bunch of DFAs, and several of them didn't implement this properly. This commit fixes that. When parsing backwards in the input to examine the context, some DFAs are supposed to ignore intervening marks. But when the parse gets to the end and the character is one that marks don't attach to, it should return alphabetic instead of that character. This commit changes the code to do that. It required changing some calls to the backwards parse routine so that the callers handle the marks themselves. The code passed the extensive tests furnished by Unicode for 16.0. They have provided a new test file for 17.0, which has new tests, and the code failed one of them. This fix applies to 16.0 as well as 17.0.
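A minimal sketch of that backwards parse, assuming the input has already been reduced to an array of line break classes. The names lb_class, is_mark, rejects_marks and lb_effective_class are hypothetical and do not come from regexec.c. The set of classes that marks do not attach to (BK, CR, LF, NL, SP, ZW) follows rule LB9 of the Unicode line breaking algorithm, and treating a leftover mark as AL follows rule LB10.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of line break classes for illustration. */
typedef enum {
    LB_AL, LB_BK, LB_CM, LB_CR, LB_LF, LB_NL, LB_SP, LB_ZW, LB_ZWJ, LB_OTHER
} lb_class;

static bool is_mark(lb_class c)
{
    return c == LB_CM || c == LB_ZWJ;
}

/* Classes that a following mark does not attach to (LB9). */
static bool rejects_marks(lb_class c)
{
    return c == LB_BK || c == LB_CR || c == LB_LF
        || c == LB_NL || c == LB_SP || c == LB_ZW;
}

/* Effective class of the unit ending at cls[pos], parsing backwards. */
lb_class lb_effective_class(const lb_class *cls, size_t len, size_t pos)
{
    assert(pos < len);

    if (!is_mark(cls[pos]))
        return cls[pos];          /* not a mark: nothing to skip */

    size_t i = pos;
    while (i > 0 && is_mark(cls[i - 1]))
        i--;                      /* i ends on the first mark of the run */

    /* No base at all, or a base that marks don't attach to: the marks
     * stand on their own and are treated as alphabetic (LB10). */
    if (i == 0 || rejects_marks(cls[i - 1]))
        return LB_AL;

    return cls[i - 1];            /* marks take the class of their base */
}
```

With this sketch, a space followed by a combining mark yields LB_AL when queried at the mark, while the space itself still yields LB_SP.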
During the development of Unicode 16.0 support, I planned not to bother supporting these hieroglyphic specialist properties. It turned out not to be much work, so I ended up supporting them, but forgot to remove this code, which allowed them to be empty.
Some of the properties in this Unikemet database are provisional, so Unicode doesn't furnish information about them, and we have to do it ourselves. I didn't get it right previously. This also changes one property to an enum, which it should have been all along, and adds the currently possible enum values.
Several Egyptian Hieroglyph properties are provisional. That fact was not previously noted.
indent statement properly
Unicode 16.0 created a subcategory of hyphens containing just U+2010 "HYPHEN". They did not name it, so I called it U2010. Unicode 17.0 does name it, as HH (and adds more code points to it). So this commit changes the name to HH, in preparation for 17.0.
This includes updates to a few perl files that need to know the current Unicode version, and regenerates the perl files that depend on the Unicode data.
This adds full support for this latest version of Unicode. What was essentially missing was updating the rules for the break properties, like \b{wb}. This is always a pain, but the changes made for 15.1 and 16.0 made it much easier.
This also fixes two bugs in the 16.0 support that showed up in the 17.0 test files but apply to 16.0 as well.