UnicodeSet/property tools: Script_Extensions missing characters #192

Closed
markusicu opened this issue Jan 20, 2022 · 6 comments · Fixed by #615
Labels: bug (Something isn't working), util (for the https://util.unicode.org website)

Comments

@markusicu
Member

Reported as https://unicode-org.atlassian.net/browse/ICU-21892 but ICU UnicodeSet implements scx as intended (see the ticket comments).

In the JSPs, [:scx=Deva:] does not contain Danda and Double Danda, and [:scx=Beng:] does not contain Bangla digits.

Example: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7Bsc%3DBengali%7D%5D+-+%5B%5Cp%7Bscx%3DBengali%7D%5D&c=on&g=gc&i=

@markusicu added the bug and util labels on Jan 20, 2022
@markusicu
Member Author

markusicu commented Feb 7, 2022

Maybe the tool is missing the special logic for scx to use "contains" not "equals".

Possible proof:
This shows the Bengali digits: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ascx%3DBengali%2CChakma%2CSyloti_Nagri%3A%5D&g=&i=
This shows that the value with a different order of scripts is not recognized, and prints the known values: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ascx%3DChakma%2CBengali%2CSyloti_Nagri%3A%5D&g=&i=

Note that multi-script sets are printed with commas but no spaces between scripts.
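To illustrate the suspected difference, here is a minimal Python sketch (not the actual JSP code). It assumes the tool stores each scx value as the comma-joined string it prints; the table `SCX_VALUES` and the helper names are hypothetical and exist only for this example.

```python
# Illustrative sample only: scx values in the comma-joined form the tool prints.
SCX_VALUES = {
    "\u09E6": "Bengali,Chakma,Syloti_Nagri",  # BENGALI DIGIT ZERO (value from the links above)
    "\u0995": "Bengali",                      # BENGALI LETTER KA (assumed single-script value)
}


def matches_equals(value: str, query: str) -> bool:
    """Suspected current behavior: compare the whole value string."""
    return value == query


def matches_contains(value: str, query: str) -> bool:
    """Intended behavior: split the value into scripts and test membership."""
    return query in value.split(",")


def select(predicate, query: str) -> set:
    """Return the sample characters whose scx value satisfies the predicate."""
    return {ch for ch, value in SCX_VALUES.items() if predicate(value, query)}


# "equals" misses the digit for scx=Bengali; "contains" finds it.
print(sorted(select(matches_equals, "Bengali")))
print(sorted(select(matches_contains, "Bengali")))
# "equals" also depends on the order in which the scripts are written.
print(select(matches_equals, "Chakma,Bengali,Syloti_Nagri"))
```

Under this assumption, whole-string comparison would explain both observations: the digits only appear when the full value is given, and only in the exact order the tool prints.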

Co-debug with @macchiati

Other useful links:

Compare sets: https://util.unicode.org/UnicodeJsps/unicodeset.jsp?a=[:sc=Beng:]&b=[:scx=Beng:]

"Vedic" characters with scx info: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AName%3D%2FVEDIC%2F%3A%5D&g=&i=scx

Another indication that scx=Beng does not work right: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%3Asc%3Dbeng%3A%5D%5B%3Ascx%3Dbeng%3A%5D%5D&g=&i=sc+scx

@markusicu
Member Author

Related: Mark's code refactoring idea in issue #195

@echeran
Contributor

echeran commented Aug 17, 2022

Note: JSP UnicodeSet lookups for gc=__ where the value is a multi-category value (e.g., L, C, ...) currently work, so there is already some special per-property handling somewhere in the JSPs code. Script_Extensions is another case where the = operator is not strict equality but has a meaning specific to the property. What is done here should be a model for (or extensible to) the bug for supporting the Age property (#54).

@macchiati
Member

It isn't specific to the property; rather this is the case for any multivalued property.

\p{prop=abc} is equivalent to "the set of all characters X such that prop(X) ∋ abc".

For single-valued properties, the interpretation is identical (treating the single value as a singleton set).
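The definition above can be sketched in a few lines of Python. This is only an illustration of the semantics, not the tool's implementation: `as_set`, `chars_with`, and the two tiny lookup tables are hypothetical, and the scx value for Danda is abbreviated (the real list contains more scripts).

```python
from typing import Callable, FrozenSet, Iterable, Set, Union

Value = Union[str, FrozenSet[str]]


def as_set(value: Value) -> FrozenSet[str]:
    """Treat a single value as a singleton set; pass sets through unchanged."""
    if isinstance(value, frozenset):
        return value
    return frozenset([value])


def chars_with(prop: Callable[[str], Value], wanted: str,
               universe: Iterable[str]) -> Set[str]:
    """\\p{prop=wanted}: all X in the universe such that prop(X) contains wanted."""
    return {x for x in universe if wanted in as_set(prop(x))}


# Illustrative data only. sc is single-valued, scx is multivalued.
SC = {
    "\u0915": "Deva",  # DEVANAGARI LETTER KA
    "\u0964": "Zyyy",  # DEVANAGARI DANDA (Script=Common)
}
SCX = {
    "\u0915": frozenset({"Deva"}),
    "\u0964": frozenset({"Deva", "Beng"}),  # abbreviated; the real list is longer
}

print(chars_with(SC.__getitem__, "Deva", SC))    # single-valued lookup
print(chars_with(SCX.__getitem__, "Deva", SCX))  # multivalued lookup, includes Danda
```

The same `chars_with` function serves both kinds of property, which is the point of treating a single value as a singleton set.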

@markusicu assigned macchiati and unassigned echeran on Sep 28, 2022
@en0ent1ty

en0ent1ty commented Jun 3, 2023

+1. I ran into this problem with the Katakana and Hiragana scripts versus the "Hiragana,Katakana" script extension. It was very confusing and misleading until I figured out the bug. This should be given more priority: a lot of people trust official tools as a source of truth, so this can spread misinformation about how Unicode Script_Extensions works and how it should be implemented.

The characters listed by \p{Script_Extensions=Hiragana,Katakana} are expected to show up in \p{Script_Extensions=Katakana} and \p{Script_Extensions=Hiragana}, yet they are not listed. The tool instead behaves as it does for the regular Script property, which defeats the point of the Script_Extensions property.

The https://util.unicode.org/UnicodeJsps/unicodeset.jsp tool has the same problem and shows it more clearly: compare \p{Script=Katakana} to \p{Script_Extensions=Katakana}. They should NOT be equal, yet the tool shows them as identical.

Interestingly enough the Regex tool understands Script_Extensions correctly as seen here:

The U+3031 character is a "Hiragana,Katakana" Script_Extensions character.

For reference, UTS #18 correctly describes the expected behavior here: #Script_Property, including an example very similar to my own.
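As a rough sketch of the expected result, the Python below compares the two queries over a three-character sample. The table `SAMPLE` and the helper names are made up for this example; only the values for U+3031 come from the discussion above.

```python
# Illustrative sample: character -> (Script, Script_Extensions)
SAMPLE = {
    "\u30A2": ("Kana", {"Kana"}),          # KATAKANA LETTER A
    "\u3042": ("Hira", {"Hira"}),          # HIRAGANA LETTER A
    "\u3031": ("Zyyy", {"Hira", "Kana"}),  # VERTICAL KANA REPEAT MARK (Script=Common)
}


def by_script(script: str) -> set:
    """\\p{Script=...}: the single Script value must equal the query."""
    return {ch for ch, (sc, _) in SAMPLE.items() if sc == script}


def by_script_extensions(script: str) -> set:
    """\\p{Script_Extensions=...}: the scx set must contain the query."""
    return {ch for ch, (_, scx) in SAMPLE.items() if script in scx}


# Characters that only the scx query should return.
only_in_scx = by_script_extensions("Kana") - by_script("Kana")
print([f"U+{ord(ch):04X}" for ch in sorted(only_in_scx)])
```

On this sample the two sets differ by exactly U+3031, which is the difference the unicodeset.jsp comparison fails to show.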

@markusicu
Member Author

> +1 ran into this problem with the Katakana and Hiragana scripts vs "Hiragana,Katakana" script extension, ...
>
> The characters listed by \p{Script_Extensions=Hiragana,Katakana} are expected to show up on \p{Script_Extensions=Katakana} and \p{Script_Extensions=Hiragana} but yet they are not listed and instead behave like the regular Script property, completely nullifying the point of the Script_Extensions property.

Actually, that particular syntax is neither documented nor intentionally supported. If you want the union of two scx values, then you need to use union syntax to do so, as in \p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}.
