UnicodeSet/property tools: Script_Extensions missing characters #192

markusicu · 2022-01-20T03:34:32Z

Reported as https://unicode-org.atlassian.net/browse/ICU-21892 but ICU UnicodeSet implements scx as intended (see the ticket comments).

In the JSPs, [:scx=Deva:] does not contain Danda and Double Danda, and [:scx=Beng:] does not contain Bangla digits.

Example: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7Bsc%3DBengali%7D%5D+-+%5B%5Cp%7Bscx%3DBengali%7D%5D&c=on&g=gc&i=

markusicu · 2022-02-07T22:23:31Z

Maybe the tool is missing the special logic for scx to use "contains" not "equals".

Possible proof:
This shows the Bengali digits: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ascx%3DBengali%2CChakma%2CSyloti_Nagri%3A%5D&g=&i=
This shows that the value with a different order of scripts is not recognized, and prints the known values: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ascx%3DChakma%2CBengali%2CSyloti_Nagri%3A%5D&g=&i=

Note that multi-script sets are printed with commas but no spaces between scripts.

Co-debug with @macchiati

Other useful links:

Compare sets: https://util.unicode.org/UnicodeJsps/unicodeset.jsp?a=[:sc=Beng:]&b=[:scx=Beng:]

"Vedic" characters with scx info: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AName%3D%2FVEDIC%2F%3A%5D&g=&i=scx

Another indication that scx=Beng does not work right: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%3Asc%3Dbeng%3A%5D%5B%3Ascx%3Dbeng%3A%5D%5D&g=&i=sc+scx

markusicu · 2022-02-07T22:46:03Z

Related: Mark's code refactoring idea in issue #195

echeran · 2022-08-17T21:54:01Z

Note: JSP UnicodeSet lookups for gc=__ where the value is a multi-category value (e.g. L, C, ...) currently work, so there is already some special handling somewhere on a per-property basis in the JSPs code. Script_Extensions is another case where the = operator isn't a strict equality but rather has a special meaning specific to the property. What is done here should be a model for (or extensible to) a bug for supporting the Age property (#54).
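
To illustrate the kind of per-property handling meant here, below is a minimal Python sketch of how gc=L can match every letter subcategory. It is not the JSP code. The helper name `matches_gc` is hypothetical, and the sketch assumes that a one-letter value stands for a group of two-letter categories; it does not handle other group values such as LC.

```python
import unicodedata

def matches_gc(ch: str, value: str) -> bool:
    """Return True if the character's General_Category matches `value`.

    Assumption for this sketch: a one-letter value (L, C, N, ...) is a
    group that covers every two-letter category starting with that letter.
    """
    category = unicodedata.category(ch)  # e.g. "Lu" for "A"
    if len(value) == 1:
        return category.startswith(value)
    return category == value
```

With this reading, `matches_gc("A", "L")` and `matches_gc("A", "Lu")` both hold, while `matches_gc("A", "Ll")` does not.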

macchiati · 2022-09-28T20:39:01Z

It isn't specific to the property; rather this is the case for any multivalued property.

\p{prop=abc} is equivalent to "the set of all characters X such that prop(X) ∋ abc".

For single-valued properties, the interpretation is identical (treating the single value as a singleton set).
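
A minimal Python sketch of the difference between the two readings, assuming a tiny hand-entered Script_Extensions table (toy data for illustration, not read from the UCD). The names `SCX`, `match_equals`, and `match_contains` are hypothetical and are not taken from the tools.

```python
# Toy Script_Extensions data, hand-entered for illustration only.
SCX = {
    0x0995: frozenset({"Beng"}),                  # BENGALI LETTER KA
    0x09E6: frozenset({"Beng", "Cakm", "Sylo"}),  # BENGALI DIGIT ZERO
    0x3031: frozenset({"Hira", "Kana"}),          # VERTICAL KANA REPEAT MARK
}

def match_equals(table, value):
    """'Equals' reading: the whole value set must be exactly {value}."""
    return {cp for cp, values in table.items() if values == frozenset({value})}

def match_contains(table, value):
    """'Contains' reading: prop(X) must contain value."""
    return {cp for cp, values in table.items() if value in values}
```

Under the "equals" reading, `match_equals(SCX, "Beng")` drops U+09E6, which mirrors the missing Bangla digits reported above. Under the "contains" reading, `match_contains(SCX, "Beng")` includes it, and a single-valued entry such as U+0995 matches either way.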

en0ent1ty · 2023-06-03T23:59:19Z

+1. I ran into this problem with the Katakana and Hiragana scripts versus the "Hiragana,Katakana" script extension. It was very confusing and misleading until I figured out the bug. This should be given more priority: a lot of people trust official tools as a source of truth, so this can spread misinformation about how Unicode Script_Extensions works and how it should be implemented.

The characters listed by \p{Script_Extensions=Hiragana,Katakana} are expected to show up on \p{Script_Extensions=Katakana} and \p{Script_Extensions=Hiragana} but yet they are not listed and instead behave like the regular Script property, completely nullifying the point of the Script_Extensions property.

The https://util.unicode.org/UnicodeJsps/unicodeset.jsp tool has the same problem and shows it more clearly: compare \p{Script=Katakana} to \p{Script_Extensions=Katakana}. They should NOT be equal, yet the tool shows them as identical.

Interestingly enough the Regex tool understands Script_Extensions correctly as seen here:

The U+3031 character 〱 is a "Hiragana,Katakana" Script_Extensions character.

For reference, UTS18 correctly describes the expected behavior here: #Script_Property, including an example very similar to my own.

markusicu · 2023-08-22T21:01:40Z

> +1 ran into this problem with the Katana and Hiragana scripts vs "Hiragana,Katakana" script extension, ...
>
> The characters listed by \p{Script_Extensions=Hiragana,Katakana} are expected to show up on \p{Script_Extensions=Katakana} and \p{Script_Extensions=Hiragana} but yet they are not listed and instead behave like the regular Script property, completely nullifying the point of the Script_Extensions property.

Actually, that particular syntax is neither documented nor intentionally supported. If you want the union of two scx values, then you need to use union syntax to do so, as in \p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}.

markusicu added the bug (Something isn't working) and util (for the https://util.unicode.org website) labels Jan 20, 2022

markusicu assigned echeran Feb 7, 2022

markusicu assigned macchiati and unassigned echeran Sep 28, 2022

markusicu mentioned this issue Nov 27, 2023

Fix JSP failures with scx #615

Merged

macchiati closed this as completed in #615 Nov 28, 2023

eggrobin mentioned this issue Jan 13, 2024

Fix the handling of multivalued properties in IndexUnicodeProperties and UnicodeProperty #648

Merged
