UnicodeSet/property tools: Script_Extensions missing characters #192
Comments
Maybe the tool is missing the special logic for scx to use "contains" not "equals".
Possible proof: Note that multi-script sets are printed with commas but no spaces between scripts.
Co-debug with @macchiati
Other useful links:
Compare sets: https://util.unicode.org/UnicodeJsps/unicodeset.jsp?a=[:sc=Beng:]&b=[:scx=Beng:]
"Vedic" characters with scx info: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AName%3D%2FVEDIC%2F%3A%5D&g=&i=scx
Another indication that scx=Beng does not work right: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%3Asc%3Dbeng%3A%5D%5B%3Ascx%3Dbeng%3A%5D%5D&g=&i=sc+scx
Related: Mark's code refactoring idea in issue #195
Note: JSP UnicodeSet lookups for |
It isn't specific to this property; rather, this is the case for any multivalued property. \p{prop=abc} is equivalent to "the set of all characters X such that prop(X) ∋ abc". For single-valued properties, the interpretation is identical (treating the single value as a singleton set).
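The "contains" reading described above can be sketched in a few lines of Python. This is only an illustration: the `SCX` table is a hand-written, abbreviated sample rather than the full UCD data, and the function names `match_contains` and `match_equals` are hypothetical, not part of any Unicode tool or ICU API.

```python
# Toy data: code point -> set of Script_Extensions values.
# Assumption: abbreviated and illustrative, not the complete UCD value lists.
SCX = {
    0x0915: {"Deva"},                  # DEVANAGARI LETTER KA (single value)
    0x0964: {"Beng", "Deva", "Gujr"},  # DEVANAGARI DANDA (subset of the real list)
    0x3031: {"Hira", "Kana"},          # VERTICAL KANA REPEAT MARK
}


def match_contains(value):
    """\\p{scx=value} read as: all X such that value is in scx(X)."""
    return {cp for cp, values in SCX.items() if value in values}


def match_equals(value):
    """The suspected buggy reading: scx(X) must be exactly {value}."""
    return {cp for cp, values in SCX.items() if values == {value}}


# Under "contains", the danda is part of the Deva set; under "equals" it is lost.
print(sorted(hex(cp) for cp in match_contains("Deva")))
print(sorted(hex(cp) for cp in match_equals("Deva")))
```

For a single-valued entry such as U+0915 both readings agree, which matches the singleton-set remark above; they diverge only for characters with more than one value.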
+1. I ran into this problem with the Katakana and Hiragana scripts versus the "Hiragana,Katakana" script extension. It was very confusing and misleading until I figured out the bug. This should be given more priority: a lot of people trust official tools as a source of truth, so this can spread misinformation about how Unicode Script_Extensions works and how it should be implemented. The characters listed by The https://util.unicode.org/UnicodeJsps/unicodeset.jsp tool also has the same problem and can display it more clearly, simply by comparing Interestingly enough, the Regex tool understands Script_Extensions correctly, as seen here:
The U+3031 character For reference, UTS #18 correctly describes the expected behavior here: #Script_Property, including an example very similar to my own.
Actually, that particular syntax is neither documented nor intentionally supported. If you want the union of two scx values, then you need to use union syntax to do so, as in |
Reported as https://unicode-org.atlassian.net/browse/ICU-21892 but ICU UnicodeSet implements scx as intended (see the ticket comments).
In the JSPs, [:scx=Deva:] does not contain Danda and Double Danda, and [:scx=Beng:] does not contain Bangla digits.
Example: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7Bsc%3DBengali%7D%5D+-+%5B%5Cp%7Bscx%3DBengali%7D%5D&c=on&g=gc&i=
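The example link computes the difference between the sc=Bengali set and the scx=Bengali set. A minimal sketch of that check follows, assuming a tiny hand-written sample of sc and scx values (not the full UCD) and a hypothetical helper `sc_minus_scx`. With "contains" semantics the difference is empty for this sample; with "equals" semantics a Bangla digit leaks into it, which is the symptom reported here.

```python
# Assumption: small illustrative sample, not the complete UCD data.
SC = {
    0x0995: "Beng",  # BENGALI LETTER KA
    0x09E6: "Beng",  # BENGALI DIGIT ZERO
    0x0964: "Zyyy",  # DEVANAGARI DANDA (Script=Common)
}
SCX = {
    0x0995: {"Beng"},
    0x09E6: {"Beng", "Cakm", "Sylo"},
    0x0964: {"Beng", "Deva"},  # subset of the real list
}


def sc_minus_scx(script, contains=True):
    """Return [\\p{sc=script}] - [\\p{scx=script}] over the sample data."""
    sc_set = {cp for cp, value in SC.items() if value == script}
    if contains:
        scx_set = {cp for cp, values in SCX.items() if script in values}
    else:
        # Suspected buggy reading: require an exact single-value match.
        scx_set = {cp for cp, values in SCX.items() if values == {script}}
    return sc_set - scx_set


print(sc_minus_scx("Beng"))                  # "contains" semantics
print(sc_minus_scx("Beng", contains=False))  # "equals" semantics
```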