Ucdxml and TR42 #859

jowilco · 2024-06-06T23:38:16Z

PR to make it easy to see what changes have been made to support UCDXML.

jowilco · 2024-10-16T21:14:44Z

Comment on June 6 is no longer valid - we're now ready for review.

jowilco · 2024-10-16T21:16:42Z

@macchiati @eggrobin @markusicu - Please can you review?

eggrobin · 2024-11-12T10:44:54Z

unicodetools/src/main/resources/org/unicode/props/IndexUnicodeProperties.txt

@@ -310,6 +313,15 @@ Unihan_Variants ; kSpoofingVariant
 Unihan_Variants ; kTraditionalVariant
 Unihan_Variants ; kZVariant


Should there be a line for kZhuang here? (In other words, are you getting any data for kZhuang?)

jowilco · 2024-11-12

The current version of UCDXML does not support kZhuang, just kZhuangNumeric.
Similar to Unikemet, we should add support either for the revised 16.0 UCDXML files, or for 17.

eggrobin · 2024-11-12T10:47:52Z

unicodetools/src/main/resources/org/unicode/props/ExtraPropertyAliases.txt

+cjkRSTUnicode ; kRSTUnicode
+cjkReading ; kReading
+cjkSrc_NushuDuben ; kSrc_NushuDuben
+cjkTGT_MergedSrc ; kTGT_MergedSrc


Please revert this change: Tangut and Nüshu are not CJK, they should not have a cjk alias.

The name kReading is unfortunate (since this is really Nüshu-specific), but it is what it is.
I guess you should add the comment that I should have added saying that these are the fields from the Tangut and Nüshu source files.

eggrobin · 2024-11-27T15:01:15Z

unicodetools/src/main/java/org/unicode/xml/AttributeResolver.java

+                    default:
+                        throw new RuntimeException("Missing Catalog case");
+                }
+            case Enumerated:


This (and the associated pile of UnicodeMaps) seems like it is going to be a bit annoying to maintain as we add properties.
Is there a reason why you are not doing something like

    final UnicodeProperty property = indexUnicodeProperties.getProperty(prop);
    final List<String> valueAliases = property.getValueAliases(property.getValue(codepoint));
    return valueAliases.size() == 1 ? valueAliases.get(0) : valueAliases.get(1);

for most of them (special-casing Decomposition_Type etc. as needed)?

jowilco · 2024-12-16

I think I was assuming that there were going to be more special cases, but I agree that, as there are not, your solution is better. Implemented.
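The selection rule in the suggestion above can be illustrated in isolation. The following is only a minimal sketch: `ShortAliasSketch` and `preferredAlias` are hypothetical names, the alias lists are placeholder strings, and nothing here calls the real `UnicodeProperty` API or makes any claim about the order in which it returns aliases. It only shows the "sole alias if there is one, otherwise the second entry" logic, plus a guard for an empty list that the one-liner does not have.

```java
import java.util.List;

// Hypothetical illustration of the alias-selection rule discussed above.
// It does not use the unicodetools classes; the lists stand in for whatever
// property.getValueAliases(...) would return.
class ShortAliasSketch {

    // Returns the only alias when there is exactly one, otherwise the second entry.
    public static String preferredAlias(List<String> valueAliases) {
        if (valueAliases.isEmpty()) {
            // Assumption: an empty alias list is an error worth surfacing early.
            throw new IllegalArgumentException("No value aliases available");
        }
        return valueAliases.size() == 1 ? valueAliases.get(0) : valueAliases.get(1);
    }

    public static void main(String[] args) {
        System.out.println(preferredAlias(List.of("aliasA", "aliasB"))); // prints "aliasB"
        System.out.println(preferredAlias(List.of("onlyAlias")));        // prints "onlyAlias"
    }
}
```

Keeping the rule in one helper means that properties needing different treatment (Decomposition_Type was mentioned as a possible special case) can be handled before falling through to it.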

macchiati · 2024-11-27T22:03:17Z

No good reason. I think the code might predate the formal recognition as properties.


markusicu

Hi John, I took a peek -- not yet at the new .java files... -- and we discussed some high-level things in today's Unicode Tools meeting.

Some comments below.
For publication, we are thinking that we continue to copy two of the grouped files into the repo, but run the tool as part of a publication step, so that we don't check in all of the large, highly redundant files.

markusicu · 2024-12-10T19:38:12Z

unicodetools/src/main/resources/org/unicode/props/IndexPropertyRegex.txt

@@ -204,5 +215,5 @@ Confusable_MA			;	SINGLE_VALUED	;				$codePoints
 #Emoji					;	SINGLE_VALUED	;				<enum>
 #Emoji_Presentation		;	SINGLE_VALUED	;				<enum>
 #Emoji_Modifier			;	SINGLE_VALUED	;				<enum>
-#Emoji_Modifier_Base		;	SINGLE_VALUED	;				<enum>
+#Emoji_Modifier_Base 	;	SINGLE_VALUED	;				<enum>


All other lines get their indentation fixed, but this one gets it un-fixed...?

jowilco · 2024-12-16

I've now replaced all the tab characters with spaces so that this issue no longer appears in editors with different tab settings.

markusicu · 2024-12-10T22:12:19Z

uax/uax42/Readme.md

Documentation should go into https://github.com/unicode-org/unicodetools/tree/main/docs,
one of

- some existing file that covers UCDXML (not sure if there is one)
- a new ucdxml.md there
- a new index.md in a new ucdxml/ folder there

jowilco · 2024-12-16

Moved to docs/ucdxml.md

markusicu · 2024-12-10T22:14:20Z

uax/uax42/fragments/block/block.xml

The Unicode Tools normally work with internal data files stored in https://github.com/unicode-org/unicodetools/tree/main/unicodetools/src/main/resources/org/unicode

Is it necessary to create a new, separate folder structure outside of that?
Why not add a ucdxml/ folder in the usual place?

jowilco · 2024-12-16

Moved to resources/org/unicode/uax42/

markusicu · 2024-12-10T22:17:30Z

uax/uax42/fragments/boolean/boolean.xml

There are many little folders with one file each, and one folder with three. Is that necessary or useful? Can we flatten these all into one folder?

Some folders like "properties" are ok, but adding lots of mini folders seems cumbersome.

markusicu · 2024-12-10T22:17:47Z

uax/uax42/fragments/datatypes/code points.xml

Please no spaces in file names

markusicu · 2024-12-10T22:23:25Z

uax/uax42/Readme.md

+
+## Step 1 - Generate property value fragments
+
+- Run org.unicode.xml.GeneratePropertyValues to populate the UNICODETOOLS_REPO_DIR/uax/uax42/fragments/ folder.


This will want to be a specific, reproducible mvn command line.

jowilco · 2024-12-16

Done, but please check to see if this is what you were thinking.

markusicu · 2024-12-10T22:24:09Z

uax/uax42/Readme.md

+
+## Step 2 - Generate TR42 index.html and index.rnc 
+
+- In UNICODETOOLS_REPO_DIR/uax/uax42/ run `mvn xml:transform`


mvn command lines should run from the root as usual, see https://github.com/unicode-org/unicodetools/blob/main/docs/build.md for examples.

jowilco · 2024-12-16

I think I have what you want, but...

markusicu · 2024-12-10T22:26:08Z

uax/uax42/Readme.md

+
+- In UNICODETOOLS_REPO_DIR/uax/uax42/ run `mvn xml:transform`
+
+  index.html and index.rnc will be generated in UNICODETOOLS_REPO_DIR/uax/uax42/output/


The output should go into a folder under UNICODETOOLS_GEN_DIR as usual, such as UNICODETOOLS_GEN_DIR/ucdxml/17.0.0/

jowilco · 2024-12-16

I think I have what you want, but...

markusicu · 2024-12-10T22:26:31Z

uax/uax42/Readme.md

+1. Clone and build [jing-trang](https://github.com/relaxng/jing-trang)
+2. Run the following:
+    ```
+   java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file>


I think that we should probably discuss what is in-scope and what is out-of-scope for the process and the utility. The deliverables are:

- The UCDXML files
- The TR42 HTML file
- The RNC file

The fragment files are generated, but are then consumed by the TR42 HTML file and the RNC file. I think that these should be in the repo, especially as some of them are created manually.

Should any of the deliverables be stored in unicodetools?
If we are planning to validate the UCDXML files using the RNC files as part of the build, we'll need a process that incorporates all steps. Should that be considered a "unit test"?

markusicu · 2024-12-10T22:27:45Z

uax/uax42/Readme.md

+    ```
+   java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file>
+   ```
+   Note that the UAX xml file has to be saved as NFD as the Unihan syntax regular expressions are expecting NFD.


Does the tool generate the data files in NFD?
It seems like the files should come out as needed for the tool chain as well as for publication.

jowilco · 2024-12-16

The data files are not NFD by default, and I'm not sure that we should change the format for publication.
The rest of your question ties back to my previous comment: we could add a step to create an NFD version of the UCDXML files as part of an end-to-end process. However, is that in scope?
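If an NFD step were added to the process, it could look roughly like the following. This is a minimal sketch under stated assumptions, not part of the PR: `NfdCopySketch`, `toNfd` and `writeNfdCopy` are hypothetical names, the paths come from the command line, and it uses the JDK's `java.text.Normalizer`, whose Unicode data depends on the JDK release and may lag behind the version being published (ICU4J might be a better fit for that reason). It also reads the whole file into memory, which may be impractical for the largest UCDXML files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;

// Hypothetical helper: writes an NFD-normalized copy of a UTF-8 text file,
// leaving the original (publication) file untouched.
class NfdCopySketch {

    // Normalizes a string to NFD using the JDK's built-in normalizer.
    public static String toNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    // Reads source as UTF-8, normalizes it to NFD, and writes the result to target.
    public static void writeNfdCopy(Path source, Path target) throws IOException {
        String text = Files.readString(source, StandardCharsets.UTF_8);
        Files.writeString(target, toNfd(text), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: NfdCopySketch <source.xml> <target.xml>");
            return;
        }
        writeNfdCopy(Path.of(args[0]), Path.of(args[1]));
    }
}
```

Writing to a separate target keeps the published files in their current form, so the NFD copy would exist only as an input to validation.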

jowilco added 5 commits June 26, 2024 13:48

Rebase

b0656d8

Initial checkin for UcdXML

3ce611a

Interim checkin: implemented groups

0ba5996

Rebase

7764f6c

Ran GenerateEnums

7e161a6

jowilco force-pushed the ucdxml branch from 066326e to 7e161a6 Compare June 26, 2024 21:17

jowilco added 8 commits June 26, 2024 14:22

Fixing a broken rebase

d609d92

Fixing a broken rebase

cb314e8

Added support for comparing different ucdxml files

776e00e

Ran spotless

8b870a6

Added support for the generation of UAX42

d612e96

Added note about NFD

f552e63

Spotless code cleanup

242f22b

Merge branch 'unicode-org:main' into ucdxml

e625ff0

jowilco requested review from macchiati, eggrobin and markusicu October 16, 2024 21:11

jowilco changed the title ~~Ucdxml preview~~ Ucdxml and TR42 Oct 16, 2024

jowilco marked this pull request as ready for review October 16, 2024 21:13

eggrobin reviewed Nov 12, 2024

Implemented review comments from eggrobin

6ee2467

eggrobin reviewed Nov 27, 2024

markusicu reviewed Dec 10, 2024

markusicu requested a review from echeran December 10, 2024 22:30

Updates from Marcus's review comments

dbb5dd3
