Ucdxml and TR42 #859

Open
wants to merge 15 commits into
base: main
1 change: 1 addition & 0 deletions .gitignore
@@ -43,6 +43,7 @@ perf-*.xml
test-*.xml

# Directories
.idea/
.settings/
.vs/
.vscode/
24 changes: 24 additions & 0 deletions uax/uax42/Readme.md
Member

Documentation should go into https://github.com/unicode-org/unicodetools/tree/main/docs,
one of

  • some existing file that covers UCDXML (not sure if there is one)
  • a new ucdxml.md there
  • a new index.md in a new ucdxml/ folder there

Author

Moved to docs/ucdxml.md

@@ -0,0 +1,24 @@
# Generating TR42

## Step 1 - Generate property value fragments

- Run org.unicode.xml.GeneratePropertyValues to populate the UNICODETOOLS_REPO_DIR/uax/uax42/fragments/ folder.
Member

This should be a specific, reproducible mvn command line.

Author

Done, but please check to see if this is what you were thinking.


## Step 2 - Generate TR42 index.html and index.rnc

- In UNICODETOOLS_REPO_DIR/uax/uax42/ run `mvn xml:transform`
Member

mvn command lines should run from the root as usual, see https://github.com/unicode-org/unicodetools/blob/main/docs/build.md for examples.

Author

I think I have what you want, but...


index.html and index.rnc will be generated in UNICODETOOLS_REPO_DIR/uax/uax42/output/
Member

The output should go into a folder under UNICODETOOLS_GEN_DIR as usual, such as UNICODETOOLS_GEN_DIR/ucdxml/17.0.0/

Author

I think I have what you want, but...


## Step 3 - Validate generated UAX XML files

You'll need a [RELAX NG](https://relaxng.org/) schema validator. We'll use [jing-trang](https://github.com/relaxng/jing-trang) in this example.
Member

Is this the recommended tool for the job? There has been no release and no commit in 2.5 years. Abandoned?


1. Clone and build [jing-trang](https://github.com/relaxng/jing-trang)
Member

This does not look workable. Dependencies should be handled by Maven via the pom.xml file, which I assume would pull the Trang artifacts from https://mvnrepository.com/artifact/org.relaxng/trang

It should be possible to run one Maven command that builds and runs the tool.

2. Run the following:
```
java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file>
Member

mvn ...

Author

I think that we should probably discuss what is in-scope and what is out-of-scope for the process and the utility. The deliverables are:

  1. The UCDXML files
  2. The TR42 HTML file
  3. The RNC file

The fragment files are generated, but are then consumed by the TR42 HTML file and the RNC file. I think that these should be in the repo, especially as some of them are created manually.

Should any of the deliverables be stored in unicodetools?
If we are planning to validate the UCDXML files using the RNC files as part of the build, we'll need a process that incorporates all steps. Should that be considered a "unit test"?

```
Note that the UAX XML file must be saved in NFD, because the Unihan syntax regular expressions expect NFD.
Member

Does the tool generate the data files in NFD?
It seems like the files should come out as needed for the tool chain as well as for publication.

Author

The data files are not NFD by default, and I'm not sure that we should change the format for publication.
The rest of your question ties back to my previous comment; we could add a step to create an NFD version of the UCDXML files as part of an end-to-end process. However, is that in scope?
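
For illustration only, the NFD step discussed in this thread could be sketched as follows. This is a minimal Python sketch, not part of the PR or of unicodetools; the helper names `to_nfd` and `write_nfd_copy` are hypothetical. It writes a normalized copy and leaves the published file unchanged. One caveat: Python's `unicodedata` module uses the Unicode version bundled with the interpreter, which may be older than the UCD version being generated.

```python
import unicodedata
from pathlib import Path


def to_nfd(text: str) -> str:
    """Return the text in Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", text)


def write_nfd_copy(source: Path, target: Path) -> None:
    """Write an NFD-normalized copy of a UTF-8 file (hypothetical helper).

    Bytes are read and written directly so that line endings are preserved.
    """
    text = source.read_bytes().decode("utf-8")
    target.write_bytes(to_nfd(text).encode("utf-8"))
```

Writing to a separate target keeps the non-NFD file available for publication while the validator reads the NFD copy.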


10 changes: 10 additions & 0 deletions uax/uax42/fragments/block/block.xml
Member

The Unicode Tools normally work with internal data files stored in https://github.com/unicode-org/unicodetools/tree/main/unicodetools/src/main/resources/org/unicode

Is it necessary to create a new, separate folder structure outside of that?
Why not add a ucdxml/ folder in the usual place?

Author

Moved to resources/org/unicode/uax42/

@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="blocks" id='schema.block'>
ucd.content &amp;=
element blocks {
element block {
attribute first-cp { single-code-point },
attribute last-cp { single-code-point },
attribute name { text } }+ }?
</ucdxml:block>
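
To make the shape of this fragment concrete, here is a small Python sketch that reads a `blocks` element matching the pattern above. The sample document, the `ucd` root element, and the function name `read_blocks` are assumptions for illustration; the namespace URI is the one declared in the namespace fragment of this PR.

```python
import xml.etree.ElementTree as ET

# Namespace as declared in fragments/namespace/namespace.xml.
UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Hand-written sample instance; the <ucd> root element is an assumption.
SAMPLE = f"""<ucd xmlns="{UCD_NS}">
  <blocks>
    <block first-cp="0000" last-cp="007F" name="Basic Latin"/>
    <block first-cp="0080" last-cp="00FF" name="Latin-1 Supplement"/>
  </blocks>
</ucd>"""


def read_blocks(xml_text: str) -> list[tuple[int, int, str]]:
    """Return (first, last, name) for each block element, with code points as ints."""
    root = ET.fromstring(xml_text)
    path = f"{{{UCD_NS}}}blocks/{{{UCD_NS}}}block"
    return [
        (int(b.get("first-cp"), 16), int(b.get("last-cp"), 16), b.get("name"))
        for b in root.iterfind(path)
    ]
```

Calling `read_blocks(SAMPLE)` yields one tuple per `block` element, in document order.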
4 changes: 4 additions & 0 deletions uax/uax42/fragments/boolean/boolean.xml
Member

There are many little folders with one file each, and one folder with three. Is that necessary or useful? Can we flatten these all into one folder?

Some folders like "properties" are ok, but adding lots of mini folders seems cumbersome.

Author

Done

@@ -0,0 +1,4 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="boolean" id='schema.boolean'>
boolean = "Y" | "N"
</ucdxml:block>
10 changes: 10 additions & 0 deletions uax/uax42/fragments/cjk-radicals/cjk-radicals.xml
@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="cjk radicals" id='schema.cjk-radicals'>
ucd.content &amp;=
element cjk-radicals {
element cjk-radical {
attribute number { xsd:string {pattern="[0-9]{1,3}'{0,3}"}},
attribute radical { single-code-point? },
attribute ideograph { single-code-point } }+ }?
</ucdxml:block>
9 changes: 9 additions & 0 deletions uax/uax42/fragments/datatypes/code points.xml
Member

Please no spaces in file names

Author

Done

@@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="datatype for code points" id='schema.datatypes'>
single-code-point = xsd:string { pattern = "(|[1-9A-F]|(10))[0-9A-F]{4}" }

one-or-more-code-points = list { single-code-point + }
zero-or-more-code-points = list { single-code-point * }
two-code-points = list { single-code-point, single-code-point }
</ucdxml:block>
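
As a rough check of what the `single-code-point` pattern accepts, the following Python sketch applies the same regular expression. The function names are hypothetical. XSD patterns match the whole value, so `fullmatch` is used here; the check is purely syntactic, so surrogate values such as `D800` also pass.

```python
import re

# Same expression as the single-code-point datatype above.
SINGLE_CODE_POINT = re.compile(r"(|[1-9A-F]|(10))[0-9A-F]{4}")


def is_single_code_point(value: str) -> bool:
    """True if value is 4 to 6 uppercase hex digits in the range 0000..10FFFF."""
    return SINGLE_CODE_POINT.fullmatch(value) is not None


def parse_code_points(value: str) -> list[int]:
    """Parse a whitespace-separated value of type one-or-more-code-points."""
    tokens = value.split()
    if not tokens or not all(is_single_code_point(t) for t in tokens):
        raise ValueError(f"not a list of code points: {value!r}")
    return [int(t, 16) for t in tokens]
```

Values with a leading zero beyond four digits (`00041`), lowercase hex digits, or anything above `10FFFF` are rejected.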
5 changes: 5 additions & 0 deletions uax/uax42/fragments/datatypes/datatypes.xml
@@ -0,0 +1,5 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="datatypes declaration" id='schema.datatypes'>
# default; datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"
</ucdxml:block>
5 changes: 5 additions & 0 deletions uax/uax42/fragments/datatypes/jis-code-point.xml
@@ -0,0 +1,5 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="datatype for code points" id='schema.datatypes'>
jis-code-point = xsd:string { pattern = "[0-9A-F]{4}" }
</ucdxml:block>
6 changes: 6 additions & 0 deletions uax/uax42/fragments/description/description.xml
@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="description" id='schema.description'>
ucd.content &amp;=
element description { text }?
</ucdxml:block>
22 changes: 22 additions & 0 deletions uax/uax42/fragments/do-not-emit/do-not-emit.xml
@@ -0,0 +1,22 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="do-not-emit" id='schema.do-not-emit'>
ucd.content &amp;=
element do-not-emit {
element instead {
attribute of { one-or-more-code-points },
attribute use { one-or-more-code-points },
attribute because { "Bengali_Khanda_Ta"
| "Deprecated"
| "Discouraged"
| "Dotless_Form"
| "Hamza_Form"
| "Indic_Atomic_Consonant"
| "Indic_Consonant_Conjunct"
| "Indic_Vowel_Letter"
| "Malayalam_Chillu"
| "Precomposed_Form"
| "Precomposed_Hieroglyph"
| "Preferred_Spelling"
| "Tamil_Shrii"
} }+ }?
</ucdxml:block>
20 changes: 20 additions & 0 deletions uax/uax42/fragments/emoji-data/Emoji.xml
@@ -0,0 +1,20 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="Emoji properties" id='schema.emoji-data'>
code-point-attributes &amp;=
attribute Emoji { boolean }?

code-point-attributes &amp;=
attribute EPres { boolean }?

code-point-attributes &amp;=
attribute EMod { boolean }?

code-point-attributes &amp;=
attribute EBase { boolean }?

code-point-attributes &amp;=
attribute EComp { boolean }?

code-point-attributes &amp;=
attribute ExtPict { boolean }?
</ucdxml:block>
11 changes: 11 additions & 0 deletions uax/uax42/fragments/emoji-sources/emoji-sources.xml
@@ -0,0 +1,11 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="emoji sources" id='schema.emoji-sources'>
ucd.content &amp;=
element emoji-sources {
element emoji-source {
attribute unicode { one-or-more-code-points },
attribute docomo { jis-code-point? },
attribute kddi { jis-code-point? },
attribute softbank { jis-code-point? } }+ }?
</ucdxml:block>
15 changes: 15 additions & 0 deletions uax/uax42/fragments/named-sequences/named-sequences.xml
@@ -0,0 +1,15 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="named sequences" id='schema.named-sequences'>
ucd.content &amp;=
element named-sequences {
element named-sequence {
attribute cps { one-or-more-code-points },
attribute name { text } }+ }?

ucd.content &amp;=
element provisional-named-sequences {
element named-sequence {
attribute cps { one-or-more-code-points },
attribute name { text } }+ }?
</ucdxml:block>
5 changes: 5 additions & 0 deletions uax/uax42/fragments/namespace/namespace.xml
@@ -0,0 +1,5 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="namespace declaration" id='schema.namespace'>
default namespace ucd = "http://www.unicode.org/ns/2003/ucd/1.0"
</ucdxml:block>
@@ -0,0 +1,11 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--Manual-->
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="normalization corrections" id='schema.normalization-corrections'>
ucd.content &amp;=
element normalization-corrections {
element normalization-correction {
attribute cp { single-code-point },
attribute old { one-or-more-code-points },
attribute new { one-or-more-code-points },
attribute version { text } }+ }?
</ucdxml:block>
8 changes: 8 additions & 0 deletions uax/uax42/fragments/nushu/Nushu.xml
@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="Nushu data" id='schema.nushu'>
code-point-attributes &amp;=
attribute kSrc_NushuDuben { xsd:string { pattern="[0-9]+\.[0-9]+" } }?

code-point-attributes &amp;=
attribute kReading { xsd:string }?
</ucdxml:block>
5 changes: 5 additions & 0 deletions uax/uax42/fragments/properties/Bidi_C.xml
@@ -0,0 +1,5 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="Bidi_C attribute" id='schema.properties'>
code-point-attributes &amp;=
attribute Bidi_C { boolean }?
</ucdxml:block>
5 changes: 5 additions & 0 deletions uax/uax42/fragments/properties/Bidi_M.xml
@@ -0,0 +1,5 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="Bidi_M attribute" id='schema.properties'>
code-point-attributes &amp;=
attribute Bidi_M { boolean }?
</ucdxml:block>
9 changes: 9 additions & 0 deletions uax/uax42/fragments/properties/InCB.xml
@@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="InCB attribute" id='schema.properties'>
code-point-attributes &amp;=
attribute InCB { "Consonant"
| "Extend"
| "Linker"
| "None"
}?
</ucdxml:block>
21 changes: 21 additions & 0 deletions uax/uax42/fragments/properties/InPC.xml
@@ -0,0 +1,21 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="InPC attribute" id='schema.properties'>
code-point-attributes &amp;=
attribute InPC { "Bottom"
| "Bottom_And_Left"
| "Bottom_And_Right"
| "Left"
| "Left_And_Right"
| "NA"
| "Overstruck"
| "Right"
| "Top"
| "Top_And_Bottom"
| "Top_And_Bottom_And_Left"
| "Top_And_Bottom_And_Right"
| "Top_And_Left"
| "Top_And_Left_And_Right"
| "Top_And_Right"
| "Visual_Order_Left"
}?
</ucdxml:block>
42 changes: 42 additions & 0 deletions uax/uax42/fragments/properties/InSC.xml
@@ -0,0 +1,42 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="InSC attribute" id='schema.properties'>
code-point-attributes &amp;=
attribute InSC { "Avagraha"
| "Bindu"
| "Brahmi_Joining_Number"
| "Cantillation_Mark"
| "Consonant"
| "Consonant_Dead"
| "Consonant_Final"
| "Consonant_Head_Letter"
| "Consonant_Initial_Postfixed"
| "Consonant_Killer"
| "Consonant_Medial"
| "Consonant_Placeholder"
| "Consonant_Preceding_Repha"
| "Consonant_Prefixed"
| "Consonant_Subjoined"
| "Consonant_Succeeding_Repha"
| "Consonant_With_Stacker"
| "Gemination_Mark"
| "Invisible_Stacker"
| "Joiner"
| "Modifying_Letter"
| "Non_Joiner"
| "Nukta"
| "Number"
| "Number_Joiner"
| "Other"
| "Pure_Killer"
| "Register_Shifter"
| "Reordering_Killer"
| "Syllable_Modifier"
| "Tone_Letter"
| "Tone_Mark"
| "Virama"
| "Visarga"
| "Vowel"
| "Vowel_Dependent"
| "Vowel_Independent"
}?
</ucdxml:block>
5 changes: 5 additions & 0 deletions uax/uax42/fragments/properties/JSN.xml
@@ -0,0 +1,5 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="JSN attribute" id='schema.properties'>
code-point-attributes &amp;=
attribute JSN { xsd:string { pattern="[A-Z]{0,3}" } }?
</ucdxml:block>
5 changes: 5 additions & 0 deletions uax/uax42/fragments/properties/Join_C.xml
@@ -0,0 +1,5 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="joining properties" id='schema.properties'>
code-point-attributes &amp;=
attribute Join_C { boolean }?
</ucdxml:block>
10 changes: 10 additions & 0 deletions uax/uax42/fragments/properties/Name_Alias.xml
@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="name-alias element" id='schema.properties'>
code-point-attributes &amp;=
element name-alias {
attribute alias { xsd:string { pattern="[a-zA-Z0-9]+(( -|- |[\-_ ])[a-zA-Z0-9]+)*" } }?,
attribute type { "abbreviation" | "alternate"
| "control" | "correction"
| "figment"
}? } *
</ucdxml:block>
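
The `alias` pattern above is dense, so here is a hedged Python sketch of what it admits: alphanumeric words joined by a single space, hyphen, or underscore, or by the two-character separators space-hyphen and hyphen-space. The function name `is_valid_name_alias` and the sample strings in the usage note are illustrative assumptions; only the expression and the type list come from the fragment.

```python
import re

# Same expression as the alias attribute pattern above.
ALIAS = re.compile(r"[a-zA-Z0-9]+(( -|- |[\-_ ])[a-zA-Z0-9]+)*")

# Values allowed by the type attribute in the fragment.
ALIAS_TYPES = {"abbreviation", "alternate", "control", "correction", "figment"}


def is_valid_name_alias(alias: str, alias_type: str) -> bool:
    """Check an alias/type pair against the fragment's constraints."""
    # XSD patterns are anchored at both ends, hence fullmatch.
    return ALIAS.fullmatch(alias) is not None and alias_type in ALIAS_TYPES
```

For example, `is_valid_name_alias("LETTER -A", "correction")` passes because of the space-hyphen separator, while a string containing two consecutive spaces does not.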