Ucdxml and TR42 #859
Changes from 13 commits
b0656d8
3ce611a
0ba5996
7764f6c
7e161a6
d609d92
cb314e8
776e00e
8b870a6
d612e96
f552e63
242f22b
e625ff0
6ee2467
dbb5dd3
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -43,6 +43,7 @@ perf-*.xml | |
test-*.xml | ||
|
||
# Directories | ||
.idea/ | ||
.settings/ | ||
.vs/ | ||
.vscode/ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# Generating TR42 | ||
|
||
## Step 1 - Generate property value fragments | ||
|
||
- Run org.unicode.xml.GeneratePropertyValues to populate the UNICODETOOLS_REPO_DIR/uax/uax42/fragments/ folder. | ||
Review comment: This will want to be a specific, reproducible |
Reply: Done, but please check to see if this is what you were thinking. |
||
|
||
## Step 2 - Generate TR42 index.html and index.rnc | ||
|
||
- In UNICODETOOLS_REPO_DIR/uax/uax42/ run `mvn xml:transform` | ||
Review comment: mvn command lines should run from the root as usual, see https://github.com/unicode-org/unicodetools/blob/main/docs/build.md for examples. |
Reply: I think I have what you want, but... |
||
|
||
index.html and index.rnc will be generated in UNICODETOOLS_REPO_DIR/uax/uax42/output/ | ||
Review comment: The output should go into a folder under UNICODETOOLS_GEN_DIR as usual, such as UNICODETOOLS_GEN_DIR/ucdxml/17.0.0/ |
Reply: I think I have what you want, but... |
||
|
||
## Step 3 - Validate generated UAX XML files | ||
|
||
You'll need a [RELAX NG](https://relaxng.org/) schema validator. We'll use [jing-trang](https://github.com/relaxng/jing-trang) in this example. | ||
Review comment: Is this the recommended tool for the job? There has been no release and no commit in 2.5 years. Abandoned? |
||
|
||
1. Clone and build [jing-trang](https://github.com/relaxng/jing-trang) | ||
Review comment: This does not look workable. Dependencies should be handled by Maven via the pom.xml file, which I assume would pull the Trang artifacts from https://mvnrepository.com/artifact/org.relaxng/trang It should be possible to run one Maven command that builds and runs the tool. |
||
2. Run the following: | ||
``` | ||
java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. mvn ... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that we should probably discuss what is in-scope and what is out-of-scope for the process and the utility. The deliverables are:
The fragment files are generated, but are then consumed by the TR42 HTML file and the RNC file. I think that these should be in the repo, especially as some of them are created manually. Should any of the deliverables be stored in unicodetools? |
||
``` | ||
Note that the UAX XML file must be saved in NFD, because the Unihan syntax regular expressions expect NFD. | ||
Review comment: Does the tool generate the data files in NFD? |
Reply: The data files are not NFD by default, and I'm not sure that we should change the format for publication. |
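Since the thread leaves open whether the published data files should be NFD, one option is to validate a normalized copy and leave the published file untouched. The following is a minimal sketch of that idea using Python's standard `unicodedata` module. The helper names and the use of separate source and destination paths are assumptions for illustration, not part of the PR.

```python
import unicodedata
from pathlib import Path


def to_nfd(text: str) -> str:
    """Return text in Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", text)


def is_nfd(text: str) -> bool:
    """True when normalizing to NFD would leave the text unchanged."""
    return to_nfd(text) == text


def write_nfd_copy(src: Path, dst: Path) -> None:
    """Write an NFD-normalized copy of a UTF-8 file (hypothetical helper).

    The source file is not modified, so the published data keeps
    whatever form it already has.
    """
    dst.write_text(to_nfd(src.read_text(encoding="utf-8")), encoding="utf-8")
```

The copy written by `write_nfd_copy` would then be the file handed to the validator in place of `<path to UAX xml file>`.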
||
|
Review comment: The Unicode Tools normally work with internal data files stored in https://github.com/unicode-org/unicodetools/tree/main/unicodetools/src/main/resources/org/unicode Is it necessary to create a new, separate folder structure outside of that? |
Reply: Moved to resources/org/unicode/uax42/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="blocks" id='schema.block'> | ||
ucd.content &= | ||
element blocks { | ||
element block { | ||
attribute first-cp { single-code-point }, | ||
attribute last-cp { single-code-point }, | ||
attribute name { text } }+ }? | ||
</ucdxml:block> |
Review comment: There are many little folders with one file each, and one folder with three. Is that necessary or useful? Can we flatten these all into one folder? Some folders like "properties" are ok, but adding lots of mini folders seems cumbersome. |
Reply: Done |
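To make the `blocks` fragment above concrete, here is a rough Python sketch that checks a small instance against the same constraints: at least one `block` child, with `first-cp` and `last-cp` matching the `single-code-point` pattern and a `name` attribute present. It is not a RELAX NG validator. The function name `check_blocks` and the standalone `blocks` root element are assumptions made for illustration, and the namespace URI is the one declared in the namespace fragment of this PR.

```python
import re
import xml.etree.ElementTree as ET

# Namespace taken from the "namespace declaration" fragment in this PR.
UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Same expression as the single-code-point datatype; XSD patterns match the
# whole value, so fullmatch() is used below.
SINGLE_CODE_POINT = re.compile(r"(|[1-9A-F]|(10))[0-9A-F]{4}")

SAMPLE = f"""<blocks xmlns="{UCD_NS}">
  <block first-cp="0000" last-cp="007F" name="Basic Latin"/>
  <block first-cp="0080" last-cp="00FF" name="Latin-1 Supplement"/>
</blocks>"""


def check_blocks(xml_text: str) -> list:
    """Return a list of problems found in a <blocks> element (hypothetical helper)."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{UCD_NS}}}blocks":
        problems.append(f"unexpected root element: {root.tag}")
        return problems
    children = list(root)
    if not children:
        problems.append("blocks must contain at least one block")
    for index, child in enumerate(children):
        if child.tag != f"{{{UCD_NS}}}block":
            problems.append(f"child {index}: unexpected element {child.tag}")
            continue
        for attr in ("first-cp", "last-cp"):
            value = child.get(attr)
            if value is None or not SINGLE_CODE_POINT.fullmatch(value):
                problems.append(f"child {index}: bad or missing {attr}: {value!r}")
        if child.get("name") is None:
            problems.append(f"child {index}: missing name")
    return problems
```

Calling `check_blocks(SAMPLE)` returns an empty list. A value such as `last-cp="7F"` is reported because the pattern requires at least four hexadecimal digits.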
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="boolean" id='schema.boolean'> | ||
boolean = "Y" | "N" | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="cjk radicals" id='schema.cjk-radicals'> | ||
ucd.content &= | ||
element cjk-radicals { | ||
element cjk-radical { | ||
attribute number { xsd:string {pattern="[0-9]{1,3}'{0,3}"}}, | ||
attribute radical { single-code-point? }, | ||
attribute ideograph { single-code-point } }+ }? | ||
</ucdxml:block> |
Review comment: Please no spaces in file names |
Reply: Done |
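The `number` attribute in the cjk-radicals fragment is constrained by the pattern `[0-9]{1,3}'{0,3}`: one to three digits followed by up to three apostrophes. A small Python sketch of the same check follows. The function name is hypothetical, and `fullmatch` is used because XML Schema patterns apply to the whole value rather than to a substring.

```python
import re

# Pattern copied from the cjk-radicals fragment.
RADICAL_NUMBER = re.compile(r"[0-9]{1,3}'{0,3}")


def is_radical_number(value: str) -> bool:
    """True when value has 1-3 digits followed by 0-3 apostrophes."""
    return RADICAL_NUMBER.fullmatch(value) is not None
```

Under this pattern `214`, `90'` and `182''` are accepted, while an empty string, `1234`, and a value with four apostrophes are rejected.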
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="datatype for code points" id='schema.datatypes'> | ||
single-code-point = xsd:string { pattern = "(|[1-9A-F]|(10))[0-9A-F]{4}" } | ||
|
||
one-or-more-code-points = list { single-code-point + } | ||
zero-or-more-code-points = list { single-code-point * } | ||
two-code-points = list { single-code-point, single-code-point } | ||
</ucdxml:block> |
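The code point datatypes above can be approximated in Python as follows. This is an illustrative sketch, not the validator's implementation: the function names are made up, and the RELAX NG `list` construct is modelled by splitting the value on whitespace.

```python
import re

# Pattern copied from the single-code-point datatype: four hex digits,
# optionally preceded by one non-zero hex digit or by "10".
SINGLE_CODE_POINT = re.compile(r"(|[1-9A-F]|(10))[0-9A-F]{4}")


def is_single_code_point(token: str) -> bool:
    """True when the whole token matches the single-code-point pattern."""
    return SINGLE_CODE_POINT.fullmatch(token) is not None


def is_zero_or_more_code_points(value: str) -> bool:
    """Whitespace-separated list in which every token is a code point."""
    return all(is_single_code_point(t) for t in value.split())


def is_one_or_more_code_points(value: str) -> bool:
    """As above, but the list must not be empty."""
    tokens = value.split()
    return len(tokens) >= 1 and all(is_single_code_point(t) for t in tokens)


def is_two_code_points(value: str) -> bool:
    """Exactly two whitespace-separated code points."""
    tokens = value.split()
    return len(tokens) == 2 and all(is_single_code_point(t) for t in tokens)
```

The pattern only accepts uppercase hexadecimal digits, so `1F600` is accepted and `1f600` is not. It also rejects values above `10FFFF`, such as `110000`, and values with a leading zero before the final four digits, such as `00041`.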
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="datatypes declaration" id='schema.datatypes'> | ||
# default; datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes" | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="datatype for code points" id='schema.datatypes'> | ||
jis-code-point = xsd:string { pattern = "[0-9A-F]{4}" } | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="description" id='schema.description'> | ||
ucd.content &= | ||
element description { text }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="do-not-emit" id='schema.do-not-emit'> | ||
ucd.content &= | ||
element do-not-emit { | ||
element instead { | ||
attribute of { one-or-more-code-points }, | ||
attribute use { one-or-more-code-points }, | ||
attribute because { "Bengali_Khanda_Ta" | ||
| "Deprecated" | ||
| "Discouraged" | ||
| "Dotless_Form" | ||
| "Hamza_Form" | ||
| "Indic_Atomic_Consonant" | ||
| "Indic_Consonant_Conjunct" | ||
| "Indic_Vowel_Letter" | ||
| "Malayalam_Chillu" | ||
| "Precomposed_Form" | ||
| "Precomposed_Hieroglyph" | ||
| "Preferred_Spelling" | ||
| "Tamil_Shrii" | ||
} }+ }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="Emoji properties" id='schema.emoji-data'> | ||
code-point-attributes &= | ||
attribute Emoji { boolean }? | ||
|
||
code-point-attributes &= | ||
attribute EPres { boolean }? | ||
|
||
code-point-attributes &= | ||
attribute EMod { boolean }? | ||
|
||
code-point-attributes &= | ||
attribute EBase { boolean }? | ||
|
||
code-point-attributes &= | ||
attribute EComp { boolean }? | ||
|
||
code-point-attributes &= | ||
attribute ExtPict { boolean }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="emoji sources" id='schema.emoji-sources'> | ||
ucd.content &= | ||
element emoji-sources { | ||
element emoji-source { | ||
attribute unicode { one-or-more-code-points }, | ||
attribute docomo { jis-code-point? }, | ||
attribute kddi { jis-code-point? }, | ||
attribute softbank { jis-code-point? } }+ }? | ||
</ucdxml:block> |
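One detail of the emoji-sources fragment is easy to misread: in `attribute docomo { jis-code-point? }` the `?` sits inside the braces, so, as I read the RELAX NG compact syntax, the attribute itself is required while its value may be empty. (Compare `attribute Emoji { boolean }?`, where the `?` outside the braces makes the attribute optional.) The Python sketch below models that reading. The function name is hypothetical and the attribute values in the usage note are illustrative only.

```python
import re

# Patterns copied from the datatype fragments in this PR.
SINGLE_CODE_POINT = re.compile(r"(|[1-9A-F]|(10))[0-9A-F]{4}")
JIS_CODE_POINT = re.compile(r"[0-9A-F]{4}")

CARRIERS = ("docomo", "kddi", "softbank")


def check_emoji_source(attrs: dict) -> list:
    """Return problems for one emoji-source element's attributes."""
    problems = []
    # unicode: required, one or more whitespace-separated code points.
    tokens = attrs.get("unicode", "").split()
    if not tokens or not all(SINGLE_CODE_POINT.fullmatch(t) for t in tokens):
        problems.append("unicode: expected one or more code points")
    for carrier in CARRIERS:
        if carrier not in attrs:
            # The attribute must be present even when there is no mapping.
            problems.append(f"{carrier}: attribute is missing")
            continue
        value = attrs[carrier]
        # An empty (or whitespace-only) value stands for "no mapping".
        if value.strip() and not JIS_CODE_POINT.fullmatch(value):
            problems.append(f"{carrier}: expected four hex digits or empty")
    return problems
```

For example, `check_emoji_source({"unicode": "2600", "docomo": "F89F", "kddi": "", "softbank": ""})` reports no problems, while omitting the `kddi` key entirely is reported as a missing attribute.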
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="named sequences" id='schema.named-sequences'> | ||
ucd.content &= | ||
element named-sequences { | ||
element named-sequence { | ||
attribute cps { one-or-more-code-points }, | ||
attribute name { text } }+ }? | ||
|
||
ucd.content &= | ||
element provisional-named-sequences { | ||
element named-sequence { | ||
attribute cps { one-or-more-code-points }, | ||
attribute name { text } }+ }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="namespace declaration" id='schema.namespace'> | ||
default namespace ucd = "http://www.unicode.org/ns/2003/ucd/1.0" | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!--Manual--> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="normalization corrections" id='schema.normalization-corrections'> | ||
ucd.content &= | ||
element normalization-corrections { | ||
element normalization-correction { | ||
attribute cp { single-code-point }, | ||
attribute old { one-or-more-code-points }, | ||
attribute new { one-or-more-code-points }, | ||
attribute version { text } }+ }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="Nushu data" id='schema.nushu'> | ||
code-point-attributes &= | ||
attribute kSrc_NushuDuben { xsd:string { pattern="[0-9]+\.[0-9]+" } }? | ||
|
||
code-point-attributes &= | ||
attribute kReading { xsd:string }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="Bidi_C attribute" id='schema.properties'> | ||
code-point-attributes &= | ||
attribute Bidi_C { boolean }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="Bidi_M attribute" id='schema.properties'> | ||
code-point-attributes &= | ||
attribute Bidi_M { boolean }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="InCB attribute" id='schema.properties'> | ||
code-point-attributes &= | ||
attribute InCB { "Consonant" | ||
| "Extend" | ||
| "Linker" | ||
| "None" | ||
}? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="InPC attribute" id='schema.properties'> | ||
code-point-attributes &= | ||
attribute InPC { "Bottom" | ||
| "Bottom_And_Left" | ||
| "Bottom_And_Right" | ||
| "Left" | ||
| "Left_And_Right" | ||
| "NA" | ||
| "Overstruck" | ||
| "Right" | ||
| "Top" | ||
| "Top_And_Bottom" | ||
| "Top_And_Bottom_And_Left" | ||
| "Top_And_Bottom_And_Right" | ||
| "Top_And_Left" | ||
| "Top_And_Left_And_Right" | ||
| "Top_And_Right" | ||
| "Visual_Order_Left" | ||
}? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="InSC attribute" id='schema.properties'> | ||
code-point-attributes &= | ||
attribute InSC { "Avagraha" | ||
| "Bindu" | ||
| "Brahmi_Joining_Number" | ||
| "Cantillation_Mark" | ||
| "Consonant" | ||
| "Consonant_Dead" | ||
| "Consonant_Final" | ||
| "Consonant_Head_Letter" | ||
| "Consonant_Initial_Postfixed" | ||
| "Consonant_Killer" | ||
| "Consonant_Medial" | ||
| "Consonant_Placeholder" | ||
| "Consonant_Preceding_Repha" | ||
| "Consonant_Prefixed" | ||
| "Consonant_Subjoined" | ||
| "Consonant_Succeeding_Repha" | ||
| "Consonant_With_Stacker" | ||
| "Gemination_Mark" | ||
| "Invisible_Stacker" | ||
| "Joiner" | ||
| "Modifying_Letter" | ||
| "Non_Joiner" | ||
| "Nukta" | ||
| "Number" | ||
| "Number_Joiner" | ||
| "Other" | ||
| "Pure_Killer" | ||
| "Register_Shifter" | ||
| "Reordering_Killer" | ||
| "Syllable_Modifier" | ||
| "Tone_Letter" | ||
| "Tone_Mark" | ||
| "Virama" | ||
| "Visarga" | ||
| "Vowel" | ||
| "Vowel_Dependent" | ||
| "Vowel_Independent" | ||
}? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="JSN attribute" id='schema.properties'> | ||
code-point-attributes &= | ||
attribute JSN { xsd:string { pattern="[A-Z]{0,3}" } }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="joining properties" id='schema.properties'> | ||
code-point-attributes &= | ||
attribute Join_C { boolean }? | ||
</ucdxml:block> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<ucdxml:block xmlns:ucdxml="http://unicode.org/ns/2001/ucdxml" title="name-alias element" id='schema.properties'> | ||
code-point-attributes &= | ||
element name-alias { | ||
attribute alias { xsd:string { pattern="[a-zA-Z0-9]+(( -|- |[\-_ ])[a-zA-Z0-9]+)*" } }?, | ||
attribute type { "abbreviation" | "alternate" | ||
| "control" | "correction" | ||
| "figment" | ||
}? } * | ||
</ucdxml:block> |
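The `alias` pattern in the name-alias fragment allows runs of letters and digits joined by a single space, hyphen, or underscore, or by the two-character separators space-hyphen and hyphen-space. A short Python sketch of the same check, together with the enumerated `type` values, may help when reviewing generated data. The function names are hypothetical.

```python
import re

# Pattern and enumeration copied from the name-alias fragment.
ALIAS = re.compile(r"[a-zA-Z0-9]+(( -|- |[\-_ ])[a-zA-Z0-9]+)*")
ALIAS_TYPES = {"abbreviation", "alternate", "control", "correction", "figment"}


def is_valid_alias(value: str) -> bool:
    """True when the whole value matches the alias pattern."""
    return ALIAS.fullmatch(value) is not None


def is_valid_alias_type(value: str) -> bool:
    """True when value is one of the enumerated alias types."""
    return value in ALIAS_TYPES
```

Values such as `NBSP`, `LINE FEED`, and `TIBETAN MARK BKA- SHOG YIG MGO` match. Doubled separators (`A  B`, `foo--bar`) and leading or trailing spaces do not.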
Review comment: Documentation should go into https://github.com/unicode-org/unicodetools/tree/main/docs,
one of
Reply: Moved to docs/ucdxml.md