Ucdxml and TR42 #859
Comment on June 6 is no longer valid - we're now ready for review.
@macchiati @eggrobin @markusicu - Please can you review?
@@ -310,6 +313,15 @@ Unihan_Variants ; kSpoofingVariant
Unihan_Variants ; kTraditionalVariant
Unihan_Variants ; kZVariant
Should there be a line for kZhuang here? (In other words, are you getting any data for kZhuang?)
The current version of UCDXML does not support kZhuang, just kZhuangNumeric.
Similar to Unikemet, we should add support either for the revised 16.0 UCDXML files, or for 17.
cjkRSTUnicode ; kRSTUnicode
cjkReading ; kReading
cjkSrc_NushuDuben ; kSrc_NushuDuben
cjkTGT_MergedSrc ; kTGT_MergedSrc
Please revert this change: Tangut and Nüshu are not CJK, they should not have a cjk alias.
The name kReading is unfortunate (since this is really Nüshu-specific), but it is what it is.
I guess you should add the comment that I should have added, saying that these are the fields from the Tangut and Nüshu source files.
Fixed.
unicodetools/src/main/resources/org/unicode/props/ExtraPropertyAliases.txt
default:
throw new RuntimeException("Missing Catalog case");
}
case Enumerated:
This (and the associated pile of UnicodeMaps) seems like it is going to be a bit annoying to maintain as we add properties.
Is there a reason why you are not doing something like
final UnicodeProperty property = indexUnicodeProperties.getProperty(prop);
final List<String> valueAliases = property.getValueAliases(property.getValue(codepoint));
return valueAliases.size() == 1 ? valueAliases.get(0) : valueAliases.get(1);
for most of them (special-casing Decomposition_Type etc. as needed)?
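A minimal, self-contained sketch of the alias-selection rule in the snippet above. The real `UnicodeProperty` and `indexUnicodeProperties` objects are replaced here by a plain `Map`, and the value names and alias lists are made-up placeholders rather than UCD data; only the `size() == 1 ? get(0) : get(1)` rule comes from the suggestion.

```java
import java.util.List;
import java.util.Map;

/** Illustrative stand-in for a lookup via UnicodeProperty.getValueAliases(...). */
class ValueAliasSketch {
    // Hypothetical alias lists keyed by property value; not real UCD data.
    static final Map<String, List<String>> ALIASES =
            Map.of(
                    "ValueA", List.of("OnlyAlias"),
                    "ValueB", List.of("FirstAlias", "SecondAlias"));

    /** Returns the sole alias when there is exactly one, otherwise the second alias. */
    static String pickAlias(List<String> valueAliases) {
        if (valueAliases.isEmpty()) {
            // The original snippet does not handle this case; failing loudly is an assumption.
            throw new IllegalArgumentException("No aliases for value");
        }
        return valueAliases.size() == 1 ? valueAliases.get(0) : valueAliases.get(1);
    }

    public static void main(String[] args) {
        System.out.println(pickAlias(ALIASES.get("ValueA")));
        System.out.println(pickAlias(ALIASES.get("ValueB")));
    }
}
```

Properties whose XML attribute values do not follow this rule (Decomposition_Type is the one named above) would still need their own branch before falling through to the generic lookup.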
I think that I was assuming that there were going to be more special cases, but I agree that as there are not, your solution is better. Implemented.
No good reason. I think the code might predate the formal recognition as properties.
On Wed, Nov 27, 2024, 07:01 Robin Leroy wrote:
In unicodetools/src/main/java/org/unicode/xml/AttributeResolver.java:
> + case Script:
+ return map_script.get(codepoint).getShortName();
+ case Script_Extensions:
+ StringBuilder extensionBuilder = new StringBuilder();
+ String[] extensions = map_script_extensions.get(codepoint).split("\\|", 0);
+ for (String extension : extensions) {
+ extensionBuilder.append(
+ UcdPropertyValues.Script_Values.valueOf(extension)
+ .getShortName());
+ extensionBuilder.append(" ");
+ }
+ return extensionBuilder.toString().trim();
+ default:
+ throw new RuntimeException("Missing Catalog case");
+ }
+ case Enumerated:
Hi John, I took a peek -- not yet at the new .java files... -- and we discussed some high-level things in today's Unicode Tools meeting.
Some comments below.
For publication, we are thinking that we continue to copy two of the grouped files into the repo, but run the tool as part of a publication step, so that we don't check in all of the large, highly redundant files.
@@ -204,5 +215,5 @@ Confusable_MA ; SINGLE_VALUED ; $codePoints
#Emoji ; SINGLE_VALUED ; <enum>
#Emoji_Presentation ; SINGLE_VALUED ; <enum>
#Emoji_Modifier ; SINGLE_VALUED ; <enum>
#Emoji_Modifier_Base ; SINGLE_VALUED ; <enum>
#Emoji_Modifier_Base ; SINGLE_VALUED ; <enum>
All other lines get their indentation fixed, but this one gets it un-fixed...?
I've now replaced all the tab chars with spaces to avoid this issue appearing on editors with different tab settings.
uax/uax42/Readme.md
Outdated
Documentation should go into https://github.com/unicode-org/unicodetools/tree/main/docs,
one of
- some existing file that covers UCDXML (not sure if there is one)
- a new ucdxml.md there
- a new index.md in a new ucdxml/ folder there
Moved to docs/ucdxml.md
uax/uax42/fragments/block/block.xml
Outdated
The Unicode Tools normally work with internal data files stored in https://github.com/unicode-org/unicodetools/tree/main/unicodetools/src/main/resources/org/unicode
Is it necessary to create a new, separate folder structure outside of that?
Why not add a ucdxml/ folder in the usual place?
Moved to resources/org/unicode/uax42/
There are many little folders with one file each, and one folder with three. Is that necessary or useful? Can we flatten these all into one folder?
Some folders like "properties" are ok, but adding lots of mini folders seems cumbersome.
Done
Please no spaces in file names
Done
uax/uax42/Readme.md
Outdated
## Step 1 - Generate property value fragments

- Run org.unicode.xml.GeneratePropertyValues to populate the UNICODETOOLS_REPO_DIR/uax/uax42/fragments/ folder.
This will want to be a specific, reproducible mvn command line.
Done, but please check to see if this is what you were thinking.
uax/uax42/Readme.md
Outdated
## Step 2 - Generate TR42 index.html and index.rnc

- In UNICODETOOLS_REPO_DIR/uax/uax42/ run `mvn xml:transform`
mvn command lines should run from the root as usual, see https://github.com/unicode-org/unicodetools/blob/main/docs/build.md for examples.
I think I have what you want, but...
uax/uax42/Readme.md
Outdated
- In UNICODETOOLS_REPO_DIR/uax/uax42/ run `mvn xml:transform`

index.html and index.rnc will be generated in UNICODETOOLS_REPO_DIR/uax/uax42/output/
The output should go into a folder under UNICODETOOLS_GEN_DIR as usual, such as UNICODETOOLS_GEN_DIR/ucdxml/17.0.0/
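As a rough sketch of that layout, the output folder can be derived from the generation directory and the Unicode version. The class name, the use of a `UNICODETOOLS_GEN_DIR` system property, and the `Generated` fallback are assumptions made for illustration; the tools' own settings code may resolve the directory differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: builds <genDir>/ucdxml/<version>. */
class UcdxmlOutputDirSketch {
    static Path outputDir(String genDir, String version) {
        return Paths.get(genDir, "ucdxml", version);
    }

    public static void main(String[] args) {
        // Assumption: the generation directory arrives as a -D system property;
        // "Generated" is a placeholder default, not a path taken from the repo.
        String genDir = System.getProperty("UNICODETOOLS_GEN_DIR", "Generated");
        System.out.println(outputDir(genDir, "17.0.0"));
    }
}
```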
I think I have what you want, but...
uax/uax42/Readme.md
Outdated
1. Clone and build [jing-trang](https://github.com/relaxng/jing-trang)
2. Run the following:
```
java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file>
```
mvn ...
I think that we should probably discuss what is in-scope and what is out-of-scope for the process and the utility. The deliverables are:
- The UCDXML files
- The TR42 HTML file
- The RNC file
The fragment files are generated, but are then consumed by the TR42 HTML file and the RNC file. I think that these should be in the repo, especially as some of them are created manually.
Should any of the deliverables be stored in unicodetools?
If we are planning to validate the UCDXML files using the RNC files as part of the build, we'll need a process that incorporates all steps. Should that be considered a "unit test"?
uax/uax42/Readme.md
Outdated
```
java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file>
```
Note that the UAX xml file has to be saved as NFD as the Unihan syntax regular expressions are expecting NFD.
Does the tool generate the data files in NFD?
It seems like the files should come out as needed for the tool chain as well as for publication.
The data files are not NFD by default, and I'm not sure that we should change the format for publication.
The rest of your question ties back to my previous comment; we could add a step to create an NFD version of the UCDXML files as part of an end-to-end process. However, is that in scope?
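If such a step were added, one possible shape is a small utility that writes an NFD copy next to the published file, leaving the published format unchanged. The sketch below uses the JDK's `java.text.Normalizer`; the class and method names are hypothetical. One caveat to check before relying on it: the JDK normalizes according to the Unicode version it ships with, which may be older than the UCD version being generated, so a normalizer built from current data (for example ICU4J) might be the safer choice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;

/** Hypothetical helper that produces an NFD copy of a UTF-8 file for validation. */
class NfdCopySketch {
    static String toNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    /**
     * Reads the whole source file into memory, which is simple but may be
     * too costly for the largest UCDXML files; a streaming variant would
     * need care not to split a combining sequence across chunks.
     */
    static void writeNfdCopy(Path source, Path target) throws IOException {
        String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        Files.write(target, toNfd(text).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: NfdCopySketch <source.xml> <target.xml>");
            return;
        }
        writeNfdCopy(Path.of(args[0]), Path.of(args[1]));
    }
}
```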
PR to make it easy to see what changes have been made to support UCDXML.