A highly configurable UCD processing utility for dumping the data into any subset and format #20
Comments
Omg I love these horrible project names. Really, I gave that heart for them! I also noticed https://github.com/harfbuzz/packtab the other day, which is mostly about space efficiency and performance, I assume. Swift’s UCD seems to come from ICU’s API. I wonder what Rust does. @Manishearth, can you point us to where the Rust community deals with preprocessing UCD data files? And I know @duerst has been doing a pretty good job at bridging the UCD and Ruby. Martin, would you like to briefly introduce Ruby’s status quo?
We typically have Python scripts like this one that ingest data files and spit out Rust code. Most of them are copied from each other and modified. We also have ucd-generate, which is a generalized Rust crate for dealing with this. Some projects use that.
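To make the "ingest data files and spit out Rust code" pattern concrete, here is a minimal sketch of what such a generator step might look like. It is not taken from any of the actual Rust project scripts; the function name `emit_rust_table` and the constant name `LATIN_SAMPLE` are hypothetical, and the input ranges are assumed to be already parsed from a UCD file.

```python
from typing import List, Tuple


def emit_rust_table(name: str, ranges: List[Tuple[int, int]]) -> str:
    """Render code point ranges as a Rust `const` slice of (first, last) pairs.

    Hypothetical helper: sorts the ranges and merges adjacent or
    overlapping ones so the generated table stays compact.
    """
    merged: List[Tuple[int, int]] = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            # Adjacent or overlapping: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    rows = "\n".join(f"    (0x{a:04X}, 0x{b:04X})," for a, b in merged)
    return f"pub const {name}: &[(u32, u32)] = &[\n{rows}\n];\n"


# Illustrative input only; a real script would read these from a UCD file.
rust_src = emit_rust_table(
    "LATIN_SAMPLE",
    [(0x41, 0x5A), (0x61, 0x7A), (0x5B, 0x60), (0xAA, 0xAA)],
)
print(rust_src)
```

A sorted range table like this is one common shape for generated lookup data, since the consumer can binary-search it; whether that trade-off suits a given project is exactly the kind of choice a configurable tool would need to expose.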
LettError/glyphNameFormatter also has its own UCD parsing, from the flat XML:
The Unicode Character Database is such a fundamental set of data files, but it’s kinda tricky to access. Although each file generally has a well-defined syntax, many of the data files reference each other, and it’s just too much to process from scratch.
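As a rough illustration of the per-file syntax, many UCD files share a `code point or range ; field ; ... # comment` line shape. The sketch below parses that shape only; the `Record` type and `parse_ucd_lines` function are hypothetical names, and files with a different layout (for example `UnicodeData.txt`, which marks ranges with paired First/Last entries) would need extra handling that is not shown here.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Record(NamedTuple):
    first: int               # first code point of the range
    last: int                # last code point (equals first for single entries)
    fields: Tuple[str, ...]  # remaining semicolon-separated fields, stripped


def parse_ucd_lines(lines: Iterable[str]) -> Iterator[Record]:
    """Parse the common 'range ; field ; ... # comment' line shape."""
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue  # blank or comment-only line
        cols = [c.strip() for c in line.split(";")]
        first, _, last = cols[0].partition("..")
        yield Record(int(first, 16), int(last or first, 16), tuple(cols[1:]))


# Sample lines in the style of Scripts.txt, for illustration.
sample = """\
# comment-only line
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
"""
records = list(parse_ucd_lines(sample.splitlines()))
for rec in records:
    print(f"{rec.first:04X}..{rec.last:04X} {rec.fields}")
```

Parsing one file this way is the easy part; the cross-references between files are what make a from-scratch pipeline expensive.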
UAX #42 is meant to provide a unified format that is more accessible (and it does do a good job), but it’s still incomplete and slightly biased. For example, it doesn’t include property/value aliases, and it’s stuck in the 6.1.0 version for the property/value short names. Also, there are a lot of informative fields in the UCD that can be better consumed if we process them into a more accessible format.
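For the missing aliases, a consumer of the XML currently has to go back to the plain-text alias files. A minimal sketch of building a lookup from lines in the style of `PropertyValueAliases.txt` might look like this. The names `loose` and `load_value_aliases` are hypothetical, and the normalization is a simplified take on loose matching (ignoring case, underscores, hyphens, and spaces), not a full implementation of the matching rules in UAX #44.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def loose(name: str) -> str:
    """Simplified loose matching: ignore case, '_', '-', and spaces."""
    return name.lower().replace("_", "").replace("-", "").replace(" ", "")


def load_value_aliases(lines: Iterable[str]) -> Dict[str, Dict[str, Tuple[str, ...]]]:
    """Map property -> any alias of a value -> all aliases of that value."""
    table: Dict[str, Dict[str, Tuple[str, ...]]] = defaultdict(dict)
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        prop, *names = [c.strip() for c in line.split(";")]
        for alias in names:
            table[loose(prop)][loose(alias)] = tuple(names)
    return table


# Sample lines in the style of PropertyValueAliases.txt, for illustration.
sample = [
    "# comment-only line",
    "gc ; Lu ; Uppercase_Letter",
    "gc ; Ll ; Lowercase_Letter",
    "sc ; Latn ; Latin",
]
aliases = load_value_aliases(sample)
print(aliases["gc"][loose("uppercase letter")])
```

The sketch deliberately keeps every alias rather than picking a canonical one, since which name counts as preferred differs between downstream projects.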
Different downstream projects have different concerns and need to compromise between various factors like space, performance, information, typing, etc., in different ways. A helpful upstream project needs to be configurable enough to serve them all. Yes, I believe the planet is better off with a single, predominant UCD utility project.
We probably need to, again, start with documenting all the existing UCD processing tools (and even UCD-related APIs), including the internal tools used to generate the slimmed-down data files in various programming languages’ (standard) libraries.
See also a short thread between @alerque and me: