
A highly configurable UCD processing utility for dumping the data into any subset and format #20

lianghai opened this issue Oct 13, 2020 · 4 comments

@lianghai
Contributor

The Unicode Character Database is such a fundamental set of data files, but it’s kinda tricky to access. Although each file generally has a well-defined syntax, many of the data files reference each other, and it’s just too much to process from scratch.
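
For a sense of what “well-defined syntax” means here, this is a minimal sketch (mine, not any existing tool’s) of reading just UnicodeData.txt; the per-file parsing really is trivial, and the pain is in everything the fields point to in *other* files (aliases, derived properties, names). The short field names below are my own labeling, borrowed from the conventional property aliases:

```python
from pathlib import Path

# The 15 semicolon-separated fields of UnicodeData.txt, labeled with the
# conventional short property aliases (my choice of keys).
FIELDS = (
    "cp", "na", "gc", "ccc", "bc", "dt_dm", "nv_decimal", "nv_digit",
    "nv_numeric", "bidi_m", "na1", "isc", "suc", "slc", "stc",
)

def parse_unicode_data(path):
    """Yield one dict per record; <…, First>/<…, Last> ranges are not expanded."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue
        record = dict(zip(FIELDS, line.split(";")))
        record["cp"] = int(record["cp"], 16)
        yield record

# for r in parse_unicode_data("UnicodeData.txt"):
#     print(f"U+{r['cp']:04X}  {r['na']}  gc={r['gc']}")
```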

UAX #42 is meant to provide a unified format that is more accessible (and it does do a good job), but it’s still incomplete and slightly biased. For example, it doesn’t include property/value aliases, and its property/value short names are stuck at version 6.1.0. Also, there are a lot of informative fields in the UCD that could be consumed much better if we processed them into a more accessible format.
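
As a concrete example of what the XML leaves out, here’s a hedged sketch of collecting the property value aliases from PropertyValueAliases.txt (the file layout is as documented in the UCD; the output shape is just my choice):

```python
from collections import defaultdict

def load_value_aliases(path="PropertyValueAliases.txt"):
    """Map property short name -> {any alias: canonical long value name}."""
    aliases = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()
            if not line:
                continue
            prop, *names = [field.strip() for field in line.split(";")]
            if prop == "ccc":        # ccc lines carry an extra numeric field first
                names = names[1:]
            long_name = names[1] if len(names) > 1 else names[0]
            for alias in names:
                aliases[prop][alias] = long_name
    return aliases

# load_value_aliases()["gc"]["Lu"]  ->  "Uppercase_Letter"
```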

Different downstream projects have different concerns and need to trade off factors like space, performance, information, and typing in different ways, and a helpful upstream project needs to be configurable enough to serve them all. Yes, I believe the planet is better off with a single, predominant UCD utility project.

We probably need to, again, start by documenting all the existing UCD processing tools (and even UCD-related APIs), including the internal tools used to generate the slimmed-down data files in various programming languages’ (standard) libraries.


See also a short thread between @alerque and me:

A bit annoyed by how UCD formats/models out there are all biased in their own ways. UAX #42 (XML) and ppucd are already the best but still not quite complete and are normalized a bit. Should we build a highly configurable UCD processor to dump whatever subset in whatever format?

Markus’s preparseucd.py (https://github.com/unicode-org/icu/blob/master/tools/unicode/py/preparseucd.py), which generates the ppucd (https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt), may be a natural starting point.

Having a robust utility to process and dump the data would make downstream projects easier to maintain. Updating data sets by importing and post-processing somebody else's dump in some other format is just a recipe for being out of date and introducing mistakes.
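
The ppucd.txt mentioned in that thread is itself a useful model: as far as I can tell, every data line is a semicolon-separated record whose first field is a line type (defaults, block, cp, …), followed by a code point or range and then prop=value pairs, with bare names switching binary properties on and a leading "-" switching them off. A rough, unverified sketch of reading one line:

```python
def parse_ppucd_line(line):
    """Split one ppucd.txt data line into (line type, code point/range, properties)."""
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    line_type, *fields = line.split(";")
    props = {}
    for field in fields[1:]:
        if "=" in field:
            key, value = field.split("=", 1)
            props[key] = value
        elif field.startswith("-"):
            props[field[1:]] = False   # binary property explicitly off
        elif field:
            props[field] = True        # binary property on
    return line_type, (fields[0] if fields else ""), props
```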

@lianghai added the “external project”, “encoding: The Unicode Standard, etc”, and “tooling” labels on Oct 13, 2020
@simoncozens
Collaborator

@lianghai
Contributor Author

Omg I love these horrible project names; really, that heart reaction was for them!

I also noticed https://github.com/harfbuzz/packtab the other day, which is mostly about space efficiency and performance, I assume.

Swift’s UCD seems to come from ICU’s API. I wonder what Rust does—@Manishearth, can you point us to where the Rust community deals with preprocessing UCD data files? And I know @duerst has been doing a pretty good job at bridging the UCD and Ruby—Martin, would you like to briefly introduce Ruby’s status quo?

@Manishearth
Member

We typically have Python scripts like this one that ingest data files and spit out Rust code. Most of them are copied from each other and modified.
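
Roughly, the pattern is: read one UCD file, filter for what you care about, and print Rust source to stdout. A toy sketch of the idea (hypothetical file/property choice, much simpler than any of our real scripts):

```python
def emit_alphabetic_ranges(path="DerivedCoreProperties.txt"):
    """Read a UCD property file and print a Rust constant with the matching ranges."""
    ranges = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()
            if not line:
                continue
            fields = [field.strip() for field in line.split(";")]
            if fields[1] != "Alphabetic":
                continue
            lo, _, hi = fields[0].partition("..")
            ranges.append((int(lo, 16), int(hi or lo, 16)))
    print("pub const ALPHABETIC: &[(u32, u32)] = &[")
    for lo, hi in sorted(ranges):
        print(f"    (0x{lo:X}, 0x{hi:X}),")
    print("];")

if __name__ == "__main__":
    emit_alphabetic_ranges()
```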

We also have ucd-generate, which is a generalized Rust crate for dealing with this. Some projects use that.

@lianghai
Contributor Author
