
A highly configurable UCD processing utility for dumping the data into any subset and format #20

lianghai opened this issue Oct 13, 2020 · 4 comments

@lianghai
Contributor

The Unicode Character Database is such a fundamental set of data files, but it’s kinda tricky to access. Although each file generally has a well-defined syntax, many of the data files reference each other, and it’s just too much to process from scratch.
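
For a sense of what “well-defined syntax” means here, this is a minimal sketch (mine, not any existing tool’s) of reading just UnicodeData.txt; the per-file parsing really is trivial, and the pain is in everything the fields point to in *other* files (aliases, derived properties, names). The short field names below are my own labeling, borrowed from the conventional property aliases:

```python
from pathlib import Path

# The 15 semicolon-separated fields of UnicodeData.txt, labeled with the
# conventional short property aliases (my choice of keys).
FIELDS = (
    "cp", "na", "gc", "ccc", "bc", "dt_dm", "nv_decimal", "nv_digit",
    "nv_numeric", "bidi_m", "na1", "isc", "suc", "slc", "stc",
)

def parse_unicode_data(path):
    """Yield one dict per record; <…, First>/<…, Last> ranges are not expanded."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue
        record = dict(zip(FIELDS, line.split(";")))
        record["cp"] = int(record["cp"], 16)
        yield record

# for r in parse_unicode_data("UnicodeData.txt"):
#     print(f"U+{r['cp']:04X}  {r['na']}  gc={r['gc']}")
```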

UAX #42 is meant to provide a unified format that is more accessible (and it does do a good job), but it’s still incomplete and slightly biased. For example, it doesn’t include property/value aliases, and its property/value short names are stuck at version 6.1.0. Also, there are a lot of informative fields in the UCD that could be consumed much better if we processed them into a more accessible format.
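
As a concrete example of what the XML leaves out, here’s a hedged sketch of collecting the property value aliases from PropertyValueAliases.txt (the file layout is as documented in the UCD; the output shape is just my choice):

```python
from collections import defaultdict

def load_value_aliases(path="PropertyValueAliases.txt"):
    """Map property short name -> {any alias: canonical long value name}."""
    aliases = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()
            if not line:
                continue
            prop, *names = [field.strip() for field in line.split(";")]
            if prop == "ccc":        # ccc lines carry an extra numeric field first
                names = names[1:]
            long_name = names[1] if len(names) > 1 else names[0]
            for alias in names:
                aliases[prop][alias] = long_name
    return aliases

# load_value_aliases()["gc"]["Lu"]  ->  "Uppercase_Letter"
```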

Different downstream projects have different concerns and need to trade off factors like space, performance, information, and typing in different ways, and a helpful upstream project needs to be configurable enough to serve them all. Yes, I believe the planet is better off with a single, predominant UCD utility project.

We probably need to, again, start by documenting all the existing UCD processing tools (and even UCD-related APIs), including the internal tools used to generate the slimmed-down data files in various programming languages’ (standard) libraries.


See also a short thread between @alerque and me:

A bit annoyed by how UCD formats/models out there are all biased in their own ways. UAX #42 (XML) and ppucd are already the best but still not quite complete and are normalized a bit. Should we build a highly configurable UCD processor to dump whatever subset in whatever format?

Markus’s preparseucd.py (https://github.com/unicode-org/icu/blob/master/tools/unicode/py/preparseucd.py), which generates the ppucd (https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt), may be a natural starting point.

Having a robust utility to process and dump the data would make downstream projects easier to maintain. Updating data sets by importing and post-processing somebody else's dump in some other format is just a recipe for being out of date and introducing mistakes.
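
The ppucd.txt mentioned in that thread is itself a useful model: as far as I can tell, every data line is a semicolon-separated record whose first field is a line type (defaults, block, cp, …), followed by a code point or range and then prop=value pairs, with bare names switching binary properties on and a leading "-" switching them off. A rough, unverified sketch of reading one line:

```python
def parse_ppucd_line(line):
    """Split one ppucd.txt data line into (line type, code point/range, properties)."""
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    line_type, *fields = line.split(";")
    props = {}
    for field in fields[1:]:
        if "=" in field:
            key, value = field.split("=", 1)
            props[key] = value
        elif field.startswith("-"):
            props[field[1:]] = False   # binary property explicitly off
        elif field:
            props[field] = True        # binary property on
    return line_type, (fields[0] if fields else ""), props
```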

@lianghai added the “external project”, “encoding: The Unicode Standard, etc”, and “tooling” labels on Oct 13, 2020
@simoncozens
Collaborator

@lianghai
Contributor Author

Omg I love these horrible project names; really, that heart reaction was for them!

I also noticed https://github.com/harfbuzz/packtab the other day, which is mostly about space efficiency and performance, I assume.

Swift’s UCD seems to come from ICU’s API. I wonder what Rust does—@Manishearth, can you point us to where the Rust community deals with preprocessing UCD data files? And I know @duerst has been doing a pretty good job at bridging the UCD and Ruby—Martin, would you like to briefly introduce Ruby’s status quo?

@Manishearth
Member

We typically have Python scripts like this one that ingest data files and spit out Rust code. Most of them are copied from each other and modified.
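
Roughly, the pattern is: read one UCD file, filter for what you care about, and print Rust source to stdout. A toy sketch of the idea (hypothetical file/property choice, much simpler than any of our real scripts):

```python
def emit_alphabetic_ranges(path="DerivedCoreProperties.txt"):
    """Read a UCD property file and print a Rust constant with the matching ranges."""
    ranges = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()
            if not line:
                continue
            fields = [field.strip() for field in line.split(";")]
            if fields[1] != "Alphabetic":
                continue
            lo, _, hi = fields[0].partition("..")
            ranges.append((int(lo, 16), int(hi or lo, 16)))
    print("pub const ALPHABETIC: &[(u32, u32)] = &[")
    for lo, hi in sorted(ranges):
        print(f"    (0x{lo:X}, 0x{hi:X}),")
    print("];")

if __name__ == "__main__":
    emit_alphabetic_ranges()
```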

We also have ucd-generate, which is a generalized Rust crate for dealing with this. Some projects use that.

@lianghai
Contributor Author
