Update data tables to Unicode 7.0.0 #9

Merged: 8 commits, Jul 18, 2014

jiahao (Collaborator) commented Jul 18, 2014

Updates:

  1. Updates the data_generator.rb script. It now runs on modern versions of Ruby (> 1.8), and the hard-coded data tables are replaced with reads from the appropriate Unicode data (UNIDATA) files.
  2. Adds a new Makefile target, update, which downloads the relevant UNIDATA files and runs data_generator.rb to produce the file utf8proc_data.c.new.
  3. Updates utf8proc_data.c to the output generated by running make update against UNIDATA v7.0.0.
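
To illustrate the idea in item 1, here is a minimal sketch of reading records in the UnicodeData.txt format rather than hard-coding tables. It is not the actual data_generator.rb code: the names UnicodeRecord, parse_unicode_data and SAMPLE are hypothetical, and a few sample lines are embedded so that the snippet runs without downloading anything.

```ruby
require 'stringio'

# Hypothetical record type; only a few of the 15 UnicodeData.txt fields are kept.
UnicodeRecord = Struct.new(:code, :name, :category, :combining_class, :decomposition)

# Parse semicolon-separated UnicodeData.txt records from any IO-like object.
def parse_unicode_data(io)
  io.each_line.map do |line|
    fields = line.chomp.split(';', -1) # -1 keeps trailing empty fields
    UnicodeRecord.new(
      fields[0].hex,                       # code point, written in hex
      fields[1],                           # character name
      fields[2],                           # general category, e.g. "Lu"
      fields[3].to_i,                      # canonical combining class
      fields[5].empty? ? nil : fields[5]   # decomposition mapping, if any
    )
  end
end

# Three sample lines in UnicodeData.txt format (assumption: a real run would
# pass File.open("UnicodeData.txt") instead of this embedded sample).
SAMPLE = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
  030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
DATA

records = parse_unicode_data(StringIO.new(SAMPLE))
records.each do |r|
  puts format('U+%04X %s ccc=%d', r.code, r.category, r.combining_class)
end
```

Reading from the data files this way means that a new Unicode release only requires fetching new files, not editing the script.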

Observations:

  1. utf8proc.c contains #defined constants that may in principle have changed from v5.0 to v7.0, such as the constants marking the locations of Hangul, Unihan, etc. I haven't checked them, and it's probably not worth recomputing them for each new Unicode version.
  2. It looks like utf8proc implements an internal processing mode called LUMP, which is briefly described in lump.txt. As far as I can tell, this is a custom normalization mode that is separate from the Unicode standard, but I think we'll want to use it.
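
If someone did want to check the constants from observation 1, one possible approach is to compare them against the range markers in UnicodeData.txt. The sketch below does this for the Hangul syllable block only. The constant names are hypothetical stand-ins, not the identifiers used in utf8proc.c; the values are the ones given by the Unicode Standard's Hangul syllable algorithm, and the two sample lines follow the UnicodeData.txt range-marker format.

```ruby
# Hypothetical stand-ins for the hard-coded Hangul constants.
HANGUL_SBASE  = 0xAC00 # first precomposed Hangul syllable
HANGUL_LCOUNT = 19     # number of leading consonants
HANGUL_VCOUNT = 21     # number of vowels
HANGUL_TCOUNT = 28     # number of trailing consonants, plus one for "none"
HANGUL_SCOUNT = HANGUL_LCOUNT * HANGUL_VCOUNT * HANGUL_TCOUNT

# Range markers as they appear in UnicodeData.txt (embedded here so the
# sketch runs without the downloaded file).
RANGE_LINES = <<~DATA
  AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
  D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
DATA

# Return [first, last] code points of the Hangul syllable range.
def hangul_range(text)
  first = last = nil
  text.each_line do |line|
    code, name = line.split(';', 3)
    first = code.hex if name == '<Hangul Syllable, First>'
    last  = code.hex if name == '<Hangul Syllable, Last>'
  end
  [first, last]
end

first, last = hangul_range(RANGE_LINES)
constants_ok = first == HANGUL_SBASE && last - first + 1 == HANGUL_SCOUNT
puts "Hangul constants match the data: #{constants_ok}"
```

The same pattern could in principle be applied to the other range markers in the file, although, as noted above, it may not be worth doing for every release.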

Closes #1

stevengj (Member) commented

This is great!

As a sanity check, if you run it on the Unicode 5.0.0 files, does it reproduce the old utf8proc_data.c?

stevengj (Member) commented

And yes, LUMP is a custom normalization of utf8proc, which we should keep as-is for API compatibility.

jiahao (Collaborator, Author) commented Jul 18, 2014

> As a sanity check, if you run it on the Unicode 5.0.0 files, does it reproduce the old utf8proc_data.c?

Yes, see #8

stevengj added a commit that referenced this pull request Jul 18, 2014
Update data tables to Unicode 7.0.0
stevengj merged commit a5aeb49 into master Jul 18, 2014
jiahao deleted the cjh/markdata branch July 18, 2014 17:41
PallHaraldsson mentioned this pull request Oct 24, 2023
Linked issue closed by this pull request: update Unicode tables