Update data tables to Unicode 7.0.0 #9

jiahao · 2014-07-18T14:17:12Z

Updates:

Updates the data_generator.rb script. The script now runs on a modern version of Ruby (> 1.8), and its hard-coded data tables are replaced with reads from the appropriate Unicode data (UNIDATA) files.
Provides a new Makefile target, update, which automatically downloads the relevant UNIDATA files and runs data_generator.rb to produce the file utf8proc_data.c.new.
Updates utf8proc_data.c to the output generated by running make update against UNIDATA v7.0.0.
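
To illustrate what replacing a hard-coded table with a file read can look like, here is a minimal sketch. It is not the code in data_generator.rb: the helper name `code_points_with_property` is hypothetical, and the sample text is only a few illustrative lines in the layout used by UCD property files such as DerivedCoreProperties.txt.

```ruby
# Hypothetical helper, not taken from data_generator.rb.
# Collects every code point tagged with `property` in UCD-style text,
# where each data line looks like "XXXX ; Property # comment" or
# "XXXX..YYYY ; Property # comment".
def code_points_with_property(text, property)
  points = []
  text.each_line do |line|
    data = line.sub(/#.*/, '').strip # drop comments and blank lines
    next if data.empty?
    range, name = data.split(';').map(&:strip)
    next unless name == property
    first, last = range.split('..')
    last ||= first # a single code point is a range of length one
    points.concat((first.hex..last.hex).to_a)
  end
  points
end

# Abbreviated sample input; a real run would read the downloaded file instead.
sample = <<~UCD
  # Sample lines in the DerivedCoreProperties.txt layout
  00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
  0300..0302    ; Grapheme_Extend # Mn   [3] COMBINING GRAVE ACCENT..
  034F          ; Default_Ignorable_Code_Point # Mn       COMBINING GRAPHEME JOINER
UCD

p code_points_with_property(sample, 'Grapheme_Extend')
```

With a real file, the `sample` string would be replaced by something like `File.read('DerivedCoreProperties.txt')`, assuming the file has been downloaded into the working directory.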

Observations:

There are #defined constants in utf8proc.c which may in principle have changed from v5.0 to v7.0, such as the constants marking the location of Hangul, Unihan, etc. I haven't checked them, and it's probably not worth recomputing them for each new Unicode version.
It looks like utf8proc implements an internal processing mode called LUMP, which is briefly described in lump.txt. As far as I can tell, this is a custom normalization mode which is separate from the Unicode standard, but I think we'll want to use these.
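
If someone does want to check the block-location constants, one possible approach is to read the block boundaries out of UnicodeData.txt, which lists large blocks as a pair of "First"/"Last" lines. The sketch below is an assumption about how such a check could be written, not something this PR does; `block_range` is a hypothetical helper, and the two sample lines mirror the Hangul Syllable entries in that file.

```ruby
# Hypothetical check, not part of data_generator.rb.
# UnicodeData.txt lists large blocks as a pair of lines whose name field
# reads "<Name, First>" and "<Name, Last>"; this returns [first, last]
# as integers, or nil if the block is not found.
def block_range(unicode_data, block_name)
  first = last = nil
  unicode_data.each_line do |line|
    code, name = line.split(';')
    first = code.hex if name == "<#{block_name}, First>"
    last  = code.hex if name == "<#{block_name}, Last>"
  end
  first && last ? [first, last] : nil
end

# Two lines in the UnicodeData.txt layout; a real run would read the file.
sample = <<~UCD
  AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
  D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
UCD

first, last = block_range(sample, 'Hangul Syllable')
puts format('first=0x%04X last=0x%04X count=%d', first, last, last - first + 1)
```

The printed base and count can then be compared by hand against the corresponding #defined values in utf8proc.c.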

Closes #1


stevengj · 2014-07-18T15:31:55Z

This is great!

As a sanity check, if you run on the Unicode 5.0.0 files then does it reproduce the old utf8proc_data.c?

stevengj · 2014-07-18T15:39:10Z

And yes, LUMP is a custom normalization of utf8proc, which we should keep as-is for API compatibility.

jiahao · 2014-07-18T15:45:50Z

> As a sanity check, if you run on the Unicode 5.0.0 files then does it reproduce the old utf8proc_data.c?

Yes, see #8
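
For anyone repeating this sanity check, a rough sketch of the comparison follows. It assumes the regenerated output sits next to the existing file as utf8proc_data.c.new; `first_difference` is a hypothetical helper, and a plain `diff` on the two files would serve equally well.

```ruby
# Hypothetical helper: returns the 1-based number of the first line at
# which two texts differ, or nil when they are identical.
def first_difference(old_text, new_text)
  old_lines = old_text.lines
  new_lines = new_text.lines
  count = [old_lines.length, new_lines.length].max
  (0...count).each do |i|
    return i + 1 if old_lines[i] != new_lines[i]
  end
  nil
end

# Example with the real files (paths assumed):
#   first_difference(File.read('utf8proc_data.c'),
#                    File.read('utf8proc_data.c.new'))
```

A `nil` result means the generator reproduced the old table exactly; any other value points at the first line worth inspecting.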

Update data tables to Unicode 7.0.0

Mark location of CaseFolding.txt data

f0943b4

jiahao mentioned this pull request Jul 18, 2014

Update data tables to Unicode 7.0.0 #6

Closed

jiahao added 7 commits July 18, 2014 10:46

Mark Default_Ignorable_Code_Point data

7633bd0

Mark Grapheme_Extend data

aa9823f

Mark composition exclusion characters

5404ef8

Update data_generator so that it runs on ruby 2.2

7932385

Replace all explicitly marked regions with Ruby file read and regex s…

7d4541e

…ection matches

Add 'update' target to Makefile

13a72c1

This target downloads all necessary Unicode data files using curl and rebuilds utf8proc_data.c using data_generator.rb (saving the new copy to utf8proc_data.c.new).

Update utf8proc_data.c (generated by data_generator.rb)

b81326e

stevengj added a commit that referenced this pull request Jul 18, 2014

Merge pull request #9 from JuliaLang/cjh/markdata

a5aeb49

Update data tables to Unicode 7.0.0

stevengj merged commit a5aeb49 into master Jul 18, 2014

stevengj mentioned this pull request Jul 18, 2014

update utf8proc -> libmojibake JuliaLang/julia#7656

Closed

jiahao deleted the cjh/markdata branch July 18, 2014 17:41

stevengj mentioned this pull request Jul 19, 2014

build failure due to undefined UTF8PROC_BIDI_CLASS_LRI #14

Closed

PallHaraldsson mentioned this pull request Oct 24, 2023

Unicode 15.1 support #253

Merged

1 task
