Efficient Pure Ruby Unicode Normalization (eprun)

(pronounced e-prune)

The Talk

Please see the Internationalization & Unicode Conference 37 talk on Implementing Normalization in Pure Ruby - the Fast and Easy Way.

Directories and Files

lib/normalize.rb: The core normalization code.
lib/string_normalize.rb: String#normalize.
lib/generate.rb: Generation script, generates lib/normalize_tables.rb from data/UnicodeData.txt and data/CompositionExclusions.txt. This needs to be run only once when updating to a new Unicode version.
lib/normalize_tables.rb: Data used for normalization, automatically generated by lib/generate.rb.
data/: All three files in this directory are downloaded from the Unicode Character Database. They are currently at Unicode version 6.3. They need to be updated for a newer Unicode version (happens about once a year).
test/test_normalize.rb: Tests for lib/string_normalize.rb, using data/NormalizationTest.txt.
benchmark/benchmark.rb: Runs the benchmark with example text files. Automatically checks for existing gems/libraries; if e.g. the unicode_utils gem is not available, that part of the benchmark is skipped. This also applies to eprun, which will not be run on Ruby 1.8.
benchmark/Deutsch_.txt, Japanese_.txt, Korean_.txt, Vietnamese_.txt: Example texts extracted from random Wikipedia pages (see http://en.wikipedia.org/wiki/Wikipedia:Random). The languages were chosen based on the number of characters affected by normalization (Deutsch < Japanese < Vietnamese < Korean). The files differ somewhat in length, so results cannot be compared directly across languages. Adding other files whose names end in "_.txt" will include them in the benchmark.
benchmark/benchmark_results.rb: Results of benchmark for eprun, unicode_utils, ActiveSupport::Multibyte (version 3.0.0), twitter_cldr, and the unicode gem. Eprun, unicode_utils, and unicode normalizations are run 100 times each, ActiveSupport::Multibyte is run 10 times each, and twitter_cldr is run only 1 time (didn't want to wait any longer).
benchmark/benchmark_results_jruby.txt: Results of benchmark when using jruby (excludes unicode gem), version 1.7.4 (1.9.3p392, 2013-05-16 2390d3b on Java HotSpot(TM) Client VM 1.7.0_07-b10 [Windows 7-x86]).
benchmark/benchmark.pl: Runs the benchmark using Perl, both with xsub (i.e. C) version (run 100 times) and pure Perl version (run 10 times).
benchmark/benchmark_results_pl.txt: Results of Perl benchmarks.
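
To illustrate the kind of work a generation script such as lib/generate.rb has to do, here is a minimal sketch that pulls combining classes and decomposition mappings out of lines in the UnicodeData.txt format. It is not the actual script: the method name parse_unicode_data, the shape of the returned hash, and the embedded sample lines are assumptions made for this example, and the real layout of lib/normalize_tables.rb may well differ. The only thing taken from the Unicode Character Database is the field layout (field 0 is the code point, field 3 the canonical combining class, field 5 the decomposition mapping, with a leading <tag> marking compatibility mappings).

```ruby
# Hypothetical sketch, not the real lib/generate.rb.
# Three sample lines in UnicodeData.txt format (semicolon-separated fields).
SAMPLE_LINES = <<~DATA
  00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
  00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
  0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
DATA

# Returns three lookup tables keyed by code point (Integer).
def parse_unicode_data(text)
  combining_class = {}   # only non-zero classes are stored
  canonical = {}         # canonical decomposition mappings
  compatibility = {}     # compatibility decomposition mappings (tagged with <...>)

  text.each_line do |line|
    fields = line.chomp.split(';', -1)   # -1 keeps trailing empty fields
    next if fields.length < 6

    code = fields[0].hex
    ccc = fields[3].to_i
    combining_class[code] = ccc unless ccc.zero?

    mapping = fields[5]
    next if mapping.empty?

    if mapping.start_with?('<')
      # Drop the formatting tag, keep the code points that follow it.
      _tag, *targets = mapping.split(' ')
      compatibility[code] = targets.map(&:hex)
    else
      canonical[code] = mapping.split(' ').map(&:hex)
    end
  end

  { combining_class: combining_class, canonical: canonical, compatibility: compatibility }
end

tables = parse_unicode_data(SAMPLE_LINES)
```

With the sample lines above, tables[:canonical] maps U+00C5 to the sequence U+0041 U+030A, and tables[:combining_class] records class 230 for U+0301. A full generator would also have to take data/CompositionExclusions.txt into account when building composition tables, which this sketch leaves out.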

TODOs and Ideas

Publish as a gem, or several gems.
Deal better with encodings other than UTF-8.
Add methods such as String#nfc, String#nfd,...
Add methods for normalization variants.
See talk for more.
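
As a rough illustration of the String#nfc / String#nfd idea above, the shortcuts could be thin wrappers around a single normalization method. The sketch below is an assumption, not eprun code: because this README does not show the arguments of eprun's String#normalize, it delegates to String#unicode_normalize, which is built into Ruby 2.2 and later, as a stand-in. The method names nfkc and nfkd are likewise only a guess at what "..." in the list might cover.

```ruby
# Hypothetical shortcut methods; unicode_normalize (Ruby 2.2+) is used here
# only as a stand-in for eprun's own String#normalize.
class String
  def nfc
    unicode_normalize(:nfc)    # canonical decomposition, then composition
  end

  def nfd
    unicode_normalize(:nfd)    # canonical decomposition
  end

  def nfkc
    unicode_normalize(:nfkc)   # compatibility decomposition, then composition
  end

  def nfkd
    unicode_normalize(:nfkd)   # compatibility decomposition
  end
end

composed = "A\u030A".nfc       # "A" + COMBINING RING ABOVE becomes U+00C5
decomposed = "\u00FC".nfd      # U+00FC becomes "u" + COMBINING DIAERESIS
```

Reopening String keeps the call sites short, at the cost of adding methods to a core class; a refinement would be a more contained alternative.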
