Skip to content

Latest commit

 

History

History
312 lines (239 loc) · 13 KB

README.md

File metadata and controls

312 lines (239 loc) · 13 KB

Important

This crate is a better optimized implementation of the older unicode-id crate. This crate uses less static storage, and is able to classify both ASCII and non-ASCII codepoints with better performance, 2–10× faster than unicode-id.

Unicode ID_start

github crates.io docs.rs build status

Implementation of Unicode Standard Annex #31 for determining which char values are valid in programming language identifiers.

Changelog

1.3.0

  • Unicode 16.0.0

1.2.0

  • patch U+30FB KATAKANA MIDDLE DOT and U+FF65 HALFWIDTH KATAKANA MIDDLE DOT

Unicode 4.1 through Unicode 15 omitted these two characters from ID_Continue by accident. However, this accident was corrected in Unicode 15.1. Any JS VM that supports ES6+ but that uses a version of Unicode earlier than 15.1 will consider these to be a syntax error, so we deliberately omit these characters from the set of identifiers that are valid in both ES5 and ES6+. For more info see 2.2 in https://www.unicode.org/L2/L2023/23160-utc176-properties-recs.pdf

1.1.2

  • Unicode 15.1.0

1.1.1

  • Unicode 15.0.0

1.0.4

  • Unicode 14.0.0

Comparison of performance

The following table shows a comparison between five Unicode identifier implementations.

The static storage column shows the total size of static tables that the crate bakes into your binary, measured in 1000s of bytes.

The remaining columns show the cost per call to evaluate whether a single char has the ID_Start or ID_Continue Unicode property, comparing across different ratios of ASCII to non-ASCII codepoints in the input data.

static storage 0% nonascii 1% 10% 100% nonascii
unicode-ident 10.0 K 0.96 ns 0.95 ns 1.09 ns 1.55 ns
unicode-xid 11.5 K 1.88 ns 2.14 ns 3.48 ns 15.63 ns
ucd-trie 10.2 K 1.29 ns 1.28 ns 1.36 ns 2.15 ns
fst 138 K 55.1 ns 54.9 ns 53.2 ns 28.5 ns
roaring 66.1 K 2.78 ns 3.09 ns 3.37 ns 4.70 ns

Source code for the benchmark is provided in the bench directory of this repo and may be repeated by running cargo criterion.


Comparison of data structures

unicode-id

They use a sorted array of character ranges, and do a binary search to look up whether a given character lands inside one of those ranges.

static ID_Continue_table: [(char, char); 763] = [
    ('\u{30}', '\u{39}'),  // 0-9
    ('\u{41}', '\u{5a}'),  // A-Z('\u{e0100}', '\u{e01ef}'),
];

The static storage used by this data structure scales with the number of contiguous ranges of identifier codepoints in Unicode. Every table entry consumes 8 bytes, because it consists of a pair of 32-bit char values.

In some ranges of the Unicode codepoint space, this is quite a sparse representation – there are some ranges where tens of thousands of adjacent codepoints are all valid identifier characters. In other places, the representation is quite inefficient. A characater like µ (U+00B5) which is surrounded by non-identifier codepoints consumes 64 bits in the table, while it would be just 1 bit in a dense bitmap.

On a system with 64-byte cache lines, binary searching the table touches 7 cache lines on average. Each cache line fits only 8 table entries. Additionally, the branching performed during the binary search is probably mostly unpredictable to the branch predictor.

Overall, the crate ends up being about 10× slower on non-ASCII input compared to the fastest crate.

A potential improvement would be to pack the table entries more compactly. Rust's char type is a 21-bit integer padded to 32 bits, which means every table entry is holding 22 bits of wasted space, adding up to 3.9 K. They could instead fit every table entry into 6 bytes, leaving out some of the padding, for a 25% improvement in space used. With some cleverness it may be possible to fit in 5 bytes or even 4 bytes by storing a low char and an extent, instead of low char and high char. I don't expect that performance would improve much but this could be the most efficient for space across all the libraries, needing only about 7 K to store.

ucd-trie

Their data structure is a compressed trie set specifically tailored for Unicode codepoints. The design is credited to Raph Levien in rust-lang/rust#33098.

pub struct TrieSet {
    tree1_level1: &'static [u64; 32],
    tree2_level1: &'static [u8; 992],
    tree2_level2: &'static [u64],
    tree3_level1: &'static [u8; 256],
    tree3_level2: &'static [u8],
    tree3_level3: &'static [u64],
}

It represents codepoint sets using a trie to achieve prefix compression. The final states of the trie are embedded in leaves or "chunks", where each chunk is a 64-bit integer. Each bit position of the integer corresponds to whether a particular codepoint is in the set or not. These chunks are not just a compact representation of the final states of the trie, but are also a form of suffix compression. In particular, if multiple ranges of 64 contiguous codepoints have the same Unicode properties, then they all map to the same chunk in the final level of the trie.

Being tailored for Unicode codepoints, this trie is partitioned into three disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints [0, 0x800), the second [0x800, 0x10000) and the third [0x10000, 0x110000). These partitions conveniently correspond to the space of 1 or 2 byte UTF-8 encoded codepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints, respectively.

Lookups in this data structure are significantly more efficient than binary search. A lookup touches either 1, 2, or 3 cache lines based on which of the trie partitions is being accessed.

One possible performance improvement would be for this crate to expose a way to query based on a UTF-8 encoded string, returning the Unicode property corresponding to the first character in the string. Without such an API, the caller is required to tokenize their UTF-8 encoded input data into char, hand the char into ucd-trie, only for ucd-trie to undo that work by converting back into the variable-length representation for trie traversal.

fst

Uses a finite state transducer. This representation is built into ucd-generate but I am not aware of any advantage over the ucd-trie representation. In particular ucd-trie is optimized for storing Unicode properties while fst is not.

As far as I can tell, the main thing that causes fst to have large size and slow lookups for this use case relative to ucd-trie is that it does not specialize for the fact that only 21 of the 32 bits in a char are meaningful. There are some dense arrays in the structure with large ranges that could never possibly be used.

roaring

This crate is a pure-Rust implementation of Roaring Bitmap, a data structure designed for storing sets of 32-bit unsigned integers.

Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be hundreds of times faster and they often offer significantly better compression.

In this use case the performance was reasonably competitive but still substantially slower than the Unicode-optimized crates. Meanwhile the compression was significantly worse, requiring 6× as much storage for the data structure.

I also benchmarked the croaring crate which is an FFI wrapper around the C reference implementation of Roaring Bitmap. This crate was consistently about 15% slower than pure-Rust roaring, which could just be FFI overhead. I did not investigate further.

unicode-ident

This crate is most similar to the ucd-trie library, in that it's based on bitmaps stored in the leafs of a trie representation, achieving both prefix compression and suffix compression.

The key differences are:

  • Uses a single 2-level trie, rather than 3 disjoint partitions of different depth each.
  • Uses significantly larger chunks: 512 bits rather than 64 bits.
  • Compresses the ID_Start and ID_Continue properties together simultaneously, rather than duplicating identical trie leaf chunks across the two.

The following diagram show the ID_Start and ID_Continue Unicode boolean properties in uncompressed form, in row-major order:

ID_StartID_Continue
ID_Start bitmap ID_Continue bitmap

Uncompressed, these would take 140 K to store, which is beyond what would be reasonable. However, as you can see there is a large degree of similarity between the two bitmaps and across the rows, which lends well to compression.

This crate stores one 512-bit "row" of the above bitmaps in the leaf level of a trie, and a single additional level to index into the leafs. It turns out there are 124 unique 512-bit chunks across the two bitmaps so 7 bits are sufficient to index them.

The chunk size of 512 bits is selected as the size that minimizes the total size of the data structure. A smaller chunk, like 256 or 128 bits, would achieve better deduplication but require a larger index. A larger chunk would increase redundancy in the leaf bitmaps. 512 bit chunks are the optimum for total size of the index plus leaf bitmaps.

In fact since there are only 124 unique chunks, we can use an 8-bit index with a spare bit to index at the half-chunk level. This achieves an additional 8.5% compression by eliminating redundancies between the second half of any chunk and the first half of any other chunk. Note that this is not the same as using chunks which are half the size, because it does not necessitate raising the size of the trie's first level.

In contrast to binary search or the ucd-trie crate, performing lookups in this data structure is straight-line code with no need for branching.

is_id_start:
	mov eax, edi
	shr eax, 9
	lea rcx, [rip + unicode_ident::tables::TRIE_START]
	add rcx, rax
	xor eax, eax
	cmp edi, 201728
	cmovb rax, rcx
	test rax, rax
	lea rcx, [rip + .L__unnamed_1]
	cmovne rcx, rax
	movzx eax, byte ptr [rcx]
	shl rax, 5
	mov ecx, edi
	shr ecx, 3
	and ecx, 63
	add rcx, rax
	lea rax, [rip + unicode_ident::tables::LEAF]
	mov al, byte ptr [rax + rcx]
	and dil, 7
	mov ecx, edi
	shr al, cl
	and al, 1
	ret

License

Use of the Unicode Character Database, as this crate does, is governed by the UNICODE LICENSE V3.

All intellectual property within this crate that is not generated using the Unicode Character Database as input is licensed under either of Apache License, Version 2.0 or MIT license at your option.

The generated files incorporate tabular data derived from the Unicode Character Database, together with intellectual property from the original source code content of the crate. One must comply with the terms of both the Unicode License Agreement and either of the Apache license or MIT license when those generated files are involved.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be licensed as just described, without any additional terms or conditions.