-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ZeroTrie, an efficient string-to-int collection #2722
Conversation
Haven't reviewed but it's cool this can live in a separate crate. (In retrospect, might have made sense to do the same for ZeroMap) |
I was trying to decide between separate crate, icu_collections, or zerovec, so I started in a separate crate to defer that decision. |
Since this isn't reviewed yet, I'll make it draft and keep pushing things to the branch. I wrote up a builder implementation that I want to polish a bit more. |
With a bigger data set:
AsciiTrie2 is modestly smaller in both code and data than AsciiTrie, but it performs more than 2x worse. I don't know exactly why, because they should have the same number of operations. AsciiTrie is probably more cache-friendly. |
From discussion with @markusicu: try an AsciiTrie3 with binary instead of trinary. |
AsciiTrie3 is not better:
Last thing I'll try is a different varint encoding that puts the length in the first byte (more like UTF-8). |
My initial measurement has the new varint about 10% slower across the board (since all three implementations use it). Edit: I verified that the varint implementation is behaving correctly, so the 10% slower measurements seem accurate. |
Size comparison of the first_weekday_for_region.rs example (total code size of the main function with everything inlined) and the short_subtags data set in the PR:
* Assumes strings stored in a VarZeroVec with 2 bytes of hash metadata per string |
Current conclusions:
WDYT @zbraniecki @robertbastian @markusicu @echeran? Here are my potential paths forward on this PR:
|
With the direction we're going in #2885 we need key structures (sorted array, perfect hash, ascii trie) that are simple maps from |
The |
I'm planning to revisit this PR once ZeroHashMap (#2579) is merged so that I can add it to the benchmarks. |
I added the ZeroHashMap benches. Small bench: Benchmarking get/basic/AsciiTrie: Collecting 100 samples in estimated 5.0001 s (48M ite get/basic/AsciiTrie time: [104.34 ns 104.70 ns 105.11 ns] change: [-11.477% -11.201% -10.945%] (p = 0.00 < 0.05) Performance has improved. Found 25 outliers among 100 measurements (25.00%) 7 (7.00%) low severe 7 (7.00%) low mild 6 (6.00%) high mild 5 (5.00%) high severe Benchmarking get/basic/AsciiTrie2: Collecting 100 samples in estimated 5.0002 s (33M it get/basic/AsciiTrie2 time: [149.02 ns 149.53 ns 150.09 ns] change: [-0.1683% +0.1942% +0.5685%] (p = 0.31 > 0.05) No change in performance detected. Found 4 outliers among 100 measurements (4.00%) 2 (2.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe Benchmarking get/basic/AsciiTrie3: Collecting 100 samples in estimated 5.0000 s (34M it get/basic/AsciiTrie3 time: [146.90 ns 147.27 ns 147.62 ns] change: [-0.4641% -0.1158% +0.2224%] (p = 0.52 > 0.05) No change in performance detected. Found 18 outliers among 100 measurements (18.00%) 15 (15.00%) low mild 1 (1.00%) high mild 2 (2.00%) high severe Benchmarking get/basic/ZeroMap/usize: Collecting 100 samples in estimated 5.0005 s (44M get/basic/ZeroMap/usize time: [113.27 ns 113.57 ns 113.86 ns] change: [-0.8415% -0.5492% -0.2367%] (p = 0.00 < 0.05) Change within noise threshold. Found 22 outliers among 100 measurements (22.00%) 7 (7.00%) low severe 10 (10.00%) low mild 2 (2.00%) high mild 3 (3.00%) high severe Benchmarking get/basic/ZeroMap/u8: Collecting 100 samples in estimated 5.0003 s (52M it get/basic/ZeroMap/u8 time: [95.369 ns 95.625 ns 95.889 ns] change: [-1.2355% -0.5995% -0.0770%] (p = 0.03 < 0.05) Change within noise threshold. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild Benchmarking get/basic/HashMap: Collecting 100 samples in estimated 5.0001 s (59M itera get/basic/HashMap time: [85.354 ns 86.072 ns 87.147 ns] change: [+0.3076% +7.4054% +18.870%] (p = 0.08 > 0.05) No change in performance detected. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) high mild 3 (3.00%) high severe Benchmarking get/basic/ZeroHashMap/usize: Collecting 100 samples in estimated 5.0002 s get/basic/ZeroHashMap/usize time: [111.27 ns 111.55 ns 111.82 ns] change: [+2.3400% +2.6571% +2.9320%] (p = 0.00 < 0.05) Performance has regressed. Found 21 outliers among 100 measurements (21.00%) 1 (1.00%) low severe 16 (16.00%) low mild 4 (4.00%) high mild Benchmarking get/basic/ZeroHashMap/u8: Collecting 100 samples in estimated 5.0001 s (56 get/basic/ZeroHashMap/u8 time: [88.323 ns 89.146 ns 90.685 ns] change: [+5.9094% +6.8107% +7.8053%] (p = 0.00 < 0.05) Performance has regressed. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe Medium bench: Benchmarking get/subtags/AsciiTrie: Collecting 100 samples in estimated 5.0152 s (1.3M get/subtags/AsciiTrie time: [3.6910 µs 3.7000 µs 3.7086 µs] change: [-1.5978% -1.2175% -0.8624%] (p = 0.00 < 0.05) Change within noise threshold. Found 27 outliers among 100 measurements (27.00%) 10 (10.00%) low severe 12 (12.00%) low mild 3 (3.00%) high mild 2 (2.00%) high severe Benchmarking get/subtags/AsciiTrie2: Collecting 100 samples in estimated 5.0406 s (465k get/subtags/AsciiTrie2 time: [10.795 µs 10.834 µs 10.872 µs] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild Benchmarking get/subtags/AsciiTrie3: Collecting 100 samples in estimated 5.0089 s (460k get/subtags/AsciiTrie3 time: [10.911 µs 10.947 µs 10.983 µs] Found 24 outliers among 100 measurements (24.00%) 4 (4.00%) low severe 11 (11.00%) low mild 7 (7.00%) high mild 2 (2.00%) high severe Benchmarking get/subtags/ZeroMap/usize: Collecting 100 samples in estimated 5.0142 s (7 get/subtags/ZeroMap/usize time: [6.8100 µs 6.8416 µs 6.8770 µs] Found 14 outliers among 100 measurements (14.00%) 6 (6.00%) low mild 7 (7.00%) high mild 1 (1.00%) high severe Benchmarking get/subtags/ZeroMap/u8: Collecting 100 samples in estimated 5.0191 s (954k get/subtags/ZeroMap/u8 time: [5.2495 µs 5.2654 µs 5.2813 µs] Found 25 outliers among 100 measurements (25.00%) 20 (20.00%) low mild 1 (1.00%) high mild 4 (4.00%) high severe Benchmarking get/subtags/HashMap: Collecting 100 samples in estimated 5.0097 s (2.2M it get/subtags/HashMap time: [2.3302 µs 2.3380 µs 2.3457 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild Benchmarking get/subtags/ZeroHashMap/usize: Collecting 100 samples in estimated 5.0110 get/subtags/ZeroHashMap/usize time: [3.2209 µs 3.2286 µs 3.2360 µs] Found 28 outliers among 100 measurements (28.00%) 17 (17.00%) low severe 4 (4.00%) low mild 1 (1.00%) high mild 6 (6.00%) high severe Benchmarking get/subtags/ZeroHashMap/u8: Collecting 100 samples in estimated 5.0032 s ( get/subtags/ZeroHashMap/u8 time: [2.5595 µs 2.5658 µs 2.5717 µs] Found 8 outliers among 100 measurements (8.00%) 7 (7.00%) low mild 1 (1.00%) high mild Large bench (new): Benchmarking get/subtags_full/AsciiTrie: Collecting 100 samples in estimated 5.2934 s ( get/subtags_full/AsciiTrie time: [81.655 µs 81.892 µs 82.127 µs] Found 12 outliers among 100 measurements (12.00%) 9 (9.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe Benchmarking get/subtags_full/AsciiTrie2: Collecting 100 samples in estimated 5.4004 s get/subtags_full/AsciiTrie2 time: [177.85 µs 178.43 µs 179.05 µs] Found 12 outliers among 100 measurements (12.00%) 10 (10.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe Benchmarking get/subtags_full/AsciiTrie3: Collecting 100 samples in estimated 5.0660 s get/subtags_full/AsciiTrie3 time: [167.08 µs 167.55 µs 168.03 µs] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild Benchmarking get/subtags_full/ZeroMap/usize: Collecting 100 samples in estimated 5.0982 get/subtags_full/ZeroMap/usize time: [126.04 µs 126.37 µs 126.70 µs] Found 19 outliers among 100 measurements (19.00%) 8 (8.00%) low severe 7 (7.00%) low mild 2 (2.00%) high mild 2 (2.00%) high severe Benchmarking get/subtags_full/ZeroMap/u8: Collecting 100 samples in estimated 5.0934 s get/subtags_full/ZeroMap/u8 time: [112.43 µs 112.99 µs 113.66 µs] Found 26 outliers among 100 measurements (26.00%) 17 (17.00%) low mild 2 (2.00%) high mild 7 (7.00%) high severe Benchmarking get/subtags_full/HashMap: Collecting 100 samples in estimated 5.0237 s (21 get/subtags_full/HashMap time: [23.646 µs 23.729 µs 23.812 µs] Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) low mild 1 (1.00%) high mild Benchmarking get/subtags_full/ZeroHashMap/usize: Collecting 100 samples in estimated 5. get/subtags_full/ZeroHashMap/usize time: [33.772 µs 34.098 µs 34.575 µs] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high severe Benchmarking get/subtags_full/ZeroHashMap/u8: Collecting 100 samples in estimated 5.122 get/subtags_full/ZeroHashMap/u8 time: [25.991 µs 26.054 µs 26.113 µs] Found 24 outliers among 100 measurements (24.00%) 13 (13.00%) low severe 4 (4.00%) low mild 4 (4.00%) high mild 3 (3.00%) high severe Summary:
* The small data and small perf are both small, but they are not on the same data set Small = 7 elements |
In code, the displacements are |
#[cfg(feature = "alloc")] | ||
pub use cached_owned::PerfectByteHashMapCacheOwned; | ||
|
||
use ref_cast::RefCast; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: imo we should do this manually and not include the extra dep
(we should also consider making this a utility exported by zerovec_derive
since zerovec situations need it a lot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pub struct ZeroTrie<S>(pub(crate) ZeroTrieInner<S>); | ||
|
||
#[derive(Debug, Clone, Copy, PartialEq, Eq)] | ||
pub(crate) enum ZeroTrieInner<S> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: such data structure enums are often called flavors, if you're looking for a better name for this (ZeroTrieFlavor)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ExtendedCapacity(ZeroTrieExtendedCapacity<S>), | ||
} | ||
|
||
/// A data structure that compactly maps from ASCII strings to integers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: docs pertaining to the specific variant (for all three)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify? The docs for this one say "ASCII strings to integers" which is correct. I added a link back to ZeroTrie docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(specifically about the format of each and the general format)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the client-facing docs are okay, and the internal files now have fairly extensive docs about the internal difference between these types. Let me know where there might still lack clarity.
I wrote improved docs for just about everything now, public and private; you can view them either in source or rendered nicely with:
I think I fixed or responded to all the feedbacks so far. |
If it helps, here's what I consider to be the main building blocks:
|
I'll defer to Manish for the detailed logic reviews |
I went ahead and did my best to write docs for how the builder works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! | ||
//! When a node is prepended, it is shown in **boldface**. | ||
//! | ||
//! 1. Initialize the builder by setting `i=3`, `j=4`, `prefix_len=5` (the last string), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: i don't think i
and j
have been sufficiently explained for this to be obvious, could you explain them more above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestions on how I can be more clear than
//! - `prefix_len` indicates the byte index we are currently processing.
//! - `i` and `j` bracket a window of strings in the input that share the same prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is extremely unclear as to what "bracket a window of strings" means.
Examples scoped to explaining what they are may help. For example, here you can explicitly say "the only string sharing the same prefix here is adhgk, so i and j bracket it and you get 3..4
pub fn get_bsearch_only(mut trie: &[u8], mut ascii: &[u8]) -> Option<usize> { | ||
loop { | ||
let (b, x, i, search); | ||
(b, trie) = trie.split_first()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly from a correctness standpoint: this code is somewhat stateful when it comes to whether the current cursor in the trie is itself a valid trie, and it seems like a useful distinction to maintain in code. But not blocking, like i said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not carefully reviewed the builder aside from the docs. I hope to get to it this week, but I am comfortable having this land as-is.
a new license file, checker broke when #3875 merged |
Fixes #2721
Docs:
http://icu4x-pr-artifacts.storage.googleapis.com/gha/37107e708359b7549553b91a2f5481e2fc757831/docs/zerotrie/index.html