Port UCharsTrie #1264

dminor · 2021-11-04T23:10:34Z

Initial work in progress. This implements the iterator interface which is hopefully sufficient to unblock collator work.

Fixes #1202

zbraniecki · 2021-11-04T23:36:03Z

This reads well! Great work Dan!

I would like to see a bit of an overdocumentation on the decision to use u16 in a Rust project over u8. I think we should strongly document our decision both in README and in lib.rs docstrings to explain our rationale, and make it much easier for someone in the future to reason about revisiting it and switching to u8.

echeran · 2021-11-05T17:55:43Z

Cargo.toml

@@ -37,6 +37,7 @@ members = [
    "utils/fixed_decimal",
    "utils/litemap",
    "utils/pattern",
+    "utils/ucharstrie",


suggestion: To be consistent with naming, this component should be named just charstrie / CharsTrie.

The 'U' prefix convention in C/C++ wasn't necessary and considered less readable, ex: we went with the ICU4J name CodePointTrie over the ICU4C UCPTrie.

I was thinking about it as well. I ended up shyed away from suggesting this and my rationale is that char in Rust is u8. UChar is a unique Unicode type backed by u16.
Since this crate operates exclusively on u16 then chartrie may be slightly misleading.

My position here is very weakly held and in particular if we envision switching/extending the crate to support u8 then I'd be strongly in favor of following your advice.

Yes, I was afraid that char would imply u8 to a lot of people and this data structure is operating on u16.

@zbraniecki What do you see as the advantage of using u8 rather than u16 when the underlying data structure we're exporting from ICU4C uses u16?

Just alignment with how Rust thinks about chars. And if we were to explore it I'd expect us to convert the underlying data structure to u8.

Beyond that, maybe there is some performance/memory consideration - @hsivonen @sffc @Manishearth @markusicu @iainireland ?

We have plans to implement BytesTrie and that will support u8.

Perhaps Char16Trie ?

@markusicu convinced me last week that it is better to use u16 units rather than u8 units in the trie even if we are dealing with UTF-8 source text. Basically we can pack a lot more data into the u16 units.

iainireland · 2021-11-05T19:43:04Z

Dumb question: what are the differences between CharsTrie/BytesTrie/CodepointTrie? Is there a way to unify some/all of them instead of having multiple trie types?

sffc · 2021-11-05T20:27:03Z

Dumb question: what are the differences between CharsTrie/BytesTrie/CodepointTrie? Is there a way to unify some/all of them instead of having multiple trie types?

CharsTrie = map from strings to integers, with 16-bit units
BytesTrie = map from strings to integers, with 8-bit units
CodePointTrie = map from single code points to integers

CharsTrie vs BytesTrie is largely an implementation detail. I have asked that we run benchmarks to figure out if we need both or if we should just do one or the other. @markusicu suggested starting with CharsTrie because the implementation is simpler.

jira-pull-request-webhook · 2021-11-16T14:41:14Z

Notice: the branch changed across the force-push!

Cargo.lock is different
Cargo.toml is different
utils/char16trie/Cargo.toml is now changed in the branch
utils/char16trie/LICENSE is now changed in the branch
utils/char16trie/README.md is now changed in the branch
utils/char16trie/src/char16trie.rs is now changed in the branch
utils/char16trie/src/lib.rs is now changed in the branch
utils/char16trie/tests/test_util.rs is now changed in the branch
utils/char16trie/tests/testdata/empty.toml is now changed in the branch
utils/char16trie/tests/testdata/months.toml is now changed in the branch
utils/char16trie/tests/testdata/test_a_ab.toml is now changed in the branch
utils/char16trie/tests/testdata/test_a.toml is now changed in the branch
utils/char16trie/tests/testdata/test_branches.toml is now changed in the branch
utils/char16trie/tests/testdata/test_compact.toml is now changed in the branch
utils/char16trie/tests/testdata/test_long_branch.toml is now changed in the branch
utils/char16trie/tests/testdata/test_long_sequence.toml is now changed in the branch
utils/char16trie/tests/testdata/test_shortest_branch.toml is now changed in the branch
utils/char16trie/tests/trie_tests.rs is now changed in the branch
utils/ucharstrie/Cargo.toml is no longer changed in the branch
utils/ucharstrie/LICENSE is no longer changed in the branch
utils/ucharstrie/README.md is no longer changed in the branch
utils/ucharstrie/src/lib.rs is no longer changed in the branch
utils/ucharstrie/src/ucharstrie.rs is no longer changed in the branch
utils/ucharstrie/tests/test_util.rs is no longer changed in the branch
utils/ucharstrie/tests/testdata/empty.toml is no longer changed in the branch
utils/ucharstrie/tests/testdata/months.toml is no longer changed in the branch
utils/ucharstrie/tests/testdata/test_a_ab.toml is no longer changed in the branch
utils/ucharstrie/tests/testdata/test_a.toml is no longer changed in the branch
utils/ucharstrie/tests/testdata/test_branches.toml is no longer changed in the branch
utils/ucharstrie/tests/testdata/test_compact.toml is no longer changed in the branch
utils/ucharstrie/tests/testdata/test_long_branch.toml is no longer changed in the branch
utils/ucharstrie/tests/testdata/test_long_sequence.toml is no longer changed in the branch
utils/ucharstrie/tests/testdata/test_shortest_branch.toml is no longer changed in the branch
utils/ucharstrie/tests/trie_tests.rs is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

markusicu · 2021-11-17T23:55:07Z

Dumb question: what are the differences between CharsTrie/BytesTrie/CodepointTrie? Is there a way to unify some/all of them instead of having multiple trie types?

CharsTrie = map from strings to integers, with 16-bit units BytesTrie = map from strings to integers, with 8-bit units CodePointTrie = map from single code points to integers

Right. For details and design docs see #1202 (comment)

UCharsTrie like in ICU4C might work. In my post on issue 1202 I speculated about U16Trie as a plausible name in Rust.

Basically, this type of "string trie" matches sequences of 16-bit units. Used to be UChar in ICU4C before we changed to char16_t. char in Java. u16 in Rust.

CharsTrie vs BytesTrie is largely an implementation detail. I have asked that we run benchmarks to figure out if we need both or if we should just do one or the other. @markusicu suggested starting with CharsTrie because the implementation is simpler.

With a larger base unit, each trie node can encode more stuff in a single unit, which makes processing simpler and possibly a little faster. And Henri needs the 16-bit-units trie for the collation data.

In terms of data size, a BytesTrie might be more compact for a small set of strings because the deltas to be encoded are small. A 16-bit-units trie might be smaller for large sets of strings that need large deltas.

markusicu · 2021-11-18T00:12:41Z

UCharsTrie like in ICU4C might work. In my post on issue 1202 I speculated about U16Trie as a plausible name in Rust.

Basically, this type of "string trie" matches sequences of 16-bit units. Used to be UChar in ICU4C before we changed to char16_t. char in Java. u16 in Rust.

I see that it's "char16trie" now. That's probably ok.

For ICU I chose plural CharsTrie / UCharsTrie because it matches sequences of characters. But with the "16" in the name I am not sure where I would stick the plural "s".

markusicu

Partial review. Sorry for running so late!

markusicu · 2021-11-18T00:16:44Z

utils/char16trie/README.md

+This component provides a data structure for an time-efficient lookup of values
+associated to code points.


Suggested change

This component provides a data structure for an time-efficient lookup of values

associated to code points.

This component provides a data structure for a space-efficient and time-efficient lookup of

sequences of 16-bit units (commonly but not necessarily UTF-16 code units)

which map to integer values.

markusicu · 2021-11-18T00:21:23Z

utils/char16trie/README.md

+// A Char16Trie containing the ASCII characters 'a' and 'b'.
+let trie_data = vec![48, 97, 176, 98, 32868];
+let trie = Char16Trie {
+    data: ZeroVec::from_slice(trie_data.as_slice()),


FYI Henri asked for a more direct way to create this object from an array slice, or really also a more direct way to create what's currently called the Char16TrieIterator from an array slice.

If it's not necessary to make a named type for what's really just a slice of an array or of a ZeroVec, then it would be nice to cut out this layer.

@sffc Thoughts on this? Like Markus says, there's no real use for a separate struct from the iterator right now, but maybe you had something in mind for the future?

Data coming from a provider will most likely already be a ZeroVec, and there's no need to create a Char16Trie first, it's possible to create the Char16TrieIterator directly from a slice of ZeroVec data.

As in other areas, I tend to prefer the extra wrapper layer; it is a zero-cost abstraction that makes code a little bit more readable and maintainable. For example, I prefer this:

let trie: Char16Trie = // get trie from data provider let mut iter = trie.iter();

over this:

let zv: ZeroVec<u16> = // get zerovec from data provider let mut iter: Char16TrieIterator = zv.as_slice().into()

In the data provider itself, I also prefer the readability of the extra type:

#[icu_provider::data_struct] pub struct MyDataStruct<'data> { /// A map from strings to values trie: Char16Trie<'data>, }

Otherwise, you have this:

#[icu_provider::data_struct] pub struct MyDataStruct<'data> { /// A vector of numbers that represent a Char16Trie, a map from strings to values trie_data: ZeroVec<'data, u16>, }

That being said, the extra type may not be a zero-cost abstraction in serde; but if this is a problem, we can add #[serde(flatten)] to make it zero-cost.

My impression was that for the current use cases (collator, segmenter) the trie data is interleaved with other data at unpredictable offsets, so I think we'll end up constructing the iterator using a slice, rather than using the Char16Trie structure directly. But I don't see any harm in keeping Char16Trie, it might be more useful in the future, and I might be wrong about the current use cases.

I guess there can be both a named type for just the array slice and also an iterator constructor that takes an array slice, so that the intermediate type need not be explicitly constructed (just to unwrap it again).

Yes, in collation data and other structures, there are large arrays where certain slices carry string-trie data. For each one you just know where it starts, you don't even know (nor need to know) the length of the slice -- so the slices typically go from some start to the end of the array.

markusicu · 2021-11-18T00:21:53Z

utils/char16trie/README.md

+    data: ZeroVec::from_slice(trie_data.as_slice()),
+};
+
+let mut itor = Char16TrieIterator::new(trie.data.as_slice());


s/itor/iter/ for more common naming?

markusicu · 2021-11-18T00:26:52Z

utils/char16trie/README.md

+use icu_char16trie::char16trie::{Char16Trie, Char16TrieIterator, TrieResult};
+use zerovec::ZeroVec;
+
+// A Char16Trie containing the ASCII characters 'a' and 'b'.


Suggested change

// A Char16Trie containing the ASCII characters 'a' and 'b'.

// A Char16Trie mapping string "a" to 1 and string "ab" to 100.

markusicu · 2021-11-18T00:32:57Z

utils/char16trie/src/char16trie.rs

+    }
+}
+
+/// This struct represents a de-serialized Char16Trie that was exported from


Way too much information for public API docs. For public docs, all that a user should care about is that the data is an array of 16-bit units, and it encodes a structure for mapping strings/sequences to values. That is, the documentation that is on the class top level here: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html

What you have here is from the private/internal details that explain the nitty-gritty details to someone trying to understand the code and the data structure. It should be here somewhere, but it should remain private/internal.

It's good to have these docs but perhaps you can attach them to an internal module or somewhere like that.

markusicu · 2021-11-18T00:35:12Z

utils/char16trie/src/char16trie.rs

+    /// The input unit(s) continued a matching string and there is a value for
+    /// the string so far.  Another input byte/unit can continue a matching
+    /// string.
+    Intermediate(i32),


Suggested change

Intermediate(i32),

IntermediateValue(i32),

The Result type should have additional functions:

matches()

hasValue()

hasNext()

See

https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/BytesTrie.Result.html

https://github.com/unicode-org/icu/blob/main/icu4j/main/classes/core/src/com/ibm/icu/util/BytesTrie.java#L215-L235

https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/ustringtrie.h#L69-L95

Note how the order (and thus numeric values) of the enum values is chosen in both C++ and Java so that these important predicate functions are trivial to implement. (Each with a single conditional.)

markusicu · 2021-11-18T00:36:39Z

utils/char16trie/src/char16trie.rs

+    /// Once next() return TrieResult::NoMatch, all further calls to next()
+    /// will also return TrieResult::NoMatch.


Suggested change

/// Once next() return TrieResult::NoMatch, all further calls to next()

/// will also return TrieResult::NoMatch.

/// Once next() returns TrieResult::NoMatch, all further calls to next()

/// will also return TrieResult::NoMatch.

markusicu · 2021-11-18T00:41:07Z

utils/char16trie/src/char16trie.rs

+    /// The input unit(s) continued a matching string and there is a value for
+    /// the string so far. No further input byte/unit can continue a matching
+    /// string.
+    FinalValue(i32),


In C++ & Java, next() returns a Result that only expresses the state, not also the value. I chose that for performance. Some users will not care about the value of an intermediate result. Some users will set all values to 0 and only look at the state for a boolean "did match". These users can skip decoding the result, saving some cycles. That's why there is a separate getValue() function.

@Manishearth Do you think the compiler would optimize away retrieving the value if the user does not care about it? E.g. if the user is matching against FinalValue(_) would we avoid calling get_value() since we never use it?

Depends on whether it gets inlined. Hard to say.

The difference is bigger for IntermediateValue. Whenever you have more input (a common case), you want to just skip over the array units that encode the value, rather than read and decode them. If the API always returns the value, then it will always read & decode. Maybe the compiler could throw out the decoding, but I doubt it would be equivalent to skipping.

markusicu · 2021-11-18T00:50:23Z

utils/char16trie/src/char16trie.rs

+    ///     data: ZeroVec::from_slice(trie_data.as_slice()),
+    /// };
+    ///
+    /// let mut itor = Char16TrieIterator::new(trie.data.as_slice());


As above, it should be possible to make this object directly from the array. To the user, there is no value in the intermediate type/object currently called Char16Trie.

In my opinion, the primary way to construct this object should be a method on the Char16Trie. That said, it's harmless to also have an alternate constructor from the slice like @dminor has here.

markusicu · 2021-11-18T00:57:29Z

utils/char16trie/src/char16trie.rs

+    }
+
+    fn skip_value(&self, pos: usize) -> usize {
+        let lead_byte = self.get(pos);


Please change _byte suffixes to _unit: These are u16's. See https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/ucharstrie.h#L432

markusicu · 2021-11-18T03:17:37Z

utils/char16trie/src/char16trie.rs

+    /// Remaining length of a linear-match node, minus 1, or None if not in
+    /// such a node.
+    remaining_match_length: Option<usize>,


Why not use a simple signed integer like in ICU?

https://github.com/unicode-org/icu/blob/main/icu4j/main/classes/core/src/com/ibm/icu/util/CharsTrie.java#L51

I think Option<> is more idiomatic Rust code and also makes it impossible to have bugs where we forget to check that the value is -1 before using it.

There is very little checking for non-negative before using the length, and what there is takes advantage of loading one value instead of two (boolean+length) for performance. And the transition to end-of-linear-match-node is trivial: Decrement and store the length, and if we are done that's automatically -1.

markusicu · 2021-11-18T03:20:52Z

utils/char16trie/src/char16trie.rs

+    /// Index of next trie unit to read, or None if there are no more matches.
+    pos: Option<usize>,


FYI -- ICU4J is also using a pos index rather than C++'s pos pointer, and uses a negative pos value for "none". Simpler than Option<usize> in Rust, too? (Loads only one int, rather than bool-then-int as in the Option.)

https://github.com/unicode-org/icu/blob/main/icu4j/main/classes/core/src/com/ibm/icu/util/CharsTrie.java#L697

Like above, I think Option<> is more idiomatic Rust.

That's probably ok for pos, more than for remaining_match_length.

markusicu · 2021-11-18T03:30:59Z

utils/char16trie/src/char16trie.rs

+        }
+    }
+
+    pub fn get_value(&self, pos: usize) -> i32 {


If you always call this whenever you return a TrieResult that has a value, then the pos is always after the value, and then this public function, when called explicitly, will never return a real value, right? Making it useless to have this as public API?

My preference is for this code to work like in C++ & Java, and have the user call get_value() only when TrieResult.has_value() and the user cares to see the value.

I agree this should not be public if we're using the Result the way it is defined above. Let's iterate on that conversation before I change anything here.

I also prefer the TrieResult-based API.

markusicu · 2021-11-18T03:38:57Z

utils/char16trie/src/char16trie.rs

+    Intermediate(i32),
+}
+
+impl<'a> Char16TrieIterator<'a> {


There is very little of the public API here, compared to the ICU version. I understand that the ICU CharsTrie.Iterator (which iterates over the (string, value) pairs reachable from the current state) is out of scope here, but what about the rest?

https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html

At a minimum, I would expect to see nextForCodePoint(int cp):
https://github.com/unicode-org/icu/blob/main/icu4j/main/classes/core/src/com/ibm/icu/util/CharsTrie.java#L250-L263

I think Henri will need that for collation.

And since you intentionally omit API that is well-defined and exists in ICU, I think that somewhere in this file there should be a note pointing to ICU for the complete set.

This code started with what @makotokato implemented to support segmentation, and added what @hsivonen asked for to support the collator. I wasn't trying for a complete port, just trying to address our immediate needs.

Since we're not at 1.0 yet, we can expand the API later as needed. But maybe we should just do it now. I'd be interested in other's thoughts on this as well.

The most important method is nextForCodePoint(). In fact, the method labeled next() is dangerous, because it looks like you can pass a code point in there, but as soon as you hit a supplementary code point, you will get unexpected results.

In Rust, I suggest the following methods:

next(char cp) --> primary, "safe" method

next_u16(u16 code_unit) --> for lower-level use; the range of the int is enforced by the u16 type

next_u32(u32 cp) --> for when you have a u32 and don't want to convert it to a char

The data structure is defined as mapping from a sequence of 16-bit units to an int32_t. It should support that, even when the 16-bit units have nothing to do with UTF-16 strings, and so next(unit) should be the primary entry point.

"nextCodePoint()" is named suggestively, but there is no enforcement that the inputs must be Unicode strings; it's for a gentleman's agreement between data builder and user to store and provide surrogate pairs.

I found it very handy for next() to take an int32_t with values outside 0..ffff simply not matching: Neither the caller nor the implementation needs to do a range check, error values simply yield NO_MATCH. Easy to use, and efficient.

Remember that this is a low-level data structure.

Idea:

TrieResult is just a four-value enum like in C++ & Java; no built-in value

no public get_value()

next(unit) returns TrieResult, no way to get the value

next_with_value(unit) returns a (TrieResult, value) pair

next_code_point(cp) calls next(lead surrogate) + next(trail surrogate)

next_code_point_with_value(cp) calls next(lead surrogate) + next_with_value(trail surrogate)

the not-with-value functions could eagerly skip the value units, if the node has any

This also seems like a reasonable suggestion to me. I actually don't have that strong of an opinion on the API surface, but I would like for us to come to a decision on this before I write any more code :)

@sffc what do you think?

I don't feel strongly about when TrieResult and the value are or are not returned. I just want to avoid the footgun of the default next() method taking a i32 of which only 16 bits are ever used.

Since it seems like we can't agree on that, can we agree on suffixing all of the next methods?

next_u16(u16)

next_code_point(u32)

next_char(char)

I may be okay with next(u16). I am mainly hesitant to next(i32) because that's just an invitation for someone to misuse the API.

Discussion 2021-12-09:

Rust may perform optimizations over what C++ had done

C++ APIs are less ergonomic / have footguns

Conclusion: Start with Rusty APIs, and file a follow-up issue to investigate codegen and experiment with Markus's suggestions

I am ok with suffixes for the input type. How about the following then:

TrieResult is just a four-value enum like in C++ & Java; no built-in value

no public get_value()

next_for_u16(u16 unit) returns TrieResult, no way to get the value

next_for_char(char c) calls next(lead surrogate) + next(trail surrogate) if supplementary, and returns TrieResult, no way to get the value

next_with_value_for_u16(unit) returns a (TrieResult, value) pair

next_with_value_for_char(char c) ...

the not-with-value functions could eagerly skip the value units, if the node has any

From much work in C/C++/Java, where you frequently end up with "int" anyway, I would also want ..._for_i32(i32_unit) versions of these functions; same as the ..._for_u16(u16 unit) versions except if the unit is outside 0..ffff then no match. Note that in the trie implementation this happens automatically -- no range checks needed on the unit. I don't know whether the same kind of coding leads Rust devs to end up with i32 or u32 types, so we could omit these versions for now and see if call sites end up with inconvenient and unnecessary range checks.

PS: Taking an i32 or u32 does not mean taking a code point. It means being user-friendly in accepting a wider type for the 16-bit input unit, and if the value in the wider type is outside of the u16 range, then the trie code automagically yields no match, without requiring anyone to do an explicit range check. Accepting a code point or scalar value (char) means checking for BMP vs. supplementary, and taking the latter apart into a surrogate pair and calling next() twice.

My plan is to land this in experimental for now, since it has been difficult for me to find the time to work on this and this will unblock other people from using this code. I've filed #1439 to track the performance related points that Markus brought up during the review.

markusicu · 2021-11-18T03:44:13Z

utils/char16trie/tests/trie_tests.rs

+    assert_eq!(res, TrieResult::NoMatch);
+    assert_eq!(res, TrieResult::NoMatch);


duplicate line

markusicu · 2021-11-18T03:45:50Z

utils/char16trie/tests/trie_tests.rs

+    let res = itor.next('h' as i32);
+    assert_eq!(res, TrieResult::NoMatch);
+
+    let mut itor = Char16TrieIterator::new(trie.data.as_slice());


If you ported more of the API (e.g., first(unit) and/or reset()) then you could reuse the same object and wouldn't have to reload the test data.

This isn't reloading the data, it's just setting a pointer. My expectation for a Rust iterator is that once it is consumed, it won't be reused, which is why I didn't implement reset.

I also hope we can also do this:

Suggested change

let mut itor = Char16TrieIterator::new(trie.data.as_slice());

let mut iter = trie.iter();

sffc

(still working)

sffc · 2021-11-22T20:45:05Z

utils/char16trie/src/char16trie.rs

+    /// The input unit(s) continued a matching string and there is a value for
+    /// the string so far. No further input byte/unit can continue a matching
+    /// string.
+    FinalValue(i32),


Suggestion: In CodePointTrie, we return an upgraded type (like Script) from the trie, rather than a plain integer. It would be nice if we could do the same thing here. In other words, add a type parameter T onto Char16Trie<T> and all other types, and then use the T here when returning the values.

@hsivonen What do you think of this suggestion?

For collation, the result is a uint32_t which encodes a compact collation element value.

I don't know if the added complexity of genericizing the value type is worth it.

sffc · 2021-11-22T20:45:51Z

utils/char16trie/src/char16trie.rs

+    /// The input unit(s) continued a matching string and there is a value for
+    /// the string so far. No further input byte/unit can continue a matching
+    /// string.
+    FinalValue(i32),


Question: where does the type i32 come from? Why is it signed? Does the data structure support other values? Does it support storing a u16 or u8 to save on space? @markusicu

I used i32 because that is what the ICU4C implementation uses for getValue: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/ucharstrie.h#L258

The data structure is defined as mapping sequences of uint16_t to int32_t, leaving it up to the user to interpret the values. Some don't care about the value at all, only about the TrieResult. In many cases, these will be small-ish integers. For collation, it will be interpreted as a uint32_t. (AFAIR the internal encoding favors non-negative values, essentially storing uint32_t in a variable-length way -- but programmers like to use signed integers by default.)

sffc

Mostly just some suggestions regarding API shape. Otherwise it looks quite good!

sffc · 2021-11-22T21:13:27Z

utils/char16trie/src/char16trie.rs

+/// - [ICU4J CharsTrie](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html) API.
+#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
+#[derive(Clone)]
+pub struct Char16Trie<'data> {


Suggestion: a higher-level API that may be more user-friendly in certain situations would be

impl<'data, T> Char16Trie<'data, T> { pub fn get(s: &str) -> Option<T> }

This would expand to a loop walking over the Char16TrieIterator with TrieResult.

Note that this is something nice that a standalone Char16Trie enables us to do.

Suggestion: a higher-level API that may be more user-friendly in certain situations would be

impl<'data, T> Char16Trie<'data, T> { pub fn get(s: &str) -> Option<T> }

This would expand to a loop walking over the Char16TrieIterator with TrieResult.

Note that this is something nice that a standalone Char16Trie enables us to do.

Is there a follow-up issue to explore this possibility?

Thanks for catching that, filed #1445.

sffc · 2021-11-22T21:18:50Z

utils/char16trie/tests/trie_tests.rs

+    let res = itor.next('h' as i32);
+    assert_eq!(res, TrieResult::NoMatch);
+
+    let mut itor = Char16TrieIterator::new(trie.data.as_slice());


I also hope we can also do this:

Suggested change

let mut itor = Char16TrieIterator::new(trie.data.as_slice());

let mut iter = trie.iter();

Import of the UCharsTrie implementation as of: makotokato/dictionary_segmenter@4d79bbc

jira-pull-request-webhook · 2021-12-03T20:48:13Z

Notice: the branch changed across the force-push!

Cargo.lock is different
utils/char16trie/Cargo.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

utils/char16trie/src/char16trie.rs

Unblocking code that is landing in experimental, as discussed.

dminor · 2021-12-22T19:08:46Z

@sffc I've filed issues for the outstanding comments from you and Markus and moved the code from util to experimental. I'm struggling to find enough time to work on this and I would prefer to land it under experimental to unblock Henri and Makoto from being able to use and improve on this as necessary.

sffc · 2021-12-22T19:38:46Z

@dminor There are still some open comments about API shape. Can you make sure there are follow-ups filed for those?

Also, please fix the build/test failures found by CI.

sffc

(setting review bit)

dminor · 2021-12-23T12:51:28Z

@dminor There are still some open comments about API shape. Can you make sure there are follow-ups filed for those?

Also, please fix the build/test failures found by CI.

I filed #1444 for making the value type generic. I think the main discussion about API shape is covered by #1439 since we're trying to decide if a less rusty api would give us significant performance benefits.

sffc mentioned this pull request Nov 5, 2021

Implementation of icu's bytestrie structure to Rust #1155

Closed

echeran reviewed Nov 5, 2021

View reviewed changes

dminor force-pushed the issue-1202 branch from 1d4cc33 to 7c21cd7 Compare November 16, 2021 14:41

dminor marked this pull request as ready for review November 16, 2021 14:42

dminor requested a review from a team as a code owner November 16, 2021 14:42

dminor requested review from sffc, echeran, zbraniecki and hsivonen November 16, 2021 14:42

markusicu reviewed Nov 18, 2021

View reviewed changes

sffc reviewed Nov 22, 2021

View reviewed changes

sffc requested changes Nov 22, 2021

View reviewed changes

dminor added 11 commits December 3, 2021 15:46

Add ucharstrie to utils

8b07f6f

Import uchars_trie

00a3a5b

Import of the UCharsTrie implementation as of: makotokato/dictionary_segmenter@4d79bbc

Rename uchars_trie to ucharstrie and add to lib.

0f5c533

Add get_value method

3b8bac9

Add months unit test

4004346

Fold trie into ucharstrie

8fd9c0c

Add UCharsTrieIterator

591ed9d

Return value as part of TrieResult

385bc5b

Don't truncate input character to u16

443fb8d

Use ZeroVec

e63001d

Add additional unit tests and fix bugs

e7d5f59

dminor added 5 commits December 3, 2021 15:46

Update generated readme

e28fbca

Fix issues found in CI

fc1c76a

Fixup generated readme

4d8f172

Address review feedback

7dd27c6

Bump ZV version

4bbe117

dminor force-pushed the issue-1202 branch from 7d7ef49 to 4bbe117 Compare December 3, 2021 20:48

sffc previously requested changes Dec 7, 2021

View reviewed changes

utils/char16trie/src/char16trie.rs Outdated Show resolved Hide resolved

utils/char16trie/src/char16trie.rs Outdated Show resolved Hide resolved

dminor added the discuss-priority Discuss at the next ICU4X meeting label Dec 8, 2021

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Dec 9, 2021

echeran removed their request for review December 9, 2021 18:30

dminor added 2 commits December 22, 2021 12:58

Address some review feedback

d89c291

Move char16trie to experimental

54ad444

dminor mentioned this pull request Dec 22, 2021

Investigate char16trie codegen and performance enhancements #1439

Open

dminor removed request for zbraniecki and hsivonen December 22, 2021 18:21

dminor added 2 commits December 22, 2021 13:24

Add todo for get panic

09a8ef9

Merge branch 'main' into issue-1202

5e7da61

dminor requested a review from sffc December 22, 2021 19:06

sffc reviewed Dec 22, 2021

View reviewed changes

Use as_ule_slice

ed9f96d

dminor requested a review from sffc December 23, 2021 12:44

dminor mentioned this pull request Dec 23, 2021

Consider returning an upgraded type from char16trie #1444

Closed

sffc approved these changes Dec 23, 2021

View reviewed changes

dminor mentioned this pull request Dec 23, 2021

Consider adding search interface to Char16Trie #1445

Open

dminor merged commit 1a87cea into unicode-org:main Dec 23, 2021

		This component provides a data structure for an time-efficient lookup of values
		associated to code points.

-This component provides a data structure for an time-efficient lookup of values
-associated to code points.
+This component provides a data structure for a space-efficient and time-efficient lookup of
+sequences of 16-bit units (commonly but not necessarily UTF-16 code units)
+which map to integer values.

	// A Char16Trie containing the ASCII characters 'a' and 'b'.
	// A Char16Trie mapping string "a" to 1 and string "ab" to 100.

		/// Once next() return TrieResult::NoMatch, all further calls to next()
		/// will also return TrieResult::NoMatch.

		/// Index of next trie unit to read, or None if there are no more matches.
		pos: Option<usize>,

		assert_eq!(res, TrieResult::NoMatch);
		assert_eq!(res, TrieResult::NoMatch);

	let mut itor = Char16TrieIterator::new(trie.data.as_slice());
	let mut iter = trie.iter();

Port UCharsTrie #1264

Port UCharsTrie #1264

Conversation

dminor commented Nov 4, 2021

zbraniecki commented Nov 4, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zbraniecki Nov 5, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iainireland commented Nov 5, 2021

sffc commented Nov 5, 2021 • edited Loading

jira-pull-request-webhook bot commented Nov 16, 2021

markusicu commented Nov 17, 2021

markusicu commented Nov 18, 2021

markusicu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sffc Nov 22, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sffc Dec 8, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

markusicu Dec 10, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sffc left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sffc left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zbraniecki Nov 5, 2021 •

edited

Loading

sffc commented Nov 5, 2021 •

edited

Loading

sffc Nov 22, 2021 •

edited

Loading

sffc Dec 8, 2021 •

edited

Loading

markusicu Dec 10, 2021 •

edited

Loading