-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port UCharsTrie #1264
Port UCharsTrie #1264
Conversation
This reads well! Great work Dan! I would like to see a bit of an overdocumentation on the decision to use u16 in a Rust project over u8. I think we should strongly document our decision both in README and in lib.rs docstrings to explain our rationale, and make it much easier for someone in the future to reason about revisiting it and switching to u8. |
Cargo.toml
Outdated
@@ -37,6 +37,7 @@ members = [ | |||
"utils/fixed_decimal", | |||
"utils/litemap", | |||
"utils/pattern", | |||
"utils/ucharstrie", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: To be consistent with naming, this component should be named just charstrie
/ CharsTrie
.
The 'U' prefix convention in C/C++ wasn't necessary and considered less readable, ex: we went with the ICU4J name CodePointTrie
over the ICU4C UCPTrie
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about it as well. I ended up shyed away from suggesting this and my rationale is that char
in Rust is u8
. UChar
is a unique Unicode type backed by u16
.
Since this crate operates exclusively on u16
then chartrie
may be slightly misleading.
My position here is very weakly held and in particular if we envision switching/extending the crate to support u8
then I'd be strongly in favor of following your advice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was afraid that char would imply u8
to a lot of people and this data structure is operating on u16
.
@zbraniecki What do you see as the advantage of using u8
rather than u16
when the underlying data structure we're exporting from ICU4C uses u16
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just alignment with how Rust thinks about chars. And if we were to explore it I'd expect us to convert the underlying data structure to u8.
Beyond that, maybe there is some performance/memory consideration - @hsivonen @sffc @Manishearth @markusicu @iainireland ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have plans to implement BytesTrie
and that will support u8
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps Char16Trie
?
@markusicu convinced me last week that it is better to use u16 units rather than u8 units in the trie even if we are dealing with UTF-8 source text. Basically we can pack a lot more data into the u16 units.
Dumb question: what are the differences between CharsTrie/BytesTrie/CodepointTrie? Is there a way to unify some/all of them instead of having multiple trie types? |
CharsTrie = map from strings to integers, with 16-bit units CharsTrie vs BytesTrie is largely an implementation detail. I have asked that we run benchmarks to figure out if we need both or if we should just do one or the other. @markusicu suggested starting with CharsTrie because the implementation is simpler. |
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Right. For details and design docs see #1202 (comment) UCharsTrie like in ICU4C might work. In my post on issue 1202 I speculated about U16Trie as a plausible name in Rust. Basically, this type of "string trie" matches sequences of 16-bit units. Used to be UChar in ICU4C before we changed to char16_t. char in Java. u16 in Rust.
With a larger base unit, each trie node can encode more stuff in a single unit, which makes processing simpler and possibly a little faster. And Henri needs the 16-bit-units trie for the collation data. In terms of data size, a BytesTrie might be more compact for a small set of strings because the deltas to be encoded are small. A 16-bit-units trie might be smaller for large sets of strings that need large deltas. |
I see that it's "char16trie" now. That's probably ok. For ICU I chose plural CharsTrie / UCharsTrie because it matches sequences of characters. But with the "16" in the name I am not sure where I would stick the plural "s". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review. Sorry for running so late!
utils/char16trie/README.md
Outdated
This component provides a data structure for an time-efficient lookup of values | ||
associated to code points. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This component provides a data structure for an time-efficient lookup of values | |
associated to code points. | |
This component provides a data structure for a space-efficient and time-efficient lookup of | |
sequences of 16-bit units (commonly but not necessarily UTF-16 code units) | |
which map to integer values. |
utils/char16trie/README.md
Outdated
// A Char16Trie containing the ASCII characters 'a' and 'b'. | ||
let trie_data = vec![48, 97, 176, 98, 32868]; | ||
let trie = Char16Trie { | ||
data: ZeroVec::from_slice(trie_data.as_slice()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI Henri asked for a more direct way to create this object from an array slice, or really also a more direct way to create what's currently called the Char16TrieIterator from an array slice.
If it's not necessary to make a named type for what's really just a slice of an array or of a ZeroVec, then it would be nice to cut out this layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sffc Thoughts on this? Like Markus says, there's no real use for a separate struct from the iterator right now, but maybe you had something in mind for the future?
Data coming from a provider will most likely already be a ZeroVec, and there's no need to create a Char16Trie first, it's possible to create the Char16TrieIterator directly from a slice of ZeroVec data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in other areas, I tend to prefer the extra wrapper layer; it is a zero-cost abstraction that makes code a little bit more readable and maintainable. For example, I prefer this:
let trie: Char16Trie = // get trie from data provider
let mut iter = trie.iter();
over this:
let zv: ZeroVec<u16> = // get zerovec from data provider
let mut iter: Char16TrieIterator = zv.as_slice().into()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the data provider itself, I also prefer the readability of the extra type:
#[icu_provider::data_struct]
pub struct MyDataStruct<'data> {
/// A map from strings to values
trie: Char16Trie<'data>,
}
Otherwise, you have this:
#[icu_provider::data_struct]
pub struct MyDataStruct<'data> {
/// A vector of numbers that represent a Char16Trie, a map from strings to values
trie_data: ZeroVec<'data, u16>,
}
That being said, the extra type may not be a zero-cost abstraction in serde; but if this is a problem, we can add #[serde(flatten)]
to make it zero-cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My impression was that for the current use cases (collator, segmenter) the trie data is interleaved with other data at unpredictable offsets, so I think we'll end up constructing the iterator using a slice, rather than using the Char16Trie
structure directly. But I don't see any harm in keeping Char16Trie
, it might be more useful in the future, and I might be wrong about the current use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there can be both a named type for just the array slice and also an iterator constructor that takes an array slice, so that the intermediate type need not be explicitly constructed (just to unwrap it again).
Yes, in collation data and other structures, there are large arrays where certain slices carry string-trie data. For each one you just know where it starts, you don't even know (nor need to know) the length of the slice -- so the slices typically go from some start to the end of the array.
utils/char16trie/README.md
Outdated
data: ZeroVec::from_slice(trie_data.as_slice()), | ||
}; | ||
|
||
let mut itor = Char16TrieIterator::new(trie.data.as_slice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/itor/iter/ for more common naming?
utils/char16trie/README.md
Outdated
use icu_char16trie::char16trie::{Char16Trie, Char16TrieIterator, TrieResult}; | ||
use zerovec::ZeroVec; | ||
|
||
// A Char16Trie containing the ASCII characters 'a' and 'b'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// A Char16Trie containing the ASCII characters 'a' and 'b'. | |
// A Char16Trie mapping string "a" to 1 and string "ab" to 100. |
utils/char16trie/src/char16trie.rs
Outdated
} | ||
} | ||
|
||
/// This struct represents a de-serialized Char16Trie that was exported from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Way too much information for public API docs. For public docs, all that a user should care about is that the data is an array of 16-bit units, and it encodes a structure for mapping strings/sequences to values. That is, the documentation that is on the class top level here: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html
What you have here is from the private/internal details that explain the nitty-gritty details to someone trying to understand the code and the data structure. It should be here somewhere, but it should remain private/internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good to have these docs but perhaps you can attach them to an internal module or somewhere like that.
utils/char16trie/src/char16trie.rs
Outdated
/// The input unit(s) continued a matching string and there is a value for | ||
/// the string so far. Another input byte/unit can continue a matching | ||
/// string. | ||
Intermediate(i32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intermediate(i32), | |
IntermediateValue(i32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Result type should have additional functions:
- matches()
- hasValue()
- hasNext()
See
- https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/BytesTrie.Result.html
- https://github.com/unicode-org/icu/blob/main/icu4j/main/classes/core/src/com/ibm/icu/util/BytesTrie.java#L215-L235
- https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/ustringtrie.h#L69-L95
Note how the order (and thus numeric values) of the enum values is chosen in both C++ and Java so that these important predicate functions are trivial to implement. (Each with a single conditional.)
utils/char16trie/src/char16trie.rs
Outdated
/// Once next() return TrieResult::NoMatch, all further calls to next() | ||
/// will also return TrieResult::NoMatch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Once next() return TrieResult::NoMatch, all further calls to next() | |
/// will also return TrieResult::NoMatch. | |
/// Once next() returns TrieResult::NoMatch, all further calls to next() | |
/// will also return TrieResult::NoMatch. |
utils/char16trie/src/char16trie.rs
Outdated
/// The input unit(s) continued a matching string and there is a value for | ||
/// the string so far. No further input byte/unit can continue a matching | ||
/// string. | ||
FinalValue(i32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In C++ & Java, next() returns a Result that only expresses the state, not also the value. I chose that for performance. Some users will not care about the value of an intermediate result. Some users will set all values to 0 and only look at the state for a boolean "did match". These users can skip decoding the result, saving some cycles. That's why there is a separate getValue() function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Manishearth Do you think the compiler would optimize away retrieving the value if the user does not care about it? E.g. if the user is matching against FinalValue(_)
would we avoid calling get_value()
since we never use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on whether it gets inlined. Hard to say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference is bigger for IntermediateValue. Whenever you have more input (a common case), you want to just skip over the array units that encode the value, rather than read and decode them. If the API always returns the value, then it will always read & decode. Maybe the compiler could throw out the decoding, but I doubt it would be equivalent to skipping.
utils/char16trie/src/char16trie.rs
Outdated
/// data: ZeroVec::from_slice(trie_data.as_slice()), | ||
/// }; | ||
/// | ||
/// let mut itor = Char16TrieIterator::new(trie.data.as_slice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, it should be possible to make this object directly from the array. To the user, there is no value in the intermediate type/object currently called Char16Trie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the primary way to construct this object should be a method on the Char16Trie. That said, it's harmless to also have an alternate constructor from the slice like @dminor has here.
utils/char16trie/src/char16trie.rs
Outdated
} | ||
|
||
fn skip_value(&self, pos: usize) -> usize { | ||
let lead_byte = self.get(pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change _byte suffixes to _unit: These are u16's. See https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/ucharstrie.h#L432
utils/char16trie/src/char16trie.rs
Outdated
/// Remaining length of a linear-match node, minus 1, or None if not in | ||
/// such a node. | ||
remaining_match_length: Option<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use a simple signed integer like in ICU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Option<>
is more idiomatic Rust code and also makes it impossible to have bugs where we forget to check that the value is -1 before using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is very little checking for non-negative before using the length, and what there is takes advantage of loading one value instead of two (boolean+length) for performance. And the transition to end-of-linear-match-node is trivial: Decrement and store the length, and if we are done that's automatically -1.
utils/char16trie/src/char16trie.rs
Outdated
/// Index of next trie unit to read, or None if there are no more matches. | ||
pos: Option<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI -- ICU4J is also using a pos index rather than C++'s pos pointer, and uses a negative pos value for "none". Simpler than Option<usize>
in Rust, too? (Loads only one int, rather than bool-then-int as in the Option.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like above, I think Option<> is more idiomatic Rust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably ok for pos, more than for remaining_match_length.
utils/char16trie/src/char16trie.rs
Outdated
} | ||
} | ||
|
||
pub fn get_value(&self, pos: usize) -> i32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you always call this whenever you return a TrieResult that has a value, then the pos is always after the value, and then this public function, when called explicitly, will never return a real value, right? Making it useless to have this as public API?
My preference is for this code to work like in C++ & Java, and have the user call get_value() only when TrieResult.has_value() and the user cares to see the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should not be public if we're using the Result the way it is defined above. Let's iterate on that conversation before I change anything here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also prefer the TrieResult-based API.
utils/char16trie/src/char16trie.rs
Outdated
Intermediate(i32), | ||
} | ||
|
||
impl<'a> Char16TrieIterator<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is very little of the public API here, compared to the ICU version. I understand that the ICU CharsTrie.Iterator (which iterates over the (string, value) pairs reachable from the current state) is out of scope here, but what about the rest?
https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html
At a minimum, I would expect to see nextForCodePoint(int cp):
https://github.com/unicode-org/icu/blob/main/icu4j/main/classes/core/src/com/ibm/icu/util/CharsTrie.java#L250-L263
I think Henri will need that for collation.
And since you intentionally omit API that is well-defined and exists in ICU, I think that somewhere in this file there should be a note pointing to ICU for the complete set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code started with what @makotokato implemented to support segmentation, and added what @hsivonen asked for to support the collator. I wasn't trying for a complete port, just trying to address our immediate needs.
Since we're not at 1.0 yet, we can expand the API later as needed. But maybe we should just do it now. I'd be interested in other's thoughts on this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most important method is nextForCodePoint()
. In fact, the method labeled next()
is dangerous, because it looks like you can pass a code point in there, but as soon as you hit a supplementary code point, you will get unexpected results.
In Rust, I suggest the following methods:
next(char cp)
--> primary, "safe" methodnext_u16(u16 code_unit)
--> for lower-level use; the range of the int is enforced by the u16 typenext_u32(u32 cp)
--> for when you have a u32 and don't want to convert it to a char
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data structure is defined as mapping from a sequence of 16-bit units to an int32_t. It should support that, even when the 16-bit units have nothing to do with UTF-16 strings, and so next(unit) should be the primary entry point.
"nextCodePoint()" is named suggestively, but there is no enforcement that the inputs must be Unicode strings; it's for a gentleman's agreement between data builder and user to store and provide surrogate pairs.
I found it very handy for next() to take an int32_t with values outside 0..ffff simply not matching: Neither the caller nor the implementation needs to do a range check, error values simply yield NO_MATCH. Easy to use, and efficient.
Remember that this is a low-level data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea:
- TrieResult is just a four-value enum like in C++ & Java; no built-in value
- no public get_value()
- next(unit) returns TrieResult, no way to get the value
- next_with_value(unit) returns a (TrieResult, value) pair
- next_code_point(cp) calls next(lead surrogate) + next(trail surrogate)
- next_code_point_with_value(cp) calls next(lead surrogate) + next_with_value(trail surrogate)
- the not-with-value functions could eagerly skip the value units, if the node has any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems like a reasonable suggestion to me. I actually don't have that strong of an opinion on the API surface, but I would like for us to come to a decision on this before I write any more code :)
@sffc what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly about when TrieResult and the value are or are not returned. I just want to avoid the footgun of the default next()
method taking a i32
of which only 16 bits are ever used.
Since it seems like we can't agree on that, can we agree on suffixing all of the next
methods?
next_u16(u16)
next_code_point(u32)
next_char(char)
I may be okay with next(u16)
. I am mainly hesitant to next(i32)
because that's just an invitation for someone to misuse the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion 2021-12-09:
- Rust may perform optimizations over what C++ had done
- C++ APIs are less ergonomic / have footguns
- Conclusion: Start with Rusty APIs, and file a follow-up issue to investigate codegen and experiment with Markus's suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with suffixes for the input type. How about the following then:
- TrieResult is just a four-value enum like in C++ & Java; no built-in value
- no public get_value()
- next_for_u16(u16 unit) returns TrieResult, no way to get the value
- next_for_char(char c) calls next(lead surrogate) + next(trail surrogate) if supplementary, and returns TrieResult, no way to get the value
- next_with_value_for_u16(unit) returns a (TrieResult, value) pair
- next_with_value_for_char(char c) ...
- the not-with-value functions could eagerly skip the value units, if the node has any
From much work in C/C++/Java, where you frequently end up with "int" anyway, I would also want ..._for_i32(i32_unit) versions of these functions; same as the ..._for_u16(u16 unit) versions except if the unit is outside 0..ffff then no match. Note that in the trie implementation this happens automatically -- no range checks needed on the unit. I don't know whether the same kind of coding leads Rust devs to end up with i32 or u32 types, so we could omit these versions for now and see if call sites end up with inconvenient and unnecessary range checks.
PS: Taking an i32 or u32 does not mean taking a code point. It means being user-friendly in accepting a wider type for the 16-bit input unit, and if the value in the wider type is outside of the u16 range, then the trie code automagically yields no match, without requiring anyone to do an explicit range check. Accepting a code point or scalar value (char
) means checking for BMP vs. supplementary, and taking the latter apart into a surrogate pair and calling next() twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My plan is to land this in experimental for now, since it has been difficult for me to find the time to work on this and this will unblock other people from using this code. I've filed #1439 to track the performance related points that Markus brought up during the review.
utils/char16trie/tests/trie_tests.rs
Outdated
assert_eq!(res, TrieResult::NoMatch); | ||
assert_eq!(res, TrieResult::NoMatch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate line
utils/char16trie/tests/trie_tests.rs
Outdated
let res = itor.next('h' as i32); | ||
assert_eq!(res, TrieResult::NoMatch); | ||
|
||
let mut itor = Char16TrieIterator::new(trie.data.as_slice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you ported more of the API (e.g., first(unit) and/or reset()) then you could reuse the same object and wouldn't have to reload the test data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't reloading the data, it's just setting a pointer. My expectation for a Rust iterator is that once it is consumed, it won't be reused, which is why I didn't implement reset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also hope we can also do this:
let mut itor = Char16TrieIterator::new(trie.data.as_slice()); | |
let mut iter = trie.iter(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(still working)
utils/char16trie/src/char16trie.rs
Outdated
/// The input unit(s) continued a matching string and there is a value for | ||
/// the string so far. No further input byte/unit can continue a matching | ||
/// string. | ||
FinalValue(i32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: In CodePointTrie, we return an upgraded type (like Script
) from the trie, rather than a plain integer. It would be nice if we could do the same thing here. In other words, add a type parameter T
onto Char16Trie<T>
and all other types, and then use the T
here when returning the values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hsivonen What do you think of this suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For collation, the result is a uint32_t which encodes a compact collation element value.
I don't know if the added complexity of genericizing the value type is worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utils/char16trie/src/char16trie.rs
Outdated
/// The input unit(s) continued a matching string and there is a value for | ||
/// the string so far. No further input byte/unit can continue a matching | ||
/// string. | ||
FinalValue(i32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: where does the type i32
come from? Why is it signed? Does the data structure support other values? Does it support storing a u16 or u8 to save on space? @markusicu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used i32
because that is what the ICU4C implementation uses for getValue
: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/ucharstrie.h#L258
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data structure is defined as mapping sequences of uint16_t to int32_t, leaving it up to the user to interpret the values. Some don't care about the value at all, only about the TrieResult. In many cases, these will be small-ish integers. For collation, it will be interpreted as a uint32_t. (AFAIR the internal encoding favors non-negative values, essentially storing uint32_t in a variable-length way -- but programmers like to use signed integers by default.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly just some suggestions regarding API shape. Otherwise it looks quite good!
utils/char16trie/src/char16trie.rs
Outdated
/// - [ICU4J CharsTrie](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html) API. | ||
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] | ||
#[derive(Clone)] | ||
pub struct Char16Trie<'data> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: a higher-level API that may be more user-friendly in certain situations would be
impl<'data, T> Char16Trie<'data, T> {
pub fn get(s: &str) -> Option<T>
}
This would expand to a loop walking over the Char16TrieIterator with TrieResult.
Note that this is something nice that a standalone Char16Trie
enables us to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: a higher-level API that may be more user-friendly in certain situations would be
impl<'data, T> Char16Trie<'data, T> { pub fn get(s: &str) -> Option<T> }This would expand to a loop walking over the Char16TrieIterator with TrieResult.
Note that this is something nice that a standalone
Char16Trie
enables us to do.
Is there a follow-up issue to explore this possibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that, filed #1445.
utils/char16trie/tests/trie_tests.rs
Outdated
let res = itor.next('h' as i32); | ||
assert_eq!(res, TrieResult::NoMatch); | ||
|
||
let mut itor = Char16TrieIterator::new(trie.data.as_slice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also hope we can also do this:
let mut itor = Char16TrieIterator::new(trie.data.as_slice()); | |
let mut iter = trie.iter(); |
Import of the UCharsTrie implementation as of: makotokato/dictionary_segmenter@4d79bbc
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Unblocking code that is landing in experimental, as discussed.
@sffc I've filed issues for the outstanding comments from you and Markus and moved the code from util to experimental. I'm struggling to find enough time to work on this and I would prefer to land it under experimental to unblock Henri and Makoto from being able to use and improve on this as necessary. |
@dminor There are still some open comments about API shape. Can you make sure there are follow-ups filed for those? Also, please fix the build/test failures found by CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(setting review bit)
I filed #1444 for making the value type generic. I think the main discussion about API shape is covered by #1439 since we're trying to decide if a less rusty api would give us significant performance benefits. |
Initial work in progress. This implements the iterator interface which is hopefully sufficient to unblock collator work.
Fixes #1202