Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port UCharsTrie #1264

Merged
merged 23 commits into from
Dec 23, 2021
Merged

Port UCharsTrie #1264

merged 23 commits into from
Dec 23, 2021

Conversation

dminor
Copy link
Contributor

@dminor dminor commented Nov 4, 2021

Initial work in progress. This implements the iterator interface which is hopefully sufficient to unblock collator work.

Fixes #1202

@zbraniecki
Copy link
Member

This reads well! Great work Dan!

I would like to see a bit of an overdocumentation on the decision to use u16 in a Rust project over u8. I think we should strongly document our decision both in README and in lib.rs docstrings to explain our rationale, and make it much easier for someone in the future to reason about revisiting it and switching to u8.

Cargo.toml Outdated
@@ -37,6 +37,7 @@ members = [
"utils/fixed_decimal",
"utils/litemap",
"utils/pattern",
"utils/ucharstrie",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: To be consistent with naming, this component should be named just charstrie / CharsTrie.

The 'U' prefix convention in C/C++ wasn't necessary and considered less readable, ex: we went with the ICU4J name CodePointTrie over the ICU4C UCPTrie.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it as well. I ended up shyed away from suggesting this and my rationale is that char in Rust is u8. UChar is a unique Unicode type backed by u16.
Since this crate operates exclusively on u16 then chartrie may be slightly misleading.

My position here is very weakly held and in particular if we envision switching/extending the crate to support u8 then I'd be strongly in favor of following your advice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was afraid that char would imply u8 to a lot of people and this data structure is operating on u16.

@zbraniecki What do you see as the advantage of using u8 rather than u16 when the underlying data structure we're exporting from ICU4C uses u16?

Copy link
Member

@zbraniecki zbraniecki Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just alignment with how Rust thinks about chars. And if we were to explore it I'd expect us to convert the underlying data structure to u8.

Beyond that, maybe there is some performance/memory consideration - @hsivonen @sffc @Manishearth @markusicu @iainireland ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have plans to implement BytesTrie and that will support u8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps Char16Trie ?

@markusicu convinced me last week that it is better to use u16 units rather than u8 units in the trie even if we are dealing with UTF-8 source text. Basically we can pack a lot more data into the u16 units.

@iainireland
Copy link
Contributor

Dumb question: what are the differences between CharsTrie/BytesTrie/CodepointTrie? Is there a way to unify some/all of them instead of having multiple trie types?

@sffc
Copy link
Member

sffc commented Nov 5, 2021

Dumb question: what are the differences between CharsTrie/BytesTrie/CodepointTrie? Is there a way to unify some/all of them instead of having multiple trie types?

CharsTrie = map from strings to integers, with 16-bit units
BytesTrie = map from strings to integers, with 8-bit units
CodePointTrie = map from single code points to integers

CharsTrie vs BytesTrie is largely an implementation detail. I have asked that we run benchmarks to figure out if we need both or if we should just do one or the other. @markusicu suggested starting with CharsTrie because the implementation is simpler.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • utils/char16trie/Cargo.toml is now changed in the branch
  • utils/char16trie/LICENSE is now changed in the branch
  • utils/char16trie/README.md is now changed in the branch
  • utils/char16trie/src/char16trie.rs is now changed in the branch
  • utils/char16trie/src/lib.rs is now changed in the branch
  • utils/char16trie/tests/test_util.rs is now changed in the branch
  • utils/char16trie/tests/testdata/empty.toml is now changed in the branch
  • utils/char16trie/tests/testdata/months.toml is now changed in the branch
  • utils/char16trie/tests/testdata/test_a_ab.toml is now changed in the branch
  • utils/char16trie/tests/testdata/test_a.toml is now changed in the branch
  • utils/char16trie/tests/testdata/test_branches.toml is now changed in the branch
  • utils/char16trie/tests/testdata/test_compact.toml is now changed in the branch
  • utils/char16trie/tests/testdata/test_long_branch.toml is now changed in the branch
  • utils/char16trie/tests/testdata/test_long_sequence.toml is now changed in the branch
  • utils/char16trie/tests/testdata/test_shortest_branch.toml is now changed in the branch
  • utils/char16trie/tests/trie_tests.rs is now changed in the branch
  • utils/ucharstrie/Cargo.toml is no longer changed in the branch
  • utils/ucharstrie/LICENSE is no longer changed in the branch
  • utils/ucharstrie/README.md is no longer changed in the branch
  • utils/ucharstrie/src/lib.rs is no longer changed in the branch
  • utils/ucharstrie/src/ucharstrie.rs is no longer changed in the branch
  • utils/ucharstrie/tests/test_util.rs is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/empty.toml is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/months.toml is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/test_a_ab.toml is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/test_a.toml is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/test_branches.toml is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/test_compact.toml is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/test_long_branch.toml is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/test_long_sequence.toml is no longer changed in the branch
  • utils/ucharstrie/tests/testdata/test_shortest_branch.toml is no longer changed in the branch
  • utils/ucharstrie/tests/trie_tests.rs is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@dminor dminor marked this pull request as ready for review November 16, 2021 14:42
@dminor dminor requested a review from a team as a code owner November 16, 2021 14:42
@markusicu
Copy link
Member

Dumb question: what are the differences between CharsTrie/BytesTrie/CodepointTrie? Is there a way to unify some/all of them instead of having multiple trie types?

CharsTrie = map from strings to integers, with 16-bit units BytesTrie = map from strings to integers, with 8-bit units CodePointTrie = map from single code points to integers

Right. For details and design docs see #1202 (comment)

UCharsTrie like in ICU4C might work. In my post on issue 1202 I speculated about U16Trie as a plausible name in Rust.

Basically, this type of "string trie" matches sequences of 16-bit units. Used to be UChar in ICU4C before we changed to char16_t. char in Java. u16 in Rust.

CharsTrie vs BytesTrie is largely an implementation detail. I have asked that we run benchmarks to figure out if we need both or if we should just do one or the other. @markusicu suggested starting with CharsTrie because the implementation is simpler.

With a larger base unit, each trie node can encode more stuff in a single unit, which makes processing simpler and possibly a little faster. And Henri needs the 16-bit-units trie for the collation data.

In terms of data size, a BytesTrie might be more compact for a small set of strings because the deltas to be encoded are small. A 16-bit-units trie might be smaller for large sets of strings that need large deltas.

@markusicu
Copy link
Member

UCharsTrie like in ICU4C might work. In my post on issue 1202 I speculated about U16Trie as a plausible name in Rust.

Basically, this type of "string trie" matches sequences of 16-bit units. Used to be UChar in ICU4C before we changed to char16_t. char in Java. u16 in Rust.

I see that it's "char16trie" now. That's probably ok.

For ICU I chose plural CharsTrie / UCharsTrie because it matches sequences of characters. But with the "16" in the name I am not sure where I would stick the plural "s".

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review. Sorry for running so late!

Comment on lines 7 to 8
This component provides a data structure for an time-efficient lookup of values
associated to code points.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This component provides a data structure for an time-efficient lookup of values
associated to code points.
This component provides a data structure for a space-efficient and time-efficient lookup of
sequences of 16-bit units (commonly but not necessarily UTF-16 code units)
which map to integer values.

// A Char16Trie containing the ASCII characters 'a' and 'b'.
let trie_data = vec![48, 97, 176, 98, 32868];
let trie = Char16Trie {
data: ZeroVec::from_slice(trie_data.as_slice()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI Henri asked for a more direct way to create this object from an array slice, or really also a more direct way to create what's currently called the Char16TrieIterator from an array slice.

If it's not necessary to make a named type for what's really just a slice of an array or of a ZeroVec, then it would be nice to cut out this layer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sffc Thoughts on this? Like Markus says, there's no real use for a separate struct from the iterator right now, but maybe you had something in mind for the future?

Data coming from a provider will most likely already be a ZeroVec, and there's no need to create a Char16Trie first, it's possible to create the Char16TrieIterator directly from a slice of ZeroVec data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in other areas, I tend to prefer the extra wrapper layer; it is a zero-cost abstraction that makes code a little bit more readable and maintainable. For example, I prefer this:

let trie: Char16Trie = // get trie from data provider
let mut iter = trie.iter();

over this:

let zv: ZeroVec<u16> = // get zerovec from data provider
let mut iter: Char16TrieIterator = zv.as_slice().into()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the data provider itself, I also prefer the readability of the extra type:

#[icu_provider::data_struct]
pub struct MyDataStruct<'data> {
    /// A map from strings to values
    trie: Char16Trie<'data>,
}

Otherwise, you have this:

#[icu_provider::data_struct]
pub struct MyDataStruct<'data> {
    /// A vector of numbers that represent a Char16Trie, a map from strings to values
    trie_data: ZeroVec<'data, u16>,
}

That being said, the extra type may not be a zero-cost abstraction in serde; but if this is a problem, we can add #[serde(flatten)] to make it zero-cost.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression was that for the current use cases (collator, segmenter) the trie data is interleaved with other data at unpredictable offsets, so I think we'll end up constructing the iterator using a slice, rather than using the Char16Trie structure directly. But I don't see any harm in keeping Char16Trie, it might be more useful in the future, and I might be wrong about the current use cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there can be both a named type for just the array slice and also an iterator constructor that takes an array slice, so that the intermediate type need not be explicitly constructed (just to unwrap it again).

Yes, in collation data and other structures, there are large arrays where certain slices carry string-trie data. For each one you just know where it starts, you don't even know (nor need to know) the length of the slice -- so the slices typically go from some start to the end of the array.

data: ZeroVec::from_slice(trie_data.as_slice()),
};

let mut itor = Char16TrieIterator::new(trie.data.as_slice());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/itor/iter/ for more common naming?

use icu_char16trie::char16trie::{Char16Trie, Char16TrieIterator, TrieResult};
use zerovec::ZeroVec;

// A Char16Trie containing the ASCII characters 'a' and 'b'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// A Char16Trie containing the ASCII characters 'a' and 'b'.
// A Char16Trie mapping string "a" to 1 and string "ab" to 100.

}
}

/// This struct represents a de-serialized Char16Trie that was exported from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Way too much information for public API docs. For public docs, all that a user should care about is that the data is an array of 16-bit units, and it encodes a structure for mapping strings/sequences to values. That is, the documentation that is on the class top level here: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html

What you have here is from the private/internal details that explain the nitty-gritty details to someone trying to understand the code and the data structure. It should be here somewhere, but it should remain private/internal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good to have these docs but perhaps you can attach them to an internal module or somewhere like that.

/// The input unit(s) continued a matching string and there is a value for
/// the string so far. Another input byte/unit can continue a matching
/// string.
Intermediate(i32),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Intermediate(i32),
IntermediateValue(i32),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Result type should have additional functions:

  • matches()
  • hasValue()
  • hasNext()

See

Note how the order (and thus numeric values) of the enum values is chosen in both C++ and Java so that these important predicate functions are trivial to implement. (Each with a single conditional.)

Comment on lines 131 to 114
/// Once next() return TrieResult::NoMatch, all further calls to next()
/// will also return TrieResult::NoMatch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Once next() return TrieResult::NoMatch, all further calls to next()
/// will also return TrieResult::NoMatch.
/// Once next() returns TrieResult::NoMatch, all further calls to next()
/// will also return TrieResult::NoMatch.

/// The input unit(s) continued a matching string and there is a value for
/// the string so far. No further input byte/unit can continue a matching
/// string.
FinalValue(i32),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C++ & Java, next() returns a Result that only expresses the state, not also the value. I chose that for performance. Some users will not care about the value of an intermediate result. Some users will set all values to 0 and only look at the state for a boolean "did match". These users can skip decoding the result, saving some cycles. That's why there is a separate getValue() function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Manishearth Do you think the compiler would optimize away retrieving the value if the user does not care about it? E.g. if the user is matching against FinalValue(_) would we avoid calling get_value() since we never use it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on whether it gets inlined. Hard to say.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is bigger for IntermediateValue. Whenever you have more input (a common case), you want to just skip over the array units that encode the value, rather than read and decode them. If the API always returns the value, then it will always read & decode. Maybe the compiler could throw out the decoding, but I doubt it would be equivalent to skipping.

/// data: ZeroVec::from_slice(trie_data.as_slice()),
/// };
///
/// let mut itor = Char16TrieIterator::new(trie.data.as_slice());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, it should be possible to make this object directly from the array. To the user, there is no value in the intermediate type/object currently called Char16Trie.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the primary way to construct this object should be a method on the Char16Trie. That said, it's harmless to also have an alternate constructor from the slice like @dminor has here.

}

fn skip_value(&self, pos: usize) -> usize {
let lead_byte = self.get(pos);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 123 to 107
/// Remaining length of a linear-match node, minus 1, or None if not in
/// such a node.
remaining_match_length: Option<usize>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Option<> is more idiomatic Rust code and also makes it impossible to have bugs where we forget to check that the value is -1 before using it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is very little checking for non-negative before using the length, and what there is takes advantage of loading one value instead of two (boolean+length) for performance. And the transition to end-of-linear-match-node is trivial: Decrement and store the length, and if we are done that's automatically -1.

Comment on lines 121 to 104
/// Index of next trie unit to read, or None if there are no more matches.
pos: Option<usize>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI -- ICU4J is also using a pos index rather than C++'s pos pointer, and uses a negative pos value for "none". Simpler than Option<usize> in Rust, too? (Loads only one int, rather than bool-then-int as in the Option.)

https://github.com/unicode-org/icu/blob/main/icu4j/main/classes/core/src/com/ibm/icu/util/CharsTrie.java#L697

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like above, I think Option<> is more idiomatic Rust.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably ok for pos, more than for remaining_match_length.

}
}

pub fn get_value(&self, pos: usize) -> i32 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you always call this whenever you return a TrieResult that has a value, then the pos is always after the value, and then this public function, when called explicitly, will never return a real value, right? Making it useless to have this as public API?

My preference is for this code to work like in C++ & Java, and have the user call get_value() only when TrieResult.has_value() and the user cares to see the value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should not be public if we're using the Result the way it is defined above. Let's iterate on that conversation before I change anything here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer the TrieResult-based API.

Intermediate(i32),
}

impl<'a> Char16TrieIterator<'a> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is very little of the public API here, compared to the ICU version. I understand that the ICU CharsTrie.Iterator (which iterates over the (string, value) pairs reachable from the current state) is out of scope here, but what about the rest?

https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html

At a minimum, I would expect to see nextForCodePoint(int cp):
https://github.com/unicode-org/icu/blob/main/icu4j/main/classes/core/src/com/ibm/icu/util/CharsTrie.java#L250-L263

I think Henri will need that for collation.

And since you intentionally omit API that is well-defined and exists in ICU, I think that somewhere in this file there should be a note pointing to ICU for the complete set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code started with what @makotokato implemented to support segmentation, and added what @hsivonen asked for to support the collator. I wasn't trying for a complete port, just trying to address our immediate needs.

Since we're not at 1.0 yet, we can expand the API later as needed. But maybe we should just do it now. I'd be interested in other's thoughts on this as well.

Copy link
Member

@sffc sffc Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most important method is nextForCodePoint(). In fact, the method labeled next() is dangerous, because it looks like you can pass a code point in there, but as soon as you hit a supplementary code point, you will get unexpected results.

In Rust, I suggest the following methods:

  1. next(char cp) --> primary, "safe" method
  2. next_u16(u16 code_unit) --> for lower-level use; the range of the int is enforced by the u16 type
  3. next_u32(u32 cp) --> for when you have a u32 and don't want to convert it to a char

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data structure is defined as mapping from a sequence of 16-bit units to an int32_t. It should support that, even when the 16-bit units have nothing to do with UTF-16 strings, and so next(unit) should be the primary entry point.

"nextCodePoint()" is named suggestively, but there is no enforcement that the inputs must be Unicode strings; it's for a gentleman's agreement between data builder and user to store and provide surrogate pairs.

I found it very handy for next() to take an int32_t with values outside 0..ffff simply not matching: Neither the caller nor the implementation needs to do a range check, error values simply yield NO_MATCH. Easy to use, and efficient.

Remember that this is a low-level data structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea:

  • TrieResult is just a four-value enum like in C++ & Java; no built-in value
  • no public get_value()
  • next(unit) returns TrieResult, no way to get the value
  • next_with_value(unit) returns a (TrieResult, value) pair
  • next_code_point(cp) calls next(lead surrogate) + next(trail surrogate)
  • next_code_point_with_value(cp) calls next(lead surrogate) + next_with_value(trail surrogate)
  • the not-with-value functions could eagerly skip the value units, if the node has any

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems like a reasonable suggestion to me. I actually don't have that strong of an opinion on the API surface, but I would like for us to come to a decision on this before I write any more code :)

@sffc what do you think?

Copy link
Member

@sffc sffc Dec 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about when TrieResult and the value are or are not returned. I just want to avoid the footgun of the default next() method taking a i32 of which only 16 bits are ever used.

Since it seems like we can't agree on that, can we agree on suffixing all of the next methods?

  • next_u16(u16)
  • next_code_point(u32)
  • next_char(char)

I may be okay with next(u16). I am mainly hesitant to next(i32) because that's just an invitation for someone to misuse the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion 2021-12-09:

  • Rust may perform optimizations over what C++ had done
  • C++ APIs are less ergonomic / have footguns
  • Conclusion: Start with Rusty APIs, and file a follow-up issue to investigate codegen and experiment with Markus's suggestions

Copy link
Member

@markusicu markusicu Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with suffixes for the input type. How about the following then:

  • TrieResult is just a four-value enum like in C++ & Java; no built-in value
  • no public get_value()
  • next_for_u16(u16 unit) returns TrieResult, no way to get the value
  • next_for_char(char c) calls next(lead surrogate) + next(trail surrogate) if supplementary, and returns TrieResult, no way to get the value
  • next_with_value_for_u16(unit) returns a (TrieResult, value) pair
  • next_with_value_for_char(char c) ...
  • the not-with-value functions could eagerly skip the value units, if the node has any

From much work in C/C++/Java, where you frequently end up with "int" anyway, I would also want ..._for_i32(i32_unit) versions of these functions; same as the ..._for_u16(u16 unit) versions except if the unit is outside 0..ffff then no match. Note that in the trie implementation this happens automatically -- no range checks needed on the unit. I don't know whether the same kind of coding leads Rust devs to end up with i32 or u32 types, so we could omit these versions for now and see if call sites end up with inconvenient and unnecessary range checks.

PS: Taking an i32 or u32 does not mean taking a code point. It means being user-friendly in accepting a wider type for the 16-bit input unit, and if the value in the wider type is outside of the u16 range, then the trie code automagically yields no match, without requiring anyone to do an explicit range check. Accepting a code point or scalar value (char) means checking for BMP vs. supplementary, and taking the latter apart into a surrogate pair and calling next() twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My plan is to land this in experimental for now, since it has been difficult for me to find the time to work on this and this will unblock other people from using this code. I've filed #1439 to track the performance related points that Markus brought up during the review.

Comment on lines 18 to 14
assert_eq!(res, TrieResult::NoMatch);
assert_eq!(res, TrieResult::NoMatch);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate line

let res = itor.next('h' as i32);
assert_eq!(res, TrieResult::NoMatch);

let mut itor = Char16TrieIterator::new(trie.data.as_slice());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you ported more of the API (e.g., first(unit) and/or reset()) then you could reuse the same object and wouldn't have to reload the test data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't reloading the data, it's just setting a pointer. My expectation for a Rust iterator is that once it is consumed, it won't be reused, which is why I didn't implement reset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also hope we can also do this:

Suggested change
let mut itor = Char16TrieIterator::new(trie.data.as_slice());
let mut iter = trie.iter();

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(still working)

/// The input unit(s) continued a matching string and there is a value for
/// the string so far. No further input byte/unit can continue a matching
/// string.
FinalValue(i32),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: In CodePointTrie, we return an upgraded type (like Script) from the trie, rather than a plain integer. It would be nice if we could do the same thing here. In other words, add a type parameter T onto Char16Trie<T> and all other types, and then use the T here when returning the values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hsivonen What do you think of this suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For collation, the result is a uint32_t which encodes a compact collation element value.

I don't know if the added complexity of genericizing the value type is worth it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// The input unit(s) continued a matching string and there is a value for
/// the string so far. No further input byte/unit can continue a matching
/// string.
FinalValue(i32),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: where does the type i32 come from? Why is it signed? Does the data structure support other values? Does it support storing a u16 or u8 to save on space? @markusicu

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used i32 because that is what the ICU4C implementation uses for getValue: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/ucharstrie.h#L258

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data structure is defined as mapping sequences of uint16_t to int32_t, leaving it up to the user to interpret the values. Some don't care about the value at all, only about the TrieResult. In many cases, these will be small-ish integers. For collation, it will be interpreted as a uint32_t. (AFAIR the internal encoding favors non-negative values, essentially storing uint32_t in a variable-length way -- but programmers like to use signed integers by default.)

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just some suggestions regarding API shape. Otherwise it looks quite good!

/// - [ICU4J CharsTrie](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/CharsTrie.html) API.
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[derive(Clone)]
pub struct Char16Trie<'data> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: a higher-level API that may be more user-friendly in certain situations would be

impl<'data, T> Char16Trie<'data, T> {
    pub fn get(s: &str) -> Option<T>
}

This would expand to a loop walking over the Char16TrieIterator with TrieResult.

Note that this is something nice that a standalone Char16Trie enables us to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: a higher-level API that may be more user-friendly in certain situations would be

impl<'data, T> Char16Trie<'data, T> {
    pub fn get(s: &str) -> Option<T>
}

This would expand to a loop walking over the Char16TrieIterator with TrieResult.

Note that this is something nice that a standalone Char16Trie enables us to do.

Is there a follow-up issue to explore this possibility?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that, filed #1445.

let res = itor.next('h' as i32);
assert_eq!(res, TrieResult::NoMatch);

let mut itor = Char16TrieIterator::new(trie.data.as_slice());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also hope we can also do this:

Suggested change
let mut itor = Char16TrieIterator::new(trie.data.as_slice());
let mut iter = trie.iter();

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • utils/char16trie/Cargo.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

sffc
sffc previously requested changes Dec 7, 2021
utils/char16trie/src/char16trie.rs Outdated Show resolved Hide resolved
utils/char16trie/src/char16trie.rs Outdated Show resolved Hide resolved
@dminor dminor added the discuss-priority Discuss at the next ICU4X meeting label Dec 8, 2021
@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Dec 9, 2021
@echeran echeran removed their request for review December 9, 2021 18:30
@sffc sffc dismissed their stale review December 10, 2021 20:25

Unblocking code that is landing in experimental, as discussed.

@dminor dminor requested a review from sffc December 22, 2021 19:06
@dminor
Copy link
Contributor Author

dminor commented Dec 22, 2021

@sffc I've filed issues for the outstanding comments from you and Markus and moved the code from util to experimental. I'm struggling to find enough time to work on this and I would prefer to land it under experimental to unblock Henri and Makoto from being able to use and improve on this as necessary.

@sffc
Copy link
Member

sffc commented Dec 22, 2021

@dminor There are still some open comments about API shape. Can you make sure there are follow-ups filed for those?

Also, please fix the build/test failures found by CI.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(setting review bit)

@dminor
Copy link
Contributor Author

dminor commented Dec 23, 2021

@dminor There are still some open comments about API shape. Can you make sure there are follow-ups filed for those?

Also, please fix the build/test failures found by CI.

I filed #1444 for making the value type generic. I think the main discussion about API shape is covered by #1439 since we're trying to decide if a less rusty api would give us significant performance benefits.

@dminor dminor merged commit 1a87cea into unicode-org:main Dec 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port UCharsTrie to ICU4X
7 participants