Port UCharsTrie to ICU4X #1202

dminor · 2021-10-21T16:37:09Z

UCharsTrie is a data structure in ICU4C that will be needed for the Collator implementation.

Similar to #131, we should port this to ICU4X.

sffc · 2021-10-21T16:42:41Z

From discussion with @markusicu: UCharsTrie and UBytesTrie are similar, but UCharsTrie is more straightforward and he recommends we start there. The learnings from UCharsTrie should be used to influence UBytesTrie. We can decide later whether we want both, or whether we want to keep just UCharsTrie.

dminor · 2021-10-21T19:29:31Z

Makoto has an implementation here already: https://github.com/makotokato/dictionary_segmenter/blob/main/src/uchars_trie.rs

dminor · 2021-10-22T19:19:04Z

My first thought is to import Makoto's implementation, refactor it along the lines of the CodePointTrie implementation, and then add documentation, further tests, etc.

@sffc @makotokato thoughts on this?

sffc · 2021-10-22T19:59:43Z

It looks fairly straightforward to import Makoto's implementation. That should solve the "hard part", which is writing the algorithms to handle the UCharsTrie data model.

The question we'll need to answer on our end is the API. Note that with UCharsTrie, we need both the immutable data as well as a mutable "iterator" type object. This is different than UnicodeSet and CodePointTrie, which have only the immutable type. We should consider something like

/// Main immutable type that can be serialized into a data struct
#[derive(Serialize, Deserialize, Clone)]
pub struct UCharsTrie<'data> {
    #[serde(borrow)]
    pub data: ZeroVec<'data, u16>,
}

/// Ephemeral, mutable runtime type
pub struct UCharsTrieIterator<'a> {
    trie: &'a [u16::ULE],
    pos: Option<usize>,
    root: usize,
    remaining_match_length: Option<usize>,
}

impl<'data> UCharsTrie<'data> {
    pub fn new_iter<'a>(&'a self) -> UCharsTrieIterator<'a> {
        UCharsTrieIterator {
            trie: self.data.as_slice(),
            // ...
        }
    }
}

CC @markusicu

hsivonen · 2021-10-26T09:10:35Z

The collator data interleaves trie data and other data into a long sequence of 16-bit values, so at least for the collator, the UCharsTrie struct from the previous comment wouldn't be useful, and it would be useful to create a UCharsTrieIterator directly from a &[u16] (or &[u16::ULE]; I'm not quite up-to-speed with the ULE stuff) such that the iterator just ignores excess data at the end of the slice the way ICU4C ignores trailing data.

sffc · 2021-10-26T09:37:22Z

Note that in my post UCharsTrie is basically a newtype wrapper over &[u16::ULE]. If you have a &[u16::ULE] then you can get UCharsTrie for free.

From the data provider you only have &[u16::ULE], not &[u16], since the latter type is not encodable in a binary data file without making it sensitive to endianness and alignment.

hsivonen · 2021-10-26T15:15:06Z

Note that in my post UCharsTrie is basically a newtype wrapper over &[u16::ULE]. If you have a &[u16::ULE] then you can get UCharsTrie for free.

For free in terms of run-time performance, yes. In terms of ergonomics, for the collator use cases it would make more sense to be able to construct an UCharsIterator from the slice directly.

(Also, it would make sense not to have a root: usize but to assume that the trie starts from the start of the slice and you set the root position by subslicing the data accordingly.)

From the data provider you only have &[u16::ULE], not &[u16], since the latter type is not encodable in a binary data file without making it sensitive to endianness and alignment.

I had noticed the endianness aspect. Not guaranteeing alignment, either, is news to me.

sffc · 2021-10-26T16:00:06Z

Note that in my post UCharsTrie is basically a newtype wrapper over &[u16::ULE]. If you have a &[u16::ULE] then you can get UCharsTrie for free.

For free in terms of run-time performance, yes. In terms of ergonomics, for the collator use cases it would make more sense to be able to construct an UCharsIterator from the slice directly.

I'd be fine with adding an extra constructor to UCharsTrieIterator that takes &[u16::ULE] and maybe another that takes &[u16] if it's easier to read and write.

markusicu · 2021-10-29T00:49:14Z

Docs:

Naming:
I called it UCharsTrie in C++ because of our typedef UChar, from before we could reliably use char16_t.
By contrast, in Java I called it CharsTrie because of Java type char (also equivalent to uint16_t).
See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/util/CharsTrie.html

API:
There are two types of iterators here. In ICU4C/J, the "trie" object is a string-matching iterator. But there is also a nested class Iterator “for all of the (string, value) pairs in a CharsTrie.”
See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/util/CharsTrie.Iterator.html
I suggest that ICU4X roughly follow that naming convention. Looking at Shane's API types, the first could be (U?)CharsTrieData, the second one (U?)CharsTrie, and then there would be another one like (U?)CharsTrieIterator if Rust doesn't do nested types.
And maybe Rust wants to use U16Trie... as a prefix??

sffc · 2021-10-29T15:01:49Z

Naming:
CodePointTrie is an immutable data structure that you can use repeatedly. I don't understand why [U]CharsTrie is mutable and needs to be recreated every time you try to use it. I feel like the mutable version should be named [U]CharsTrieIterator, or if that name is reserved for something else, then maybe [U]CharsTrieCursor or something like that.

markusicu · 2021-11-18T00:08:06Z

Naming: CodePointTrie is an immutable data structure that you can use repeatedly. I don't understand why [U]CharsTrie is mutable and needs to be recreated every time you try to use it.

The immutable data structure is just an array of u16. It didn't seem necessary in C++/Java to wrap that into a named type.

Walking the structure while matching requires an iterator. The API exposes that "close to the metal". There is a next(CharSequence) but it's still stateful. There could be a const convenience version that starts from the initial state and ends with a match result, but various use cases really do want to know what happens with intermediate matches, so they need the stateful iterator API.

I feel like the mutable version should be named [U]CharsTrieIterator, or if that name is reserved for something else, then maybe [U]CharsTrieCursor or something like that.

I agree that calling the stateful thing an ...Iterator makes it clear that that thing is stateful, but it would be nice to just use the same name (modulo language-specific UChars/Chars/U16 prefix) as in C++/Java.

The nested Iterator class in ICU iterates over the contents (sequence/value pairs) reachable from the current trie object state.

dminor added T-core Type: Required functionality A-scope Area: Project scope, feature coverage C-unicode Component: Props, sets, tries S-large Size: A few weeks (larger feature, major refactoring) labels Oct 21, 2021

dminor added this to the 2021 Q4 0.5 Sprint B milestone Oct 21, 2021

dminor self-assigned this Oct 21, 2021

dminor modified the milestones: 2021 Q4 0.5 Sprint B, ICU4X 0.5 Oct 21, 2021

dminor modified the milestones: ICU4X 0.5, 2021 Q4 0.5 Sprint A Oct 21, 2021

dminor changed the title ~~Port UCharTrie to ICU4X~~ Port UCharsTrie to ICU4X Oct 21, 2021

dminor mentioned this issue Nov 4, 2021

Port UCharsTrie #1264

Merged

sffc modified the milestones: 2021 Q4 0.5 Sprint A, 2021 Q4 0.5 Sprint B Nov 5, 2021

sffc modified the milestones: 2021 Q4 0.5 Sprint B, 2021 Q4 0.5 Sprint C Nov 18, 2021

sffc modified the milestones: 2021 Q4 0.5 Sprint C, 2021 Q4 0.5 Sprint D Dec 9, 2021

dminor closed this as completed in #1264 Dec 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port UCharsTrie to ICU4X #1202

Port UCharsTrie to ICU4X #1202

dminor commented Oct 21, 2021

sffc commented Oct 21, 2021

dminor commented Oct 21, 2021

dminor commented Oct 22, 2021

sffc commented Oct 22, 2021 •

edited

Loading

hsivonen commented Oct 26, 2021

sffc commented Oct 26, 2021

hsivonen commented Oct 26, 2021

sffc commented Oct 26, 2021

markusicu commented Oct 29, 2021 •

edited

Loading

sffc commented Oct 29, 2021

markusicu commented Nov 18, 2021

Port UCharsTrie to ICU4X #1202

Port UCharsTrie to ICU4X #1202

Comments

dminor commented Oct 21, 2021

sffc commented Oct 21, 2021

dminor commented Oct 21, 2021

dminor commented Oct 22, 2021

sffc commented Oct 22, 2021 • edited Loading

hsivonen commented Oct 26, 2021

sffc commented Oct 26, 2021

hsivonen commented Oct 26, 2021

sffc commented Oct 26, 2021

markusicu commented Oct 29, 2021 • edited Loading

sffc commented Oct 29, 2021

markusicu commented Nov 18, 2021

sffc commented Oct 22, 2021 •

edited

Loading

markusicu commented Oct 29, 2021 •

edited

Loading