Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port UCharsTrie to ICU4X #1202

Closed
dminor opened this issue Oct 21, 2021 · 11 comments · Fixed by #1264
Closed

Port UCharsTrie to ICU4X #1202

dminor opened this issue Oct 21, 2021 · 11 comments · Fixed by #1264
Assignees
Labels
A-scope Area: Project scope, feature coverage C-unicode Component: Props, sets, tries S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality

Comments

@dminor
Copy link
Contributor

dminor commented Oct 21, 2021

UCharsTrie is a data structure in ICU4C that will be needed for the Collator implementation.

Similar to #131, we should port this to ICU4X.

@dminor dminor added T-core Type: Required functionality A-scope Area: Project scope, feature coverage C-unicode Component: Props, sets, tries S-large Size: A few weeks (larger feature, major refactoring) labels Oct 21, 2021
@dminor dminor added this to the 2021 Q4 0.5 Sprint B milestone Oct 21, 2021
@dminor dminor self-assigned this Oct 21, 2021
@sffc
Copy link
Member

sffc commented Oct 21, 2021

From discussion with @markusicu: UCharsTrie and UBytesTrie are similar, but UCharsTrie is more straightforward and he recommends we start there. The learnings from UCharsTrie should be used to influence UBytesTrie. We can decide later whether we want both, or whether we want to keep just UCharsTrie.

@dminor dminor changed the title Port UCharTrie to ICU4X Port UCharsTrie to ICU4X Oct 21, 2021
@dminor
Copy link
Contributor Author

dminor commented Oct 21, 2021

@dminor
Copy link
Contributor Author

dminor commented Oct 22, 2021

My first thought is to import Makoto's implementation, refactor it along the lines of the CodePointTrie implementation, and then add documentation, further tests, etc.

@sffc @makotokato thoughts on this?

@sffc
Copy link
Member

sffc commented Oct 22, 2021

It looks fairly straightforward to import Makoto's implementation. That should solve the "hard part", which is writing the algorithms to handle the UCharsTrie data model.

The question we'll need to answer on our end is the API. Note that with UCharsTrie, we need both the immutable data as well as a mutable "iterator" type object. This is different than UnicodeSet and CodePointTrie, which have only the immutable type. We should consider something like

/// Main immutable type that can be serialized into a data struct
#[derive(Serialize, Deserialize, Clone)]
pub struct UCharsTrie<'data> {
    #[serde(borrow)]
    pub data: ZeroVec<'data, u16>,
}

/// Ephemeral, mutable runtime type
pub struct UCharsTrieIterator<'a> {
    trie: &'a [u16::ULE],
    pos: Option<usize>,
    root: usize,
    remaining_match_length: Option<usize>,
}

impl<'data> UCharsTrie<'data> {
    pub fn new_iter<'a>(&'a self) -> UCharsTrieIterator<'a> {
        UCharsTrieIterator {
            trie: self.data.as_slice(),
            // ...
        }
    }
}

CC @markusicu

@hsivonen
Copy link
Member

The collator data interleaves trie data and other data into a long sequence of 16-bit values, so at least for the collator, the UCharsTrie struct from the previous comment wouldn't be useful, and it would be useful to create a UCharsTrieIterator directly from a &[u16] (or &[u16::ULE]; I'm not quite up-to-speed with the ULE stuff) such that the iterator just ignores excess data at the end of the slice the way ICU4C ignores trailing data.

@sffc
Copy link
Member

sffc commented Oct 26, 2021

Note that in my post UCharsTrie is basically a newtype wrapper over &[u16::ULE]. If you have a &[u16::ULE] then you can get UCharsTrie for free.

From the data provider you only have &[u16::ULE], not &[u16], since the latter type is not encodable in a binary data file without making it sensitive to endianness and alignment.

@hsivonen
Copy link
Member

Note that in my post UCharsTrie is basically a newtype wrapper over &[u16::ULE]. If you have a &[u16::ULE] then you can get UCharsTrie for free.

For free in terms of run-time performance, yes. In terms of ergonomics, for the collator use cases it would make more sense to be able to construct an UCharsIterator from the slice directly.

(Also, it would make sense not to have a root: usize but to assume that the trie starts from the start of the slice and you set the root position by subslicing the data accordingly.)

From the data provider you only have &[u16::ULE], not &[u16], since the latter type is not encodable in a binary data file without making it sensitive to endianness and alignment.

I had noticed the endianness aspect. Not guaranteeing alignment, either, is news to me.

@sffc
Copy link
Member

sffc commented Oct 26, 2021

Note that in my post UCharsTrie is basically a newtype wrapper over &[u16::ULE]. If you have a &[u16::ULE] then you can get UCharsTrie for free.

For free in terms of run-time performance, yes. In terms of ergonomics, for the collator use cases it would make more sense to be able to construct an UCharsIterator from the slice directly.

I'd be fine with adding an extra constructor to UCharsTrieIterator that takes &[u16::ULE] and maybe another that takes &[u16] if it's easier to read and write.

@markusicu
Copy link
Member

markusicu commented Oct 29, 2021

Docs:

Naming:
I called it UCharsTrie in C++ because of our typedef UChar, from before we could reliably use char16_t.
By contrast, in Java I called it CharsTrie because of Java type char (also equivalent to uint16_t).
See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/util/CharsTrie.html

API:
There are two types of iterators here. In ICU4C/J, the "trie" object is a string-matching iterator. But there is also a nested class Iterator “for all of the (string, value) pairs in a CharsTrie.”
See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/util/CharsTrie.Iterator.html
I suggest that ICU4X roughly follow that naming convention. Looking at Shane's API types, the first could be (U?)CharsTrieData, the second one (U?)CharsTrie, and then there would be another one like (U?)CharsTrieIterator if Rust doesn't do nested types.
And maybe Rust wants to use U16Trie... as a prefix??

@sffc
Copy link
Member

sffc commented Oct 29, 2021

Naming:
CodePointTrie is an immutable data structure that you can use repeatedly. I don't understand why [U]CharsTrie is mutable and needs to be recreated every time you try to use it. I feel like the mutable version should be named [U]CharsTrieIterator, or if that name is reserved for something else, then maybe [U]CharsTrieCursor or something like that.

@markusicu
Copy link
Member

Naming: CodePointTrie is an immutable data structure that you can use repeatedly. I don't understand why [U]CharsTrie is mutable and needs to be recreated every time you try to use it.

The immutable data structure is just an array of u16. It didn't seem necessary in C++/Java to wrap that into a named type.

Walking the structure while matching requires an iterator. The API exposes that "close to the metal". There is a next(CharSequence) but it's still stateful. There could be a const convenience version that starts from the initial state and ends with a match result, but various use cases really do want to know what happens with intermediate matches, so they need the stateful iterator API.

I feel like the mutable version should be named [U]CharsTrieIterator, or if that name is reserved for something else, then maybe [U]CharsTrieCursor or something like that.

I agree that calling the stateful thing an ...Iterator makes it clear that that thing is stateful, but it would be nice to just use the same name (modulo language-specific UChars/Chars/U16 prefix) as in C++/Java.

The nested Iterator class in ICU iterates over the contents (sequence/value pairs) reachable from the current trie object state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-scope Area: Project scope, feature coverage C-unicode Component: Props, sets, tries S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants