Port UCharsTrie to ICU4X #1202
Comments
From discussion with @markusicu: UCharsTrie and UBytesTrie are similar, but UCharsTrie is more straightforward and he recommends we start there. What we learn from porting UCharsTrie should inform the UBytesTrie port. We can decide later whether we want both, or whether we want to keep just UCharsTrie. |
Makoto has an implementation here already: https://github.com/makotokato/dictionary_segmenter/blob/main/src/uchars_trie.rs |
My first thought is to import Makoto's implementation, refactor it along the lines of the @sffc @makotokato thoughts on this? |
It looks fairly straightforward to import Makoto's implementation. That should solve the "hard part", which is writing the algorithms to handle the UCharsTrie data model. The question we'll need to answer on our end is the API. Note that with UCharsTrie, we need both the immutable data and a mutable "iterator"-type object. This is different from UnicodeSet and CodePointTrie, which have only the immutable type. We should consider something like:
use serde::{Deserialize, Serialize};
use zerovec::{ule::AsULE, ZeroVec};

/// Main immutable type that can be serialized into a data struct
#[derive(Serialize, Deserialize, Clone)]
pub struct UCharsTrie<'data> {
    #[serde(borrow)]
    pub data: ZeroVec<'data, u16>,
}

/// Ephemeral, mutable runtime type
pub struct UCharsTrieIterator<'a> {
    // Slice of unaligned little-endian u16 units borrowed from the trie data
    trie: &'a [<u16 as AsULE>::ULE],
    pos: Option<usize>,
    root: usize,
    remaining_match_length: Option<usize>,
}

impl<'data> UCharsTrie<'data> {
    pub fn new_iter<'a>(&'a self) -> UCharsTrieIterator<'a> {
        UCharsTrieIterator {
            trie: self.data.as_slice(),
            // ...
        }
    }
}
CC @markusicu |
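The split between immutable data and an ephemeral cursor can be tried out with the standard library alone. The following is only a sketch under stated assumptions: `Cow<[u16]>` stands in for `ZeroVec`, the names `UnitsData`, `UnitsCursor`, and `advance` are hypothetical, and the matching rule (compare against one stored sequence) is a placeholder, not the UCharsTrie algorithm.

```rust
use std::borrow::Cow;

/// Immutable data. `Cow` is a std-only stand-in for `ZeroVec` (assumption).
#[derive(Clone, Debug)]
pub struct UnitsData<'data> {
    pub data: Cow<'data, [u16]>,
}

/// Ephemeral, mutable cursor that borrows from the immutable data.
pub struct UnitsCursor<'a> {
    units: &'a [u16],
    /// `None` once the cursor has failed to match.
    pos: Option<usize>,
}

impl<'data> UnitsData<'data> {
    /// Any number of cursors may borrow the same data at once.
    pub fn new_cursor<'a>(&'a self) -> UnitsCursor<'a> {
        UnitsCursor { units: &self.data, pos: Some(0) }
    }
}

impl<'a> UnitsCursor<'a> {
    /// Placeholder rule: succeed only if `unit` equals the stored unit at the
    /// current position. A failed cursor stays failed.
    pub fn advance(&mut self, unit: u16) -> bool {
        match self.pos {
            Some(i) if self.units.get(i) == Some(&unit) => {
                self.pos = Some(i + 1);
                true
            }
            _ => {
                self.pos = None;
                false
            }
        }
    }
}
```

Because each cursor holds only a shared borrow, two cursors created from the same `UnitsData` advance independently, and the data itself never needs to be mutable.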
The collator data interleaves trie data and other data into a long sequence of 16-bit values, so at least for the collator, the |
Note that in my post From the data provider you only have |
For free in terms of run-time performance, yes. In terms of ergonomics, for the collator use cases it would make more sense to be able to construct an (Also, it would make sense not to have a
I had noticed the endianness aspect. Not guaranteeing alignment, either, is news to me. |
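For readers unfamiliar with the endianness and alignment point, here is a minimal std-only sketch of what reading 16-bit units from an unaligned little-endian byte buffer looks like. `RawU16` and `read_units` are hypothetical names for illustration, and this is an assumption about the general technique, not a reproduction of zerovec's types.

```rust
/// Stand-in for an unaligned little-endian u16 element (hypothetical type).
/// Its alignment is 1, so it can sit at any byte offset.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct RawU16([u8; 2]);

impl RawU16 {
    /// Store a native u16 as two little-endian bytes.
    pub fn from_aligned(value: u16) -> Self {
        RawU16(value.to_le_bytes())
    }

    /// Decode back to a native u16, regardless of host endianness.
    pub fn get(self) -> u16 {
        u16::from_le_bytes(self.0)
    }
}

/// Decode consecutive little-endian u16 units from a byte slice that may
/// start at an odd address. A trailing odd byte is ignored.
pub fn read_units(bytes: &[u8]) -> impl Iterator<Item = u16> + '_ {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
}
```

Every read goes through a byte-wise decode, so the buffer does not have to be 2-byte aligned or match the host byte order. The cost is that the data cannot simply be viewed as `&[u16]`.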
I'd be fine with adding an extra constructor to |
Docs:
Naming: API: |
Naming: |
The immutable data structure is just an array of u16. It didn't seem necessary in C++/Java to wrap that into a named type. Walking the structure while matching requires an iterator. The API exposes that "close to the metal". There is a next(CharSequence) but it's still stateful. There could be a const convenience version that starts from the initial state and ends with a match result, but various use cases really do want to know what happens with intermediate matches, so they need the stateful iterator API.
I agree that calling the stateful thing an ...Iterator makes it clear that that thing is stateful, but it would be nice to just use the same name (modulo language-specific UChars/Chars/U16 prefix) as in C++/Java. The nested Iterator class in ICU iterates over the contents (sequence/value pairs) reachable from the current trie object state. |
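To make the stateful-versus-convenience distinction concrete, here is a toy sketch. It assumes a simple node-based trie, not the compact u16 array format of UCharsTrie, and all names (`ToyTrie`, `ToyCursor`, `TrieResult`, `get`) are hypothetical. The result variants are loosely modeled on ICU's `UStringTrieResult`.

```rust
use std::collections::BTreeMap;

/// Outcome of feeding one 16-bit unit to the cursor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrieResult {
    NoMatch,           // no key continues with this unit
    NoValue,           // a valid prefix, but no key ends here
    Intermediate(i32), // a key ends here and longer keys exist
    Final(i32),        // a key ends here and nothing longer exists
}

#[derive(Default)]
struct Node {
    value: Option<i32>,
    children: BTreeMap<u16, usize>,
}

/// Toy immutable-after-build trie; node 0 is the root.
pub struct ToyTrie {
    nodes: Vec<Node>,
}

impl ToyTrie {
    pub fn new() -> Self {
        ToyTrie { nodes: vec![Node::default()] }
    }

    pub fn insert(&mut self, key: &str, value: i32) {
        let mut idx = 0;
        for unit in key.encode_utf16() {
            let existing = self.nodes[idx].children.get(&unit).copied();
            idx = match existing {
                Some(child) => child,
                None => {
                    let child = self.nodes.len();
                    self.nodes.push(Node::default());
                    self.nodes[idx].children.insert(unit, child);
                    child
                }
            };
        }
        self.nodes[idx].value = Some(value);
    }

    /// Stateful API: the caller sees the result after every unit.
    pub fn cursor(&self) -> ToyCursor<'_> {
        ToyCursor { trie: self, pos: Some(0) }
    }

    /// Convenience lookup built on the cursor: start at the root and report
    /// only the final result. The empty key is not handled in this toy.
    pub fn get(&self, key: &str) -> Option<i32> {
        let mut cursor = self.cursor();
        let mut last = TrieResult::NoValue;
        for unit in key.encode_utf16() {
            last = cursor.next(unit);
        }
        match last {
            TrieResult::Intermediate(v) | TrieResult::Final(v) => Some(v),
            _ => None,
        }
    }
}

pub struct ToyCursor<'a> {
    trie: &'a ToyTrie,
    pos: Option<usize>, // None after a failed match
}

impl<'a> ToyCursor<'a> {
    pub fn next(&mut self, unit: u16) -> TrieResult {
        let idx = match self.pos {
            Some(idx) => idx,
            None => return TrieResult::NoMatch,
        };
        match self.trie.nodes[idx].children.get(&unit) {
            None => {
                self.pos = None;
                TrieResult::NoMatch
            }
            Some(&child) => {
                self.pos = Some(child);
                let node = &self.trie.nodes[child];
                match (node.value, node.children.is_empty()) {
                    (Some(v), true) => TrieResult::Final(v),
                    (Some(v), false) => TrieResult::Intermediate(v),
                    (None, _) => TrieResult::NoValue,
                }
            }
        }
    }
}
```

With the keys "a" → 1 and "ab" → 2, feeding 'a', 'b', 'c' to one cursor yields `Intermediate(1)`, `Final(2)`, `NoMatch`. A caller doing longest-match segmentation needs that `Intermediate(1)`, which the convenience `get` throws away.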
UCharsTrie is a data structure in ICU4C that will be needed for the Collator implementation.
Similar to #131, we should port this to ICU4X.