Use byte indices in Rust core #21
Comments
I'd be in favour of providing both index bases, if they are already available internally. This avoids re-parsing bytes to characters on the user side.
Do you actually use character indices or byte indices in
For internal nlprule computation, char indices are never needed (I think), so char indices can be converted to byte indices as early as possible (i.e. when building the binaries). Using byte indices everywhere in Rust and only converting from byte to char at the boundary to Python made sense to me. But you're right, it's worth thinking about providing both in the public API; I agree that would make it nicer for a user. For computation within nlprule, though, I would like to be consistent in whether byte or char indices are used, and ideally use bytes.
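To make the "convert only at the boundary" idea concrete, here is a minimal sketch of the two conversions using only the standard library. The function names `byte_to_char_index` and `char_to_byte_index` are hypothetical and are not taken from nlprule; they only illustrate what a boundary layer might do.

```rust
/// Hypothetical helper (not part of nlprule): convert a byte offset into a
/// char offset. Returns `None` if `byte_idx` is past the end of `text` or
/// falls inside a multi-byte character.
fn byte_to_char_index(text: &str, byte_idx: usize) -> Option<usize> {
    // `is_char_boundary` is also false for indices greater than `text.len()`.
    if !text.is_char_boundary(byte_idx) {
        return None;
    }
    // Count the chars that precede the byte offset.
    Some(text[..byte_idx].chars().count())
}

/// Hypothetical helper (not part of nlprule): convert a char offset into a
/// byte offset. A char offset equal to the number of chars maps to
/// `text.len()`; anything larger yields `None`.
fn char_to_byte_index(text: &str, char_idx: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        // Append the end-of-string offset so exclusive span ends can be converted.
        .chain(std::iter::once(text.len()))
        .nth(char_idx)
}
```

Both conversions are linear in the length of the text, which is one reason to do them once at the API boundary rather than repeatedly during internal computation. For `"héllo wörld"`, the word `wörld` starts at byte offset 7 but char offset 6, because `é` occupies two bytes in UTF-8.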
Yes, it's converted early on into character indices, and soon™ will be grapheme aware as well, but that can be a layer on-top of characters. It simplifies iterations significantly and is also required to properly align to spans provided by
I am just saying that having character based APIs is a nice feature, since it won't break with simple emojis which are multibyte characters.
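As an illustration of the emoji point, the sketch below slices a string by char offsets and compares that with slicing by raw byte offsets. `slice_by_chars` is a hypothetical helper written for this example, not an nlprule or standard library function.

```rust
/// Hypothetical helper: return the substring covering chars `start..end`
/// (end exclusive), or `None` if the range is reversed or out of bounds.
fn slice_by_chars(text: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offsets of every char boundary, including the end of the string.
    let mut boundaries = text
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()));
    let byte_start = boundaries.nth(start)?;
    let byte_end = if end == start {
        byte_start
    } else {
        // The iterator has already advanced past `start`, so skip the rest.
        boundaries.nth(end - start - 1)?
    };
    text.get(byte_start..byte_end)
}
```

With `let text = "I 👍 it";`, `text.len()` is 9 while `text.chars().count()` is 6, since the emoji takes four bytes. `slice_by_chars(text, 2, 3)` returns `Some("👍")`, whereas the byte-based `text.get(2..3)` returns `None` because byte 3 lies inside the emoji, and `&text[2..3]` would panic.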
👍 makes sense
It would be more idiomatic to Rust to use byte indices everywhere internally (and everywhere in the public Rust API) and only convert to char indices at the boundary to Python.