To make it easier & more efficient to use unicode-bidi in an environment (such as Gecko) where text is handled as UTF-16, I would like to extend the API here to provide a UTF-16 interface, and do the processing directly on UTF-16 code units as an alternative to UTF-8 code units (bytes).
This would not change the existing API in any way, or affect existing users.
Proposal:
Introduce versions of the BidiInfo and InitialInfo structs where the text field is &[u16] instead of &str. I'm suggesting these could be named BidiInfoU16 and InitialInfoU16. Except for the type of their text, these will be identical to the existing UTF-8-based versions.
We'll also need ParagraphU16, because its info will be a &BidiInfoU16.
To allow the actual implementation of the bidi algorithm to be shared between the 8- and 16-bit versions of these structs, I propose a TextSource trait that abstracts access to and iteration over the text, with implementations for str and for [u16]. Only minor adaptation of the InitialInfo, BidiInfo, and Paragraph methods is needed to work with this.
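To illustrate the idea, here is a minimal sketch of what such a trait could look like. The method names `unit_len` and `indexed_chars` and the helper `count_chars` are hypothetical and not taken from the prototype; the `[u16]` impl also assumes that unpaired surrogates are mapped to U+FFFD, which is only one of the possible choices.

```rust
// Hypothetical sketch only: the method names are assumptions, not the crate's API.
pub trait TextSource {
    /// Length of the text in code units (bytes for str, u16 units for [u16]).
    fn unit_len(&self) -> usize;
    /// Iterate over (code-unit index, char) pairs.
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_>;
}

impl TextSource for str {
    fn unit_len(&self) -> usize {
        self.len()
    }
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_> {
        Box::new(self.char_indices())
    }
}

impl TextSource for [u16] {
    fn unit_len(&self) -> usize {
        self.len()
    }
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_> {
        let mut pos = 0;
        Box::new(char::decode_utf16(self.iter().copied()).map(move |r| {
            // Assumption: an unpaired surrogate decodes to U+FFFD and
            // occupies exactly one code unit.
            let (c, units) = match r {
                Ok(c) => (c, c.len_utf16()),
                Err(_) => (char::REPLACEMENT_CHARACTER, 1),
            };
            let start = pos;
            pos += units;
            (start, c)
        }))
    }
}

/// Example of shared code that is generic over the text encoding.
pub fn count_chars<T: TextSource + ?Sized>(text: &T) -> usize {
    text.indexed_chars().count()
}
```

Code written against the trait, like `count_chars` here, never indexes into the text directly, so the same implementation can serve both the `str` and `[u16]` variants.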
@Manishearth Does this sound like a reasonable way forward? I have a prototype implementation working locally, which I can put up as a PR for review if you think the overall idea is acceptable.
One factor to consider is that while we know, when using the str-based API, that the text must be well-formed Unicode, this will not be the case for a [u16]-based API; there could be unpaired surrogate code units present. There are a few ways we could handle this:
(a) Require the text to be valid UTF-16; panic!() if unpaired surrogates are encountered
(b) Have the 16-bit methods return Result everywhere, so that invalid text produces an error
(c) Treat any unpaired surrogate as REPLACEMENT_CHARACTER (U+FFFD) for all bidi processing
I'm currently leaning toward (c), but happy to listen to arguments for other options.
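For comparison, options (b) and (c) could be sketched with the standard library's `char::decode_utf16` roughly as follows. The function names `validate_utf16` and `decode_lossy` are hypothetical, chosen only for this illustration.

```rust
/// Option (b), sketched: return the code-unit index of the first
/// unpaired surrogate as an error.
fn validate_utf16(text: &[u16]) -> Result<(), usize> {
    let mut pos = 0;
    for r in char::decode_utf16(text.iter().copied()) {
        match r {
            Ok(c) => pos += c.len_utf16(),
            Err(_) => return Err(pos),
        }
    }
    Ok(())
}

/// Option (c), sketched: substitute U+FFFD for each unpaired surrogate.
fn decode_lossy(text: &[u16]) -> Vec<char> {
    char::decode_utf16(text.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

Option (a) would amount to calling something like `validate_utf16(text).unwrap()` up front. With (c), callers never have to handle an error path, which keeps the 16-bit API the same shape as the existing `str` one.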
Yeah, I'm fine with this, though I may not have time to review it soon.
In general I would like this crate to be encoding-agnostic (and also be able to support e.g. ill-formed UTF-8).
A thing I would like to see solved here is #86: whatever we do to implement this should abstract over indexing well enough that we no longer need to care about it.
I think we could easily adapt this to handle ill-formed UTF-8. We'd need to create an alternative API using [u8] instead of str; then we provide a suitable implementation of TextSource for [u8], and it should "just work".
We could then make the existing str API into a trivial shim on top of the [u8] API, provided the additional validity-checking is cheap enough to ignore.
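As a rough sketch of that layering, under the assumption that ill-formed byte sequences are replaced with U+FFFD: the functions `chars_from_bytes` and `chars_from_str` below are hypothetical stand-ins for the real entry points, and the decoding uses the standard library's `String::from_utf8_lossy`.

```rust
/// Hypothetical byte-based entry point: tolerates ill-formed UTF-8 by
/// replacing invalid sequences with U+FFFD.
fn chars_from_bytes(text: &[u8]) -> Vec<char> {
    String::from_utf8_lossy(text).chars().collect()
}

/// Hypothetical str-based shim: simply forwards to the byte-based version.
fn chars_from_str(text: &str) -> Vec<char> {
    chars_from_bytes(text.as_bytes())
}
```

In this sketch the shim pays for a validation pass over text that is already known to be valid, which is exactly the cost that would need to be measured.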