Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-16 support #109

Merged
merged 20 commits into from
Oct 30, 2023
Merged

UTF-16 support #109

merged 20 commits into from
Oct 30, 2023

Conversation

jfkthame
Copy link
Contributor

This aims to provide "native" UTF-16 support, with *U16 versions of InitialInfo, BidiInfo, and Paragraph structs that are parallel to the existing types but work on a [u16] text buffer instead of a str.

Fixes #108

src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
@Manishearth
Copy link
Member

This is pretty big and I'm unlikely to be able to review it soon, fwiw.

@jfkthame
Copy link
Contributor Author

This is pretty big and I'm unlikely to be able to review it soon, fwiw.

Yeah - sorry, I know there's a lot of it!

To make it more approachable, I tried to post it as a series of logical steps that build on each other (and I think it builds and passes tests at all stages of the patch queue, IIRC).

Much of it is moving (rather than modifying) existing code, and then duplicating (with str -> [u16] changes) the structs that are the public API. So there's actually not much new code here; once the TextSource trait is available, the rest is pretty mechanical.

@jfkthame jfkthame requested a review from Manishearth October 16, 2023 17:58
@jfkthame jfkthame force-pushed the utf-16 branch 2 times, most recently from 3907bc8 to 94ba2ab Compare October 23, 2023 13:56
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review has reached e3bad1f, which seems rather big and might take a while to review. If you can splti it into smaller commits that would be appreciated.. Already working through it now.

Please avoid force pushing changes to earlier commits since I've already reviewed them, github shows interdiffs but that's only useful once one has reviewed the entire PR. I'll let you know when it's okay to force-push again; in case you care about maintaining a clean git history.

src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/implicit.rs Outdated Show resolved Hide resolved
src/implicit.rs Outdated Show resolved Hide resolved
src/implicit.rs Outdated
let mut actual_index = level_run.start;

while actual_index < level_run.end {
let (ch, len) = text.char_at(actual_index).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: no panics, please.

The older structure of the code was intentional to use char_indices, please retain something similar. I think we can add an indexing method that returns a &Self to TextSource and then call char_indices() on it

src/lib.rs Outdated Show resolved Hide resolved
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Please fix anything marked issue/nit and we can merge this.

(Fine to force push again, now that I've reviewed the entire thing)

Thanks for doing this!

src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated
/// [Rule L4]: https://www.unicode.org/reports/tr9/#L4
#[cfg_attr(feature = "flame_it", flamer::flame)]
pub fn reorder_line(&self, para: &ParagraphInfo, line: Range<usize>) -> Cow<'text, [u16]> {
if !level::has_rtl(&self.levels[line.clone()]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: would be nice to also refactor this out, perhaps in a way that allows the caller to pass in the encode_utf16

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do this for now, as it's a pretty short function and it wasn't really clear to me how to do it most tidily. But maybe as a followup...

src/lib.rs Outdated
@@ -1590,6 +1597,19 @@ mod tests {
}
);

let text = &to_utf16(text);
assert_eq!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: would be nice to refactor this double-assert into a helper function

src/lib.rs Outdated
///
/// Contains the text paragraphs and `BidiClass` of its characters.
#[derive(PartialEq, Debug)]
pub struct InitialInfoU16<'text> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: it would be ideal for us to have a GenericBidiInfo, GenericInitialInfo, and GenericParagraph types that abstract over an encoding, where the UTF8 one is just a type alias.

This might be more work. I'm fine with having separate UTF16 types for now as long as they're in a module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that'd be nice; my grasp of Rust is still pretty weak, though, so I haven't attempted this as yet.

@jfkthame jfkthame force-pushed the utf-16 branch 2 times, most recently from 548ab42 to f20df42 Compare October 27, 2023 16:55
@jfkthame
Copy link
Contributor Author

Huh, I wanted to completely remove pub from the TextSource trait, and locally that works fine for me, but the CI build here doesn't seem to like it. Sigh.

…xt being processed.

The core bidi algorithms will be adapted to operate on TextSource,
and we will provide implementations for both `str` and `[u16]`
text buffers.
…-based helper.

This is in preparation for creating a UTF-16 version of InitialInfo,
which will rely on the same TextSource-based helper function.
In preparation for creating a UTF-16 version.
Because Paragraph has a BidiInfo field, and we're going to create a separate
BidiInfoU16 to handle UTF-16 data, we'll need a separate ParagraphU16 as well.
@jfkthame jfkthame force-pushed the utf-16 branch 2 times, most recently from 2679f95 to bdc0822 Compare October 27, 2023 17:20
@Manishearth
Copy link
Member

I'm comfortable increasing to 1.56, but yes if we have an alternative we should use it

I didn't even realize we depended on syn, I would love for this to be a leaf crate

@mbrubeck
Copy link
Contributor

Currently we depend on syn only with the optional serde or flamer features enabled. By default this crate has zero dependencies.

@Manishearth
Copy link
Member

ah, yeah, I'm fine with breaking msrv for those features only

we can tweak CI to not run those

@jfkthame
Copy link
Contributor Author

I took @mbrubeck's suggestion (thanks!) to implement the sealed-trait pattern manually, which seems straightforward, so that removes the dependency on the sealed crate.

Unfortunately, 1.36.0 still fails, because REPLACEMENT_CHARACTER, from_u32(), and decode_utf16() are all in the std version of char, not provided by the primitive char type.

@Manishearth Should we just provide local versions of these, too? They seem like they should be simple enough...

@mbrubeck
Copy link
Contributor

Unfortunately, 1.36.0 still fails, because REPLACEMENT_CHARACTER, from_u32(), and decode_utf16() are all in the std version of char, not provided by the primitive char type.

You can use these as core::char::REPLACEMENT_CHARACTER, core::char::from_u32, etc.

@jfkthame
Copy link
Contributor Author

Unfortunately, 1.36.0 still fails, because REPLACEMENT_CHARACTER, from_u32(), and decode_utf16() are all in the std version of char, not provided by the primitive char type.

You can use these as core::char::REPLACEMENT_CHARACTER, core::char::from_u32, etc.

Huh, you're right -- the error message told me I needed std, but core also works. So it's sufficient to just say use core::char; at the beginning of the module to resolve this.

Next up is that I put a lot of the tests and expected-results into arrays, and then use for t in tests { ... } to run the same calls & assertions over a whole set of tests. That works fine for me with current Rust, but again, 1.36.0 doesn't like it, so looks like that'll need to be reworked. I'll try to have a look at it, but any hints/suggestions are welcome!

@mbrubeck
Copy link
Contributor

There are various options, but one simple one is to change the arrays to vecs: let tests = vec![ /* ... */ ];

It looks like Rust 1.36 also chokes on one use of format! to create an &str in the test module. This can probably be replaced by a string literal with Unicode character escapes, or you can move the format! to a binding outside of the tests vector like:

        let fsi_rtl_pdi_ltr = format!("{}א{}a", chars::FSI, chars::PDI);
        let tests = vec![
            // ...
            (
                &fsi_rtl_pdi_ltr,
                // ...

@jfkthame jfkthame force-pushed the utf-16 branch 2 times, most recently from c533080 to 604fdb0 Compare October 30, 2023 11:42
@jfkthame
Copy link
Contributor Author

Thanks for all the feedback & suggestions -- I don't know my way around the Rust ecosystem much yet, so it takes me a while to find things!
At this point I think this should be ready to merge, AFAICS.

@Manishearth
Copy link
Member

Manishearth commented Oct 30, 2023

Apologies for the repeated force-push spam! It was all building fine for me locally, but took some repeated attempts to get the CI jobs happy. Not sure about the failing 1.36.0 job? Please let me know what I ought to do, if anything, or can we ignore it? Thanks!

When dealing with a PR this big it's better to just make fixup commits and clean up the commit history later (or at least explain the changes between force pushes in a comment)

@Manishearth
Copy link
Member

Hmm, I went through all of the new interdiffs and could not see any of the issues being addressed. I might have missed an interdiff or a commit, I might have to go through this commit by commit again.

(This is why I was discouraging force pushes)

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, still a couple of my comments that haven't been completely addressed.

I didn't redo the commit by commit review but I went through my old comments and looked at what was fixed.

For future fixes please only make new commits. You can use git commit --fixup <commit hash> to stage a commit for being squashed into a previous commit later; we can clean up the commit history before merging.

Though the changes I have requested probably are fine in their own commits.

src/implicit.rs Outdated
let mut actual_index = level_run.start;

for ch in text.subrange(level_run.clone()).chars() {
let char_len = T::char_len(ch);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's use char_indices here instead of recomputing char_len. there's a chance the optimizer will figure this out, I'd rather not take that chance.

Again, the old code had the right structure: run char_indices and add to level_run.start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 7866637

src/lib.rs Outdated
/// If doing so, you may prefer to use [`reordered_levels_per_char()`] instead
/// to avoid non-character indices.
///
/// For an all-in-one reordering solution, consider using [`Self::reorder_visual()`].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this line and the one above it is not relevant for a private API.

This should just be a low level explanation of the function: what it does (the first line of docs) and what spec stuff it covers (the mention of rule L1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 69aba46

src/lib.rs Outdated Show resolved Hide resolved
src/implicit.rs Outdated
@@ -256,6 +256,7 @@ pub fn resolve_neutral<'a, D: BidiDataSource, T: TextSource<'a> + ?Sized>(
original_classes: &[BidiClass],
processing_classes: &mut [BidiClass],
) where
<T as TextSource<'a>>::CharIndexIter: Iterator<Item = (usize, char)>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: all of these bounds should live on the TextSource trait, not on all these where clauses

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good with one small change. feel free to force push / squash and I'll merge

src/lib.rs Outdated
/// This trait is sealed and cannot be implemented for types outside this crate.
pub trait TextSource<'text>: private::Sealed
where
<Self as TextSource<'text>>::CharIndexIter: Iterator<Item = (usize, char)>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not: not in the where bounds, they can go on the associated types directly: type CharIndexIterator: Iterator<...>

@Manishearth Manishearth merged commit e699bf1 into servo:master Oct 30, 2023
6 checks passed
@Manishearth
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Native UTF-16 support
3 participants