-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF-16 support #109
UTF-16 support #109
Conversation
This is pretty big and I'm unlikely to be able to review it soon, fwiw. |
Yeah - sorry, I know there's a lot of it! To make it more approachable, I tried to post it as a series of logical steps that build on each other (and I think it builds and passes tests at all stages of the patch queue, IIRC). Much of it is moving (rather than modifying) existing code, and then duplicating (with |
3907bc8
to
94ba2ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review has reached e3bad1f, which seems rather big and might take a while to review. If you can splti it into smaller commits that would be appreciated.. Already working through it now.
Please avoid force pushing changes to earlier commits since I've already reviewed them, github shows interdiffs but that's only useful once one has reviewed the entire PR. I'll let you know when it's okay to force-push again; in case you care about maintaining a clean git history.
src/implicit.rs
Outdated
let mut actual_index = level_run.start; | ||
|
||
while actual_index < level_run.end { | ||
let (ch, len) = text.char_at(actual_index).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: no panics, please.
The older structure of the code was intentional to use char_indices
, please retain something similar. I think we can add an indexing method that returns a &Self
to TextSource
and then call char_indices()
on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Please fix anything marked issue/nit and we can merge this.
(Fine to force push again, now that I've reviewed the entire thing)
Thanks for doing this!
src/lib.rs
Outdated
/// [Rule L4]: https://www.unicode.org/reports/tr9/#L4 | ||
#[cfg_attr(feature = "flame_it", flamer::flame)] | ||
pub fn reorder_line(&self, para: &ParagraphInfo, line: Range<usize>) -> Cow<'text, [u16]> { | ||
if !level::has_rtl(&self.levels[line.clone()]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: would be nice to also refactor this out, perhaps in a way that allows the caller to pass in the encode_utf16
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do this for now, as it's a pretty short function and it wasn't really clear to me how to do it most tidily. But maybe as a followup...
src/lib.rs
Outdated
@@ -1590,6 +1597,19 @@ mod tests { | |||
} | |||
); | |||
|
|||
let text = &to_utf16(text); | |||
assert_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: would be nice to refactor this double-assert into a helper function
src/lib.rs
Outdated
/// | ||
/// Contains the text paragraphs and `BidiClass` of its characters. | ||
#[derive(PartialEq, Debug)] | ||
pub struct InitialInfoU16<'text> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: it would be ideal for us to have a GenericBidiInfo, GenericInitialInfo, and GenericParagraph types that abstract over an encoding, where the UTF8 one is just a type alias.
This might be more work. I'm fine with having separate UTF16 types for now as long as they're in a module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that'd be nice; my grasp of Rust is still pretty weak, though, so I haven't attempted this as yet.
548ab42
to
f20df42
Compare
Huh, I wanted to completely remove |
…xt being processed. The core bidi algorithms will be adapted to operate on TextSource, and we will provide implementations for both `str` and `[u16]` text buffers.
…-based helper. This is in preparation for creating a UTF-16 version of InitialInfo, which will rely on the same TextSource-based helper function.
In preparation for creating a UTF-16 version.
Because Paragraph has a BidiInfo field, and we're going to create a separate BidiInfoU16 to handle UTF-16 data, we'll need a separate ParagraphU16 as well.
2679f95
to
bdc0822
Compare
I'm comfortable increasing to 1.56, but yes if we have an alternative we should use it I didn't even realize we depended on syn, I would love for this to be a leaf crate |
Currently we depend on |
ah, yeah, I'm fine with breaking msrv for those features only we can tweak CI to not run those |
I took @mbrubeck's suggestion (thanks!) to implement the sealed-trait pattern manually, which seems straightforward, so that removes the dependency on the Unfortunately, 1.36.0 still fails, because @Manishearth Should we just provide local versions of these, too? They seem like they should be simple enough... |
You can use these as |
Huh, you're right -- the error message told me I needed Next up is that I put a lot of the tests and expected-results into arrays, and then use |
There are various options, but one simple one is to change the arrays to vecs: It looks like Rust 1.36 also chokes on one use of let fsi_rtl_pdi_ltr = format!("{}א{}a", chars::FSI, chars::PDI);
let tests = vec![
// ...
(
&fsi_rtl_pdi_ltr,
// ... |
c533080
to
604fdb0
Compare
Thanks for all the feedback & suggestions -- I don't know my way around the Rust ecosystem much yet, so it takes me a while to find things! |
When dealing with a PR this big it's better to just make fixup commits and clean up the commit history later (or at least explain the changes between force pushes in a comment) |
Hmm, I went through all of the new interdiffs and could not see any of the issues being addressed. I might have missed an interdiff or a commit, I might have to go through this commit by commit again. (This is why I was discouraging force pushes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there, still a couple of my comments that haven't been completely addressed.
I didn't redo the commit by commit review but I went through my old comments and looked at what was fixed.
For future fixes please only make new commits. You can use git commit --fixup <commit hash>
to stage a commit for being squashed into a previous commit later; we can clean up the commit history before merging.
Though the changes I have requested probably are fine in their own commits.
src/implicit.rs
Outdated
let mut actual_index = level_run.start; | ||
|
||
for ch in text.subrange(level_run.clone()).chars() { | ||
let char_len = T::char_len(ch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's use char_indices here instead of recomputing char_len. there's a chance the optimizer will figure this out, I'd rather not take that chance.
Again, the old code had the right structure: run char_indices and add to level_run.start
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 7866637
src/lib.rs
Outdated
/// If doing so, you may prefer to use [`reordered_levels_per_char()`] instead | ||
/// to avoid non-character indices. | ||
/// | ||
/// For an all-in-one reordering solution, consider using [`Self::reorder_visual()`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this line and the one above it is not relevant for a private API.
This should just be a low level explanation of the function: what it does (the first line of docs) and what spec stuff it covers (the mention of rule L1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 69aba46
src/implicit.rs
Outdated
@@ -256,6 +256,7 @@ pub fn resolve_neutral<'a, D: BidiDataSource, T: TextSource<'a> + ?Sized>( | |||
original_classes: &[BidiClass], | |||
processing_classes: &mut [BidiClass], | |||
) where | |||
<T as TextSource<'a>>::CharIndexIter: Iterator<Item = (usize, char)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: all of these bounds should live on the TextSource trait, not on all these where clauses
…ke the utf16 submodule public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good with one small change. feel free to force push / squash and I'll merge
src/lib.rs
Outdated
/// This trait is sealed and cannot be implemented for types outside this crate. | ||
pub trait TextSource<'text>: private::Sealed | ||
where | ||
<Self as TextSource<'text>>::CharIndexIter: Iterator<Item = (usize, char)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not: not in the where bounds, they can go on the associated types directly: type CharIndexIterator: Iterator<...>
Thanks! |
This aims to provide "native" UTF-16 support, with
*U16
versions ofInitialInfo
,BidiInfo
, andParagraph
structs that are parallel to the existing types but work on a[u16]
text buffer instead of astr
.Fixes #108