Is unic-bidi a suitable dependency for BiDi? #20

Open
raphlinus opened this issue Sep 3, 2020 · 17 comments

@raphlinus

Apologies if posting this issue here is out of scope, feel free to close. A large part of the reason I post is to raise awareness of issues I've found, both bugs that will result in incorrect behavior (open-i18n/rust-unic#272) and concerns about the API (open-i18n/rust-unic#273). So far I haven't gotten any feedback, which is not a very encouraging sign.

I think there should be a single place where BiDi work happens. I also think unic is the natural place for it, but if they're not responsive, we need to consider other alternatives. I should also point out that BiDi support in Druid/Piet is pretty far out on the roadmap. Part of the reason I'm investigating this is to do roadmap planning, and also to avoid making API choices now that will make BiDi that much harder in the future.

@dhardy
Contributor

dhardy commented Sep 3, 2020

My very limited research found both unicode-bidi and unic-bidi apparently unmaintained, and both with bugs. I regret to say that I didn't investigate further and can't make fixing these a priority, but I share your concern about a gap in the feature-space here.

IMO the "reordering" part of that lib is not useful (I'm about to publish a blog post on roughly that topic) and the only useful part is determining embedding levels. This still has bugs.

@dhardy
Contributor

dhardy commented Sep 5, 2020

In case you didn't see it, this is the mentioned article. It's light on details but gives the gist of why character-reordering is not useful.

@raphlinus
Author

raphlinus commented Sep 5, 2020

The article seems correct to me about why character reordering is not useful; it seems to me it's a legacy concept from a time when you might do reordering but not full OpenType-style shaping to get reasonably good results. In 2020 I think it's fair to say that a foundational crate should support the real shaping case but none of these half measures.

Otherwise the article seemed a bit confused to me. It's obviously necessary to do BiDi analysis before line breaking to figure out which runs are RTL and which are LTR, as that affects width. But it shouldn't be necessary, or even desirable, to do reordering of level runs until after the line breaks are computed. The width of a line should be the same no matter what order the level runs appear in.

For completeness, I should point out that the "reset" logic in the BiDi algorithm technically reverses the order of some runs. But (a) that's whitespace, so the measurement (and rendering) should ideally be the same in both directions, and (b) it's trailing whitespace, so even if it were direction-dependent, it shouldn't affect line breaking.

If I were getting massively into the weeds, I'd also mention that the "reset" logic appears to set trailing whitespace up to a tab character to the paragraph direction. I'm actually not 100% sure of the implications of this, as it seems possible to me that this part is not really dependent on line breaking, and could be computed earlier. I wish I better understood the rationale of the BiDi spec authors for putting it after line breaking. In any case, tabs are one of the many reasons text rendering hates you, and I'm personally fine with deferring them to a later cycle of development.
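To make the "reset" step concrete, here is a minimal sketch of rule L1 from UAX #9 applied to a single line. It is a simplification under stated assumptions: a plain space stands in for bidi class WS and a tab for class S, isolate formatting characters and paragraph separators are ignored, and the function name `reset_line_levels` is made up for illustration rather than taken from any crate.

```rust
/// Simplified sketch of UAX #9 rule L1 for one line (hypothetical helper).
/// Assumption: ' ' stands in for bidi class WS and '\t' for class S.
fn reset_line_levels(line: &[char], levels: &mut [u8], para_level: u8) {
    assert_eq!(line.len(), levels.len());
    // `reset` is true while we are inside whitespace that runs up to the end
    // of the line or up to a tab. Scanning backwards makes this one pass.
    let mut reset = true;
    for i in (0..line.len()).rev() {
        match line[i] {
            '\t' => {
                // The tab itself also takes the paragraph level.
                levels[i] = para_level;
                reset = true;
            }
            ' ' if reset => levels[i] = para_level,
            ' ' => {}
            _ => reset = false,
        }
    }
}
```

Because the input is the slice for one line, the result depends on where the line ends, which is presumably why the spec places this step after line breaking.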

Also, in case you didn't see, on the linked unic-bidi issue (273), the icu4x people invited us to file an issue. I'm very tempted to do that, and icu4x may be the right place where this happens, but I'm also a bit wary; it seems like it might be a pretty long road until there's something usable.

@dhardy
Contributor

dhardy commented Sep 5, 2020

> But it shouldn't be necessary, or even desirable, to do reordering of level runs until after the line breaks are computed.

Unicode TR9 is clear that reordering happens after line breaking, but I agree, it does sound wrong. Frankly, I didn't find a good way of testing this. The shaping code handles the RTL reversal and further reversals beyond this are rare.

I ignored the reset stuff: as you say, it only applies to whitespace. Testing showed that I get the right results anyway (after fixing some other bugs), in what I've tested. (I'd rather take the "lazy approach" and wait for someone else to show why I'm wrong — that way I already have an okay product and a user, instead of being stuck in the details. Even so there are more details I'm stuck in.)

There's one case I still need to handle better: line-breaking an embedded section with opposite direction to the line's base direction. Currently I handle that by forcing a line-break and letting the next line go the other way, but I believe that is not the "expected" approach (e.g. adjust the width of this example until "Unicode Conference" is forced over a line break). I didn't find anything on this in TR9. This is also an issue with "no break runs": e.g. in this sample there should not be a line-break between the parentheses and "Unicode Conference", which implies that if a line break happens here, then my current behaviour (pushing to a new LTR line) is inappropriate.

@raphlinus
Author

raphlinus commented Sep 5, 2020

I think UAX 9 covers that pretty well; it's the meat and potatoes of why this is "BiDi" and not simply RTL.

Here's my take: If your text is "ABCDEF (Unicode Conference) GHIJKL", then your line breaks are 7, 16, 28, and 34 (EOL). If the paragraph direction is RTL, then the level runs are 0..8 level 1, 8..26 level 2, and 26..34 level 1. In other words, the text "Unicode Conference" is level 2, and everything else (including the parentheses) is level 1.
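As a rough illustration of how those level runs fall out, the sketch below groups per-character embedding levels into maximal runs and tags each with a direction (even level is LTR, odd is RTL). The levels are hard-coded from the numbers above rather than computed by a real UBA implementation, and `Direction` and `level_runs` are hypothetical names, not the API of any existing crate.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Direction {
    Ltr,
    Rtl,
}

/// Group per-character embedding levels into maximal runs of equal level.
/// Even levels are left-to-right, odd levels are right-to-left.
fn level_runs(levels: &[u8]) -> Vec<(Range<usize>, u8, Direction)> {
    let mut runs = Vec::new();
    let mut start = 0;
    for i in 1..=levels.len() {
        // Close the current run at the end of input or when the level changes.
        if i == levels.len() || levels[i] != levels[start] {
            let level = levels[start];
            let dir = if level % 2 == 0 {
                Direction::Ltr
            } else {
                Direction::Rtl
            };
            runs.push((start..i, level, dir));
            start = i;
        }
    }
    runs
}
```

Feeding it eight 1s, eighteen 2s and eight 1s yields the three runs 0..8, 8..26 and 26..34. Runs of this shape (range plus direction) are roughly what a shaper needs as input.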

If you take the line break at 16 and query for levels reordered in visual order, then the first line gives you 15..16 (level 1), 8..15 (level 2), 0..8 (level 1). The first one is a consequence of the reset logic (i.e. unic-bidi will give you the wrong answer due to its bug 272). This works out to " Unicode) FEDCBA" after reordering and mirroring, which I believe is correct. I could work the second line, but the logic is similar. All this is a direct consequence of applying UAX 9.
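The first-line result can be checked with a small sketch of rule L2 (reverse every sequence at a given level or higher, from the highest level down to the lowest odd level) followed by mirroring of characters at odd levels. The assumptions: levels are given after the reset step (space at index 15 already at level 1), mirroring covers only parentheses, and `visual_order`, `mirrored` and `render_line` are illustrative names. It produces the character-reversed string purely to verify the arithmetic; a real pipeline would hand whole runs to a shaper instead.

```rust
/// Rule L2 sketch: returns logical indices in visual (left-to-right) order.
fn visual_order(levels: &[u8]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..levels.len()).collect();
    let max = levels.iter().copied().max().unwrap_or(0);
    let min_odd = match levels.iter().copied().filter(|&l| l % 2 == 1).min() {
        Some(l) => l,
        None => return order, // no RTL content, nothing to reverse
    };
    for level in (min_odd..=max).rev() {
        let mut i = 0;
        while i < order.len() {
            if levels[order[i]] >= level {
                // Reverse the maximal sequence at `level` or higher.
                let start = i;
                while i < order.len() && levels[order[i]] >= level {
                    i += 1;
                }
                order[start..i].reverse();
            } else {
                i += 1;
            }
        }
    }
    order
}

/// Assumption: only parentheses are mirrored in this sketch.
fn mirrored(c: char) -> char {
    match c {
        '(' => ')',
        ')' => '(',
        other => other,
    }
}

/// Reorder one line and mirror characters that sit at odd (RTL) levels.
fn render_line(line: &str, levels: &[u8]) -> String {
    let chars: Vec<char> = line.chars().collect();
    visual_order(levels)
        .into_iter()
        .map(|i| if levels[i] % 2 == 1 { mirrored(chars[i]) } else { chars[i] })
        .collect()
}
```

With the line "ABCDEF (Unicode " and levels of eight 1s, seven 2s and a final 1, `render_line` returns " Unicode) FEDCBA".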

I think you might be confused when you use the term "line's base direction." That's not really a thing. There is a "base direction," but that's another name for the paragraph direction, and is the same for all lines in the paragraph.

This stuff is pretty arcane. I had the good fortune of working closely with Roozbeh Pournader and Behdad Esfahbod for several years, among others. Without their guidance, I'd have basically no clue.

And even professional systems are still likely to get the details wrong. Marijn Haverbeke's blog post "Cursor motion & bi-directional text" is a pretty good introduction to cursor motion. I think in our stack we're likely to have multiple cursors, due to the DNA from its xi-editor lineage, and we'll use that to represent visually contiguous but logically disjoint selection ranges. But I want to check with my BiDi colleagues before going too far down that route.

@dhardy
Contributor

dhardy commented Sep 5, 2020

> This stuff is pretty arcane. I had the good fortune of working closely with Roozbeh Pournader and Behdad Esfahbod

Good for you. I'm more the jack of all trades and minimise the jargon, but I've developed some interest in GUIs.

Thanks for the cursor motion link; that might actually solve another difficulty.

> "line's base direction."

I'm making terms up here. But there is such a thing as a paragraph embedding level, determining the initial direction. Apparently I was confused by rule L1: on each line, reset the embedding level .... Thanks for clearing that up.

@dhardy
Contributor

dhardy commented Sep 5, 2020

> I think in our stack we're likely to have multiple cursors

I don't know if you've tried the editor in the layout demo, but KAS uses multiple cursors at visually split positions, marking one of them grey. (This includes line-wrap positions — I probably need to change the 'end' key to go to the last position not also on the next line!) I've never seen it in an editor before but it seems to work well.

@dhardy
Contributor

dhardy commented Sep 5, 2020

Marijn Haverbeke makes cursor movement follow visual order. Is that actually a good thing vs logical order? I notice Qt moves in logical order, but reverses the direction if the paragraph direction is RTL (which is extremely confusing when navigating embedded LTR regions). One problem with navigation in visual order is that a single visual location can correspond to two logical locations; this might make it impossible to select the desired one.
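A small sketch of that last ambiguity, assuming a monospace grid where each character occupies one cell: for every character, the caret "before" it sits at its leading edge and the caret "after" it at its trailing edge, and those edges swap sides for RTL characters. The function name and the cell-based model are assumptions made for illustration only.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Map each visual caret x (in cells) to the logical caret positions drawn there.
/// `order[slot]` is the logical index of the character shown in visual slot `slot`.
fn carets_by_visual_x(order: &[usize], levels: &[u8]) -> BTreeMap<usize, BTreeSet<usize>> {
    let mut map: BTreeMap<usize, BTreeSet<usize>> = BTreeMap::new();
    for (slot, &i) in order.iter().enumerate() {
        let rtl = levels[i] % 2 == 1;
        // The leading edge is where the character starts in reading order:
        // the left side for LTR, the right side for RTL.
        let (leading_x, trailing_x) = if rtl {
            (slot + 1, slot)
        } else {
            (slot, slot + 1)
        };
        map.entry(leading_x).or_default().insert(i); // caret before char i
        map.entry(trailing_x).or_default().insert(i + 1); // caret after char i
    }
    map
}
```

For "abcDEF" in an LTR paragraph (uppercase standing in for RTL letters), the visual order is `[0, 1, 2, 5, 4, 3]` with levels `[0, 0, 0, 1, 1, 1]`. Visual x = 3 then maps to logical positions 3 and 6, while logical position 3 also appears at x = 6, so a click or arrow key at x = 3 cannot by itself say which of the two is meant.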

@raphlinus
Author

> Marijn Haverbeke makes cursor movement follow visual order. Is that actually a good thing vs logical order?

That's a really good question. He recently did a survey on Twitter and found that a small majority (maybe a plurality) preferred visual order. Visual order is what macOS native text does, and I consider that something of a gold standard. But there are enough tools out there that do logical order that maybe people are getting more used to that. I think it's necessary to do some user research to figure out what people actually prefer.

@NightMachinery

I noticed that the readme says the bidi problem has been solved? Can I use your solution to reshape and reorder some text for other apps to use? I have a simple CLI tool that does this using arabic_reshaper and unic-bidi, but those packages are buggy.

@dhardy
Contributor

dhardy commented Mar 17, 2021

@NightMachinery I don't know exactly which bidi problem you are referring to. Bidi support is implemented but, like many complex things, has bugs. I also didn't get around to implementing visual-order navigation, so that's logical-order only.

But this is Apache-2.0 licensed, so feel free to copy and adapt code, and I may be able to help if you have questions (probably best in a new issue). Possibly you could also improve the code in this library.

I'm not quite sure what problem you are trying to solve, but one of the big problems here is that calculating line-length basically requires type-setting your line before re-ordering, just so that you know where to wrap lines. Maybe your CLI tool assumes fixed-width fonts and thus can calculate wrap points more easily?

@NightMachinery

I think I got this completely wrong, sorry. I was trying to just reorder the text to use it in my terminal, and I have not looked into possible wrapping problems (though wrapping will probably not be a problem as fonts are monospace in the terminal, and using a simple | fold -w100 will suffice).

@wez

wez commented May 12, 2021

Sorry to parachute in here out of nowhere with a slightly OT comment, but it seems like this thread has folks with relevant insights!

wez/wezterm#784 is a request to support BiDi in my terminal emulator project.

I already use harfbuzz for shaping, and thanks to https://terminal-wg.pages.freedesktop.org/bidi/ I feel like I have a reasonable handle on the sort of code changes I need to make in the context of a terminal emulator.

I was hoping to do this without having to become an expert on bidi(!), but after reading through the issues that Raph opened in unic-bidi and unicode-bidi I worry about the respective completeness and suitability of those crates for my needs given my current level of understanding of this topic.

I was kinda hoping for a thing that I feed text into and it gives me runs with direction info that I can feed to harfbuzz. Is that actually what I need, and what is the current best available implementation of that?

@raphlinus
Author

Something else to take a look at is swash_demo, which has a from-scratch bidi implementation. I'm evaluating that seriously now, including the possibility of splitting it out into its own crate so it doesn't necessarily have to be used with swash.

@mbrubeck

By the way, while I'm not actively working on unicode-bidi, I am still reviewing and merging PRs. I'd be happy to accept any changes and improvements, even including major API changes. We can also add additional maintainers to the crate if somebody wants to take over development. (However, if swash's implementation is a better starting point, that's fine.)

@raphlinus
Author

Good to know. I haven't carefully reviewed the two, but think that would be good (including performance analysis). I was lacking the context that unicode-bidi was actually better maintained than unic-bidi; I thought the center of gravity had shifted to the latter and didn't realize it was more of an experiment.

@mbrubeck

mbrubeck commented May 20, 2021

The two codebases are still largely identical. (unic-bidi is a fork of unicode-bidi with very few substantial changes.) The fork definitely split the ecosystem and led to unicode-bidi getting less attention. However, it seems that the rust-unic maintainers have not had much time available lately, leading to the fork being abandoned (?) while the original is still in maintenance mode.

wez added a commit to wez/wezterm that referenced this issue Jan 25, 2022
In order to support RTL/BIDI, wezterm needs a bidi implementation.  I
don't think a well-conforming rust implementation exists today; what I
found were implementations that didn't pass 100% of the conformance
tests.

So I decided to port "bidiref", the reference implementation of the UBA
described in http://unicode.org/reports/tr9/ to Rust.

This implementation focuses on conformance: no special measures have
been taken to optimize it so far, with my focus having been to ensure
that all of the approx 780,000 test cases in the unicode data for
unicode 14 pass.  Having the tests passing 100% allows for making
performance improvements with confidence in the future.

The API isn't completely designed/fully baked.  Until I get to hooking
it up to wezterm's shaper, I'm not 100% sure exactly what I'll need.
There's a good discussion on API in
open-i18n/rust-unic#273 that suggests omitting
"legacy" operations such as reordering. I suspect that wezterm may
actually need that function to support monospace text layout in some
terminal scenarios, but regardless: reordering is part of the
conformance test suite so it remains a part of the API.

That said: the API does model the major operations as separate
phases, so you should be able to pay for just what you use:

* Resolving the embedding levels from a paragraph
* Returning paragraph runs of those levels (and their directions)
* Returning the whitespace-level-reset runs for a line-slice within the
  paragraph
* Returning the reordered indices + levels for a line-slice within the
  paragraph.

refs: #784
refs: kas-gui/kas-text#20