RTL text in conhost is no longer rendered correctly #12294

Closed
j4james opened this issue Jan 30, 2022 · 24 comments · Fixed by #12722
Labels
Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Help Wanted We encourage anyone to jump in on these. Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-1 A description (P1) Product-Conhost For issues in the Console codebase Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. zInbox-Bug Ignore me!


@j4james
Collaborator

j4james commented Jan 30, 2022

Windows Terminal version

Commit eb75597

Windows build number

10.0.19041.1415

Other Software

No response

Steps to reproduce

  1. Build a recent version of OpenConsole.
  2. Open a conhost bash shell.
  3. Execute the following command: printf "\u05ea\u05d7\u05d0\n"

Expected Behavior

RTL characters should be displayed in the exact order they were output, and not reversed. This is what it looks like in my inbox conhost (10.0.19041.1415):

image

This also matches the behaviour of XTerm.

Actual Behavior

In the current version of OpenConsole (I think since PR #10478), RTL characters are reversed, like this:

image

I realise that some people might consider this a good thing, since it gives the superficial appearance that it's rendering RTL languages correctly, but it is not compatible with the original conhost and breaks genuine RTL-aware applications (which rely on characters being displayed exactly where they've been positioned).

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Jan 30, 2022
@j4james
Collaborator Author

j4james commented Jan 30, 2022

I'm not suggesting we revert PR #10478, since I'd hate to lose the benefits we get from that, but I think the RTL behaviour could be fixed by inserting an additional step to calculate the glyph indexes with GetCharacterPlacementW, before calling ExtTextOutW with ETO_GLYPH_INDEX. As long as we don't set the GCP_REORDER flag, the characters should be displayed in the original buffer order.

@j4james
Collaborator Author

j4james commented Mar 1, 2022

@DHowett Don't want to nag, but note that this is a regression in conhost, and I'm a little concerned that it hasn't been triaged and may have been overlooked.

@zadjii-msft
Member

Sorry, yes this was overlooked. I think mentally I kinda go "yep, I'm sure that's a real bug" when I see your name as the filer 😋 I'll toss this in 1.14. We should fix this for the OS version of conhost.

@zadjii-msft zadjii-msft added Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Help Wanted We encourage anyone to jump in on these. Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-2 A description (P2) Product-Conhost For issues in the Console codebase labels Mar 7, 2022
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Mar 7, 2022
@zadjii-msft zadjii-msft added this to the Terminal v1.14 milestone Mar 7, 2022
@zadjii-msft zadjii-msft added the zInbox-Bug Ignore me! label Mar 10, 2022
@DHowett DHowett removed the Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting label Mar 12, 2022
@DHowett
Member

DHowett commented Mar 12, 2022

Pulled triage. Sorry @j4james, I've been snowed in on e-mail as I had to leave to take care of some family stuff. Thanks for the first pass, Mike.
d

@DHowett
Member

DHowett commented Mar 15, 2022

Yes. This is very important for us to fix. /cc @alabuzhev for thoughts on how ExtTextOut makes our lives more difficult here.

@j4james
Collaborator Author

j4james commented Mar 15, 2022

FYI, my quick hack fix for this was to replace the ExtTextOutW call here:

if (!ExtTextOutW(_hdcMemoryContext, t.x, t.y, t.uiFlags, &t.rcl, t.lpstr, t.n, t.pdx))

with something like this:

std::array<wchar_t, 1000> glyphs;
GCP_RESULTS results{};
results.lStructSize = sizeof(results);
results.lpGlyphs = glyphs.data();
results.nGlyphs = gsl::narrow_cast<UINT>(glyphs.size());
GetCharacterPlacementW(_hdcMemoryContext, t.lpstr, t.n, GCP_MAXEXTENT, &results, 0);
if (!ExtTextOutW(_hdcMemoryContext, t.x, t.y, t.uiFlags | ETO_GLYPH_INDEX, &t.rcl, results.lpGlyphs, results.nGlyphs, t.pdx))

Obviously not intended to be production code, but you get the idea.

@alabuzhev
Contributor

RTL characters should be displayed in the exact order they were output, and not reversed

I wish this was true. And also each character occupied exactly one cell. And no zero width. And no surrogates. And no clusters. And so on and so forth. Unfortunately, text processing is a PITA.

but it is not compatible with the original conhost and breaks genuine RTL-aware applications (which rely on characters being displayed exactly where they've been positioned)

Then the new and shiny Windows Terminal is also not compatible with the original conhost and breaks such RTL-aware applications. Are there any complaints from their maintainers? Should it be fixed there too? And in conhost DX renderer? And in conemu, console2 and other similar frontends?

Overall, my experience here is extremely limited, I don't work with RTL and can't say how it should be. @trexinc, I remember somewhat related discussions eons ago on the forum about how the console should behave with RTL to make life less painful. Do you have any opinion about this?

@alabuzhev
Contributor

Speaking about compatibility with the original conhost: as mentioned here, font fallback used to work in pre-Windows 7 days, when NtGdiConsoleTextOut was used. I've just checked this on Windows XP and RTL is also reversed there:

image

@j4james
Collaborator Author

j4james commented Mar 16, 2022

Then the new and shiny Windows Terminal is also not compatible with the original conhost and breaks such RTL-aware applications.

It's been fixed in the new atlas renderer.

@lhecker
Member

lhecker commented Mar 16, 2022

@j4james I believe this doesn't work with font fallback. I think if you try to draw Japanese text for instance, it'll show just blank / whitespace glyphs.

As far as I can see, the only way to resolve this issue while keeping both font fallback and broken RTL support is to use ScriptItemize, then ScriptShape, ScriptPlace and finally ScriptTextOut. That way we can set fLogicalOrder in SCRIPT_ANALYSIS to TRUE, ensuring we skip glyph reordering (if I understand the docs correctly). Unfortunately, I don't even see any undocumented escape hatches for ExtTextOutW internally.

@alabuzhev Wait... Did you paste them exactly as \u05ea\u05d7\u05d0 into the XP console? I would be somewhat surprised if we had supported RTL reordering back then... But if it used to work, then I wonder what the actually correct path forward is.

@alabuzhev
Contributor

Did you paste them exactly as \u05ea\u05d7\u05d0 into the XP console?

Yes, this as is: תחא

Moreover:

image

@lhecker
Member

lhecker commented Mar 16, 2022

It's been fixed in the new atlas renderer.

While I would love to take that compliment as the author of the engine as is, I have to confess that this is unfortunately more like a side-effect from me not implementing RTL/BiDi support at all. 😕
Up until today I simply assumed that people are really really disappointed in Windows not properly supporting BiDi text and glyph reordering. Practically most popular UNIX terminals reorder their glyphs after all... Also I can't quite imagine how manually reordered Arabic glyphs would work... But I guess applications rely on this now?


@alabuzhev Wow! This is impressive! On one hand I think I now understand that we'll likely have to revert the ExtTextOutW benefits at least in parts, so that we don't break applications which rely on our broken behavior, but on the other hand... Just wow! This makes me at least personally somewhat conflicted about re-breaking glyph reordering. 😅

@alabuzhev
Contributor

so that we don't break applications which rely on our broken behavior

That's kinda my point - do you already have dozens of reports like "things are broken, the world is falling apart, do something now"?
Support for anything non-ASCII in Windows Console has always been like "it depends" - on the OS version, current font, system locale, console codepage, output method, the phase of the Moon etc. Personally I haven't seen any applications even trying to cover non-trivial cases, but YMMV of course.

@j4james
Collaborator Author

j4james commented Mar 16, 2022

@j4james I believe this doesn't work with font fallback. I think if you try to draw Japanese text for instance, it'll show just blank / whitespace glyphs.

Yeah, you're right. I've just tested and that's not working for me either. Oh well.

I would be somewhat surprised if we had supported RTL reordering back then... But if it used to work, then I wonder what the actually correct path forward is.

Yeah, that's weird. It's definitely not reordering RTL characters for me in the legacy console.

But I guess applications rely on this now?

That would be because it's almost impossible to write an RTL application on terminals that don't work this way. Give it a try. See if you can write some basic RTL applications on one of those terminals that reorders RTL characters. Like a simple RTL form entry system, or something that pops up a dialog or drop-down menu over existing RTL text. Maybe I'm just an idiot, but I can't see how you can make that work, but it's fairly straightforward on terminals that leave RTL characters exactly where you put them.

@lhecker
Member

lhecker commented Mar 16, 2022

So I think we have 3 options here with various benefits:

  1. Keep ExtTextOutW - No work needed
    According to Wikipedia's web statistics I can guess that about 50% of Windows users don't use Latin characters for their primary language and about 10% use RTL scripts. The addition of font fallback has a very far-reaching positive impact on our users, who so far were unable to use fonts like Consolas.
  2. Revert to PolyTextOutW - Minutes of work required
    Normally stability trumps anything else when it comes to conhost. Keeping the output of glyphs in their logical order, ensures we don't accidentally break Hebrew TUI applications.
  3. Use Uniscribe manually - Potentially days of work required
    This would fix the issue and show glyphs in their logical order.

Regarding 2. and 3.: I'm pretty sure that this will re-break Arabic scripts, since those lean heavily on ligatures and glyph reordering to render correctly. So as far as I can see, conhost using logical order and the TUI application writing the glyphs backwards manually will (practically at least) only work for Hebrew, since Hebrew is a bit like "Latin in RTL".
Allowing Hebrew to work with Bidi-aware TUI applications, but making it impossible (or very hard) to use Arabic correctly, despite the latter being 10x more common, leaves a bit of a bad aftertaste in my opinion. Personally I'm leaning towards breaking Bidi-aware TUI applications, but allowing Arabic users to read their language.

However I understand that we'd not want to take any chances in regressions and would thus consider opting for 2. @miniksa?

@miniksa
Member

miniksa commented Mar 16, 2022

As far as I can discern, no one ever actually concerned themselves with Arabic nor Hebrew support in the console host. The targeted languages were basically LTR European type character sets + the CJK trio. Beyond that... it looks like anything else that worked or didn't was a happy accident.

Furthermore, when our localization team tells us what languages we can pay for in terms of translations for developer utilities today... they limit it to: German, English, Spanish, French, Italian, Japanese, Korean, Brazilian Portuguese, Russian, Simplified Chinese, and Traditional Chinese. I'd therefore have to believe to some degree that research was performed to determine that the appropriate balance between resources and developer market was to focus on those languages.

Therefore, my consideration here is happiness of those languages as primary goal with anything else being secondary.

Further, one of the most popular issues filed against conhost.exe is the lack of font fallback for Chinese, Japanese, and Korean languages. Switching to ExtTextOut (Option 1) to restore font fallback, therefore, dramatically reduced our inbound bug flow and solved an issue for four of the targeted languages.
An issue I've never seen filed in Feedback Hub, directly from our OEM customers, our business partners, or otherwise in the last 7-8 years of working on this is anything about Hebrew or Arabic. I know that's super scientific... to rely on my past experience.

But with the combination of those reasons, I would have to personally opt for Option 1.

I would offer to @lhecker, if he's interested, that next week is our organization's "Fix Hack Learn" week again. If he wants to spend a few days hacking Option 3 using Uniscribe to solve this problem and learn more about language processing... he would be free to do so. I think it would be better, though, long-term to focus efforts on supporting those languages fully in the Terminal and the Atlas renderer.

The discussion can continue, I'm not shutting it down. This is just my opinion on the situation.

@j4james
Collaborator Author

j4james commented Mar 16, 2022

I have a suggestion for another possible solution which may keep everyone happy.

We carry on using ExtTextOut, but if we detect that the string contains RTL characters, we switch to a slower rendering branch that outputs one character at a time. That way the characters should all be displayed in the right place (by which I mean they won't be reordered).

@lhecker
Member

lhecker commented Mar 16, 2022

How do you detect runs of RTL glyphs? If the answer is Uniscribe, I think we can just go all the way and use it for text drawing too...
Especially since we can't output them characterwise, as that would break ligatures, ZWJ, etc. and all the other fun Unicode stuff. Rendering glyphs in their logical order with Unicode support is only possible if we opt into Uniscribe 100% I think.

@j4james
Collaborator Author

j4james commented Mar 17, 2022

How do you detect runs of RTL glyphs?

I was thinking of something simple like a range check. Haven't looked at the unicode blocks in detail, but you could start with everything from U+0590 to U+08FF, and maybe another block covering supplementals. It doesn't really matter if we get false positives, because they're still going to render correctly - just a bit slower - and they ought to be rare.
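To make the range-check idea concrete, here is a minimal sketch (not the code that was eventually merged). The function names `mayBeRtl` and `containsRtl` are hypothetical. The U+0590 to U+08FF range comes from the comment above; the presentation-form ranges are an assumed stand-in for the "supplementals", and RTL scripts outside the BMP are deliberately ignored.

```cpp
#include <string_view>

// Hypothetical helper: true if a UTF-16 code unit lies in a block that may
// contain RTL characters. False positives are acceptable here, since they
// would only route text onto the slower rendering path.
constexpr bool mayBeRtl(char16_t ch) noexcept
{
    return (ch >= 0x0590 && ch <= 0x08FF) || // Hebrew through Arabic Extended-A (range from the comment above)
           (ch >= 0xFB1D && ch <= 0xFDFF) || // Hebrew/Arabic presentation forms (assumed extension)
           (ch >= 0xFE70 && ch <= 0xFEFF);   // Arabic Presentation Forms-B (assumed extension)
}

// Hypothetical helper: true if any code unit in the run may be RTL.
// Surrogate pairs (RTL scripts outside the BMP) are not handled in this sketch.
constexpr bool containsRtl(std::u16string_view text) noexcept
{
    for (const char16_t ch : text)
    {
        if (mayBeRtl(ch))
        {
            return true;
        }
    }
    return false;
}
```

A renderer could call `containsRtl` on each run before drawing and only take the slow path when it returns true, so plain Latin or CJK text keeps the fast path.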

Especially since we can't output them characterwise, as that would break ligatures, ZWJ, etc. and all the other fun Unicode stuff.

OK ligatures I can see being a problem for Arabic text. I've only really dealt with Hebrew so I'm not sure how well that works. The other unicode stuff seems less of a problem. Does any of that stuff work now? Are we ever expecting it to work in the GDI renderer?

@lhecker
Member

lhecker commented Mar 17, 2022

Does any of that stuff work now? Are we ever expecting it to work in the GDI renderer?

Yeah it does! While Uniscribe doesn't seem to support "liga" in fonts, it does correctly handle ligatures in most languages. You gotta say, this is pretty nice to see in good old conhost, right?
(The output isn't perfect mind you, but this is still really good IMO...)

image

I'd be quite sad if we lost that, but I'd understand that it'd be for a good cause.
I've already told @miniksa that I'll try to implement a solution for logical glyph ordering next week. I'm at least curious how much we'd "lose" if we disable glyph reordering (aka use "logical ordering"), since I'm not a Unicode expert and I can only guesstimate that it'd probably break Arabic without any possibility to fix it in a TUI application.

Given that most terminals apart from xterm seem to not draw glyphs in their logical order either, I do wonder however whether it's reasonable to merge my fix, even if I submit such a PR later.
I'd go with your opinion @j4james since you're vastly more experienced in this field than me, but I get the feeling that we'd be more consistent with other terminals if we'd actively not support TUI applications implementing their own BiDi (by reordering characters themselves) even if it breaks the TUI's layout...

BTW for my own curiosity: Do you happen to have a specific application at hand that positions Hebrew text manually? This would allow me to better test my Uniscribe- (or char-range-) experiments.

@j4james
Collaborator Author

j4james commented Mar 17, 2022

Yeah it does! While Uniscribe doesn't seem to support "liga" in fonts, it does correctly handle ligatures in most languages. You gotta say, this is pretty nice to see in good old conhost, right?

Yeah, I saw the ligatures were working. I meant the other things you were referring to when you said "ZWJ, etc. and all the other fun Unicode stuff".

And while it looks nice at first glance, it's not particularly useful as is, as far as I'm concerned. You've got no hope of editing the text - all it's really good for is displaying a single line of content at best.

I'd be quite sad if we lost that, but I'd understand that it'd be for a good cause.

Yeah, ideally we'd have a solution that was realistically usable and also looked pretty, but I don't know how feasible that is for languages with ligatures. I thought with something like Arabic, an application might be able to output the appropriate form of each character manually, which might make up for the loss of ligature support, but I don't know enough about the subject to know if that's nonsense.

Given that most terminals apart from xterm seem to not draw glyphs in their logical order either

I wouldn't have said "most", but I haven't checked recently. And for those that do draw the glyphs in RTL order, there's not a standard of any sort that they're following - they all do things differently. Thankfully some of them at least have a way to turn that functionality off.

if we'd actively not support TUI applications implementing their own BiDi

You realise that just means we're saying we don't support TUI applications, full stop (at least for BiDi languages). If that's the route we want to take, that's fine - I seem to be in the minority in wanting support for RTL TUI apps. I'd just like to know for definite where we're going with this, so I can make my own plans accordingly.

BTW for my own curiosity: Do you happen to have a specific application at hand that positions Hebrew text manually?

Well, there is the command line utility from fribidi, which can be used as a kind of bidi-aware version of cat (amongst other things). And there's also a Hebrew mode in vim (i.e. vim -H). The other applications I have are unfortunately not open source.

@lhecker
Member

lhecker commented Mar 17, 2022

So instead of creating a Unicode standard at some point to standardize the (cell) width per grapheme cluster, applications started to straight up write "characters" in reverse. That's actually scary. 😨

Thanks for the tip with vim -H. I wasn't aware of that functionality.
You're also right that other terminals make this configurable.

In either case I'm convinced now and will make sure to build something that restores the previous behavior as soon as I can. I mean I already planned to do it, but now I'm doing it out of conviction. 😄
I don't think I'll go with your idea however (drawing text character-wise if RTL is detected), as I think that's categorically the wrong approach. At least I'm pretty sure from all I know, that character-wise drawing has significant flaws. The simplest example for that I can think of are ligatures again, where א‎ and ל‎ are two separate characters in Hebrew, but can form ﭏ if written next to each other. (I seriously wonder how vim -H supports that... I guess they just assume the font doesn't support this, since ﭏ is old Hebrew.)
Uniscribe is generally a lot faster than DirectWrite and I don't think we'll run into any performance problems any time soon, even if we make full use of it here. This is especially so, since ExtTextOut is implemented in terms of Uniscribe anyways.

I know approximately how to draw Unicode with Uniscribe, but it'll probably take me ages to integrate that into the GdiEngine... Implementing the ligature and wide glyph support in AtlasEngine was simple since it was a new project after all (which is why it handles Emojis with ZWJs for instance).
But I'll manage... I hope. 😅

@j4james
Collaborator Author

j4james commented Mar 17, 2022

I don't think I'll go with your idea however (drawing text character-wise if RTL is detected), as I think that's categorically the wrong approach.

I agree with you there. I just thought it might be better than nothing, but if there's a way to do things properly with Uniscribe, I would be thrilled.

One thing you may need to watch out for when testing is support for horizontal scrolling in conhost (disable the "wrap text output on resize" option, and make the buffer size wider than the window size). When the viewport isn't at the left margin, it can start rendering halfway through the buffer, which breaks things completely in the current implementation when RTL text is reordered.
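As a rough illustration of why a scrolled viewport and reordering don't mix, here is a simplified model. It assumes the whole row is a single RTL run and that "reordering" is a plain reversal, which is far cruder than what GDI actually does. The function names are hypothetical, and Latin letters stand in for RTL cells so the result is easy to read.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Reorder the complete row first, then clip to the viewport starting at
// column `left`. This models a renderer that always sees the full row.
std::u32string reorderThenClip(std::u32string row, std::size_t left)
{
    std::reverse(row.begin(), row.end());
    return row.substr(std::min(left, row.size()));
}

// Clip to the viewport first, then reorder only the visible cells. This
// models a renderer that starts drawing partway through the buffer.
std::u32string clipThenReorder(const std::u32string& row, std::size_t left)
{
    std::u32string visible = row.substr(std::min(left, row.size()));
    std::reverse(visible.begin(), visible.end());
    return visible;
}
```

With the row `U"abcdef"`, both functions agree when `left` is 0, but at `left == 2` the first yields `dcba` while the second yields `fedc`, so the same buffer shows different cells depending on the scroll position. Drawing in logical order sidesteps this, since clipping then just drops leading cells.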

Hopefully it won't be a problem if we're going back to rendering in logical order again, but you may get some weird artifacts when ligatures are split across the viewport border. You'll likely have similar problems when selecting text, and cursoring over text (depending on the cursor type). But I don't think it's the end of the world if we don't have all the edge cases working perfectly to start with.

Also note that the DECDWL double-width sequence has an effect on the horizontal offsets in the viewport, so that needs to be accounted for too.

And one last tip for testing. If, like me, you don't speak any RTL languages, I've found it helpful to use nonsense Hebrew content that looks vaguely like English, so I can more easily tell when something has gone wrong.

For example, the phrase below looks a bit like "young puppy won't nip on jogging pony".

ץחסק פחופפסנ חס קוח ז׳חסש ץקקטק פחטסץ

And the equivalent reversed text (which should look correct when the renderer doesn't do RTL reordering):

ץסטחפ קטקקץ שסח׳ז חוק סח נספפוחפ קסחץ
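For anyone wanting to generate such reversed test lines, here is a minimal sketch. The reversed phrase above is a code-point-by-code-point reversal of the original, so a plain reversal of a UTF-32 string reproduces it. The function name `toVisualOrder` is hypothetical, and this is not a BiDi implementation: it assumes pure RTL text with no combining marks, digits or embedded LTR runs.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: reverse a line of pure RTL text code point by code
// point, so that it reads correctly on a renderer that does no reordering.
// Assumes no combining marks or mixed-direction content.
std::u32string toVisualOrder(std::u32string logical)
{
    std::reverse(logical.begin(), logical.end());
    return logical;
}
```

Applied to the three characters from the original repro (U+05EA U+05D7 U+05D0), this gives U+05D0 U+05D7 U+05EA.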

@lhecker
Member

lhecker commented Mar 18, 2022

@j4james Damn that was almost too easy - took like 5 minutes: https://github.com/microsoft/terminal/compare/dev/lhecker/12294-bidi-override

diff --git a/src/renderer/gdi/paint.cpp b/src/renderer/gdi/paint.cpp
index 598ce3489..bc4c4e95a 100644
--- a/src/renderer/gdi/paint.cpp
+++ b/src/renderer/gdi/paint.cpp
@@ -445,9 +445,21 @@ using namespace Microsoft::Console::Render;
         for (size_t i = 0; i != _cPolyText; ++i)
         {
             const auto& t = _pPolyText[i];
-            if (!ExtTextOutW(_hdcMemoryContext, t.x, t.y, t.uiFlags, &t.rcl, t.lpstr, t.n, t.pdx))
+
+            SCRIPT_STATE ss{};
+            ss.fOverrideDirection = TRUE;
+
+            SCRIPT_STRING_ANALYSIS ssa;
+            hr = ScriptStringAnalyse(_hdcMemoryContext, t.lpstr, t.n, 0, -1, SSA_GLYPHS | SSA_FALLBACK | SSA_LINK, 0, nullptr, &ss, t.pdx, nullptr, nullptr, &ssa);
+            if (FAILED(hr))
+            {
+                break;
+            }
+
+            hr = ScriptStringOut(ssa, t.x, t.y, t.uiFlags, &t.rcl, 0, 0, FALSE);
+            ScriptStringFree(&ssa);
+            if (FAILED(hr))
             {
-                hr = E_FAIL;
                 break;
             }
         }

I think typing this message took longer than writing that code. 😄

ScriptStringAnalyse is a handy function that calls ScriptItemize, ScriptShape, ScriptPlace, and ScriptBreak for you.
Due to the lack of batching this approach is a lot slower than ExtTextOut though. My plan is to call those 4 functions myself (well 3, because we don't really need ScriptBreak) and call ScriptIsComplex. If it's false I can just straight up call ExtTextOut to ensure the expected performance in the general case.

If I can't make it for whatever reason though, I think this is what we could ship, since it works.

@lhecker lhecker self-assigned this Mar 18, 2022
@lhecker lhecker added Priority-1 A description (P1) and removed Priority-2 A description (P2) labels Mar 18, 2022
@ghost ghost added the In-PR This issue has a related PR label Mar 18, 2022
@ghost ghost closed this as completed in #12722 Mar 21, 2022
ghost pushed a commit that referenced this issue Mar 21, 2022
Some applications like `vim -H` implement their own BiDi reordering.
Previously we used `PolyTextOutW` which supported such arrangements,
but with a0527a1 and the switch to `ExtTextOutW` we broke such applications.
This commit restores the old behavior by reimplementing the basics
of `ExtTextOutW`'s internal workings while enforcing LTR ordering.

## Validation Steps Performed
* Create a text file with "ץחסק פחופפסנ חס קוח ז׳חסש ץקקטק פחטסץ"
  Viewing the text file with `vim -H` presents the contents as expected ✅
* Printing enwik8 is as fast as before ✅
* Font fallback for various eastern scripts in enwik8 works as expected ✅
* `DECDWL` double-width sequences ✅
* Horizontal scrolling (apart from producing expected artifacts) ✅

Closes #12294
@ghost ghost added Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. and removed In-PR This issue has a related PR labels Mar 21, 2022
DHowett pushed a commit that referenced this issue Mar 24, 2022

(cherry picked from commit d97d9f0)
This issue was closed.