Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update wcwidth/CharWidth.ts #1707

Closed
wants to merge 10 commits into from
Closed

Update wcwidth/CharWidth.ts #1707

wants to merge 10 commits into from

Conversation

dnfield
Copy link

@dnfield dnfield commented Sep 24, 2018

The current wcwidth implementation isn't up to date with the latest from Unicode. For many emojis, this means that they're treated as a single terminal cell when they really need to take up 2, resulting in lots of strange input/rendering problems for people who like moons, eggplans, and smiley faces.

This PR updates it. One thing I'm not 100% sure about is whether the wideBMP methods are still needed - it looks to me like they're meant to be a performance optimization, but I'm not 100% sure about what's going on there and didn't want to mess with it too much.

This fixes #1059 as well as microsoft/vscode#51385 as well as #1502

@dnfield
Copy link
Author

dnfield commented Sep 24, 2018

Tests were passing locally but looks like I didn't update the build.. Will fix.

@dnfield
Copy link
Author

dnfield commented Sep 24, 2018

There's still one bug with translateBufferLineToString - if tthe emoji happens on a boundary (which the failing test case just happens to demonstrate - if you add one character or take one out before the emoji it passes), it's adding whitespace.

@dnfield
Copy link
Author

dnfield commented Sep 24, 2018

I believe this fix is correct but would appreciate some review on it, especially since I had to update a few tests to make it consistently work.

It looks like print had a bug wherein it would insert an extra space character when wrapping was involved... Latest commit has relevant changes.

@jerch
Copy link
Member

jerch commented Sep 24, 2018

@dnfield Wow thx! You are kinda tackling two of the most annoying subtile core bugs in xterm.js at once.

  1. The wcwidth version problem:
    This is not an easy one, not sure if it can be resolved at all. Problem behind the scenes is the unicode version hell with no reliable way to detect the correct unicode set. So far we delivered the wcwidth definitions of v5.2 (about 10 years old) which was not a problem until recently when ppl started to use emojis in terminals. But those are pretty new in the standards and, even worse, some made a shift from halfwidth to fullwidth in the specs. More about the background story, halfwidths, fullwidths and ambigious chars here revamp wcwidth #1467
    Since there are systems/applications out there that dont yet support unicode 11, we will have to deliver multiple definitions and make it configurable. Related to this are the other unicode features like bidi support and the new grapheme rules. (see Grapheme support #1468).
    So yeah, we need a way to provide configurable unicode rule sets first. Pure unicode mess lol.

  2. About the translateBufferLineToString and the InputHandler.print:
    Thats discussed here fix empty cell representation #1685. The FIXMEs in the comments you might have stumbled over are all related to this. Although your change to .print fixes it for the last cell + a fullwidth char, it will not work with grapheme clusters anymore once we decide to support them.

I think we should not mix both issues into one PR, since each might have side effects on the rest of the code base (our test coverage isnt that good for some core parts, we are working on it). Imho your PR is a good starter for the unicode mess 👍
For the other issue, please grab #1685 and start another PR if you like.

@Tyriar
Copy link
Member

Tyriar commented Sep 24, 2018

Nice, line wrapping seems to work in both macOS HS and Ubuntu 18.04 👍 these changes will end up closing off a bunch of issues 😍

@@ -508,7 +508,7 @@ describe('Buffer', () => {
// const s = terminal.buffer.contents(true).toArray()[0];
// assert.equal(input, s);
for (let i = 10; i < input.length; ++i) {
const bufferIndex = terminal.buffer.stringIndexToBufferIndex(0, i + 1); // TODO: remove +1 after fix
const bufferIndex = terminal.buffer.stringIndexToBufferIndex(0, i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address this with a separate PR. The test case tested the wrong behavior by purpose to track it.

const s = terminal.buffer.iterator(true).next().content;
assert.equal(input, s);
assert.equal(getStringCellWidth(s), 12);
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to #1685 you can temporarily fix your test case by placing a whitespace after every second emoji char.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test wasn't breaking, it was another existing one ("Lots of makes me ." - smiley just happend to occur after a break in the buffer, leaving an extra space)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yupp my bad, the line isnt wrapping (mixed cols and rows in my head).

@@ -407,6 +407,9 @@ export class InputHandler extends Disposable implements IInputHandler {
// autowrap - DECAWM
// automatically wraps to the beginning of the next line
if (wraparoundMode) {
if (buffer.x === cols - 1) {
bufferRow.set(buffer.x, [curAttr, '', 0, undefined]);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said above, please move this to a separate PR.

[0x1f9c0, 0x1f9c2], // (nil) ..
[0x1f9d0, 0x1f9ff], // (nil) ..
[0x20000, 0x2fffd], // Cjk Unified Ideograph-20..
[0x30000, 0x3fffd] // (nil) ..
Copy link
Member

@jerch jerch Sep 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need some version selection thing on top of this and support both rule sets not to leave ppl with 3ys old systems behind.
Edit: This also will need a setting for the ambigious chars.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure how to best make that work in this codebase. Am I right in thinking the concern is that if someone is running xterm.js with a font that hasn't been updated in three years it would cause problems? Or something else?

Copy link
Member

@jerch jerch Sep 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the first place we have to make sure, the active spec is in line with the pty downwards (the shell/foreground app at the slave side + the pty itself), since they expect a certain string to fit into xy terminal cells. Doing otherwise weird things will happen to the cursor.
The rendering metrics are secondary, and with the canvas renderer we can force them fit to some degree even if the font is not monospace. The best case is the local app where both frontend and backend run on the same system. It gets really interesting for remote sessions with ssh, there is literally no way to tell which unicode version the backend runs. Therefore I think we need some option for that.
Imho the easiest way for now would be to put both definitions into some key-value storage and expose the key as the option.
Edit: Or even easier - both versions closured behind an access key. The option (key) would have to be managed by the InputHandler, so if someone request the other version, the input handler could swap the implmentation. Something like that.

@jerch
Copy link
Member

jerch commented Sep 24, 2018

@dnfield Btw the isWideBMP cannot be remove, it is used to create a lookup table for BMP chars (gives like 5x speedup compared to the bisect, which is crucial since every single char has to go through this).

@jerch
Copy link
Member

jerch commented Sep 24, 2018

@Tyriar We gonna need at least two other API options for this - one to request an unicode version (which loads to correct rule set) and another to set the current locale setting for the ambiguous chars.
For the unicode versions we might get away with supporting only two, the old one and v11. (Older OSX versions still might show problems with emojis, imho they adopted a drafted spec in between with different definitions).

@dnfield
Copy link
Author

dnfield commented Sep 24, 2018

I've moved the requested commits to another branch (https://github.com/dnfield/xterm.js/tree/wrapping_unicode), but I may hold off to open a PR for it until this lands to avoid confusion (it would still show all the changes in this PR, minus the additional space I added to make the "Lots of yen" test pass). Let me know what you'd prefer there.

@jerch
Copy link
Member

jerch commented Sep 24, 2018

@dnfield About the "other" issue - hmm maybe we can go with it fixing only the last cells for now, it surely will resolve a bunch of reported issues. Also I think it should be covered quit good by test cases (those you had to change in Terminal.test.ts). Things that will need some extra love are the selection manager, the search addon and the linkifier, they heavily rely on translateBufferLineToString and I am not sure, if there are not workarounds in place that are likely to break after the change.

@jerch
Copy link
Member

jerch commented Sep 24, 2018

@dnfield Just lemme know if you get stuck or need some pointers. I kinda feel responsible for that stuff, have it way to long on the todo list. Many thx in advance!

@dnfield
Copy link
Author

dnfield commented Sep 24, 2018

So should I un-revert those changes that (partially?) relate to the wrapping bug?

@jerch
Copy link
Member

jerch commented Sep 24, 2018

Ah I'd still prefer them in a separate PR since they address a quite different issue. Its just those changes might be almost ready to go, I think you should not go for the whole problem behind #1685 as I kinda suggest above.

@dnfield
Copy link
Author

dnfield commented Sep 24, 2018

Makes sense to me.

As far as the toggle/configurability for unicode, I could use some pointers as to where that kind of configuration would be stored/used. It sounds like it wouldn't be a huge deal to implement, but I don't want to reinvent something you already have.

@jerch
Copy link
Member

jerch commented Sep 24, 2018

My thinking is to keep it internal for now, the only place it is used is InputHandler, maybe some ctor arg + getter/setter will do there.
We can later decide whether this needs to be exposed to users. My hope is still that we can implement some good enough heuristics. Exposing to many low level options will only lead to headache in front of the computer lol.

@dnfield
Copy link
Author

dnfield commented Sep 24, 2018

Added option - I think that's what you're asking for, but let me know.

@jerch
Copy link
Member

jerch commented Sep 25, 2018

Yes almost - just one more thing. To be able to switch from legacy to v11 forth and back at runtime, we will need 2 function symbols. I'd say just remove the immediately invoked function construct and build the v11 version as the default wcwidth symbol (new things first 😄) and also export the factory function. At runtime later on, the input handler could just create the other one if needed.

Edit: Sorry - its easier to create both versions at startup, maybe wcwidth and wcwidthLegacy. It will not hurt startup performance since the lookup tables are created on demand.

@Tyriar
Copy link
Member

Tyriar commented Sep 25, 2018

It gets really interesting for remote sessions with ssh, there is literally no way to tell which unicode version the backend runs.

@jerch if a separate pty is launched I suppose you could glean some information on the character width by printing y characters to a x wide pty and seeing when a \r comes back? Not sure if this is useful 😄

@Tyriar We gonna need at least two other API options for this - one to request an unicode version (which loads to correct rule set) and another to set the current locale setting for the ambiguous chars.
For the unicode versions we might get away with supporting only two, the old one and v11. (Older OSX versions still might show problems with emojis, imho they adopted a drafted spec in between with different definitions).

@jerch what would these options look like in the .d.ts file? My concern with your related PR (#1470) that ended up getting stale was that these APIs felt like they wouldn't ever be used and were there for completeness. What are the implications of shipping the latest version of unicode?

@dnfield
Copy link
Author

dnfield commented Sep 25, 2018

Redid the exports as suggested.

@jerch
Copy link
Member

jerch commented Sep 25, 2018

if a separate pty is launched I suppose you could glean some information on the character width by printing y characters to a x wide pty and seeing when a \r comes back? Not sure if this is useful 😄

Yepp this should work with the pty itself and we should get the right version for the system that is running the pty. We could basically test this on the native pty without any program attached yet or with a small helper preattached (more reliable). It might need a small change in node-pty, not sure yet how to integrate. Imho we cannot do that without changes to node-pty since it attaches the requested program right away and this test string would end up as stdin input.
Not sure yet how this could work with ssh though.

what would these options look like in the .d.ts file? My concern with your related PR (#1470) that ended up getting stale was that these APIs felt like they wouldn't ever be used and were there for completeness. What are the implications of shipping the latest version of unicode?

Well I hope that we dont have to expose the unicode version thing and can get away with the trick above. If we have to expose it this option would be a list with supported versions e.g. ['5.2', '11'].

The locale setting is needed to get the ambiguous chars working correctly. It is not part of this PR (which is good to keep it small). This option would be basically some language code (maybe ISO 639-1) that we have to grab to resolve the ambiguous chars to the actual terminal width. Since nodejs has no default notion of locales (icu or locales are not bound to nodejs by default), we will have to deploy those maps ourselves, for different unicode versions. Thats the main reason I stopped working on #1470, it will need a good abstraction to get the unicode awkwardities and big data sets handled without busting xterm.js' code base (remember the graphemes PR #1478, the data part really scared me).

Implications of shipping unicode v11:
The summary of changes are listed here https://en.wikipedia.org/wiki/Unicode#Versions
The major shift of emojis from half to fullwidth happened with UAX #11 (www.unicode.org/reports/tr11/tr11-31.html) which was first included in v9 in 2016. From v9 to v11 they added tons of new emojis and CJK chars. Regarding the wcwidth rules imho it is safe to deliver v11 as default and fall back to legacy if one of the older emojis reports a width of 1 from the system.

@Tyriar Will this width detection thing work with win-pty? How does it deal with different widths and the line ending?

@jerch
Copy link
Member

jerch commented Sep 25, 2018

@dnfield Thx alot, LGTM 👍
As you might have guessed we still need to decide how to integrate this in a neat way without breaking things.

@jerch
Copy link
Member

jerch commented Sep 25, 2018

@Tyriar About getting the default unicode settings, it might be sufficient to call icuinfo to grab the default settings for the pty machine:

$> icuinfo
 <icuSystemParams type="icu4c">
    <param name="copyright"> Copyright (C) 2013, International Business Machines Corporation and others. All Rights Reserved. </param>
    <param name="product">icu4c</param>
    <param name="product.full">International Components for Unicode for C/C++</param>
    <param name="version">52.1</param>
    <param name="version.unicode">6.3</param>
    <param name="platform.number">4000</param>
    <param name="platform.type">Linux</param>
    <param name="locale.default">de_DE</param>
    <param name="locale.default.bcp47">de-DE</param>
    <param name="converter.default">UTF-8</param>
    <param name="icudata.name">icudt52l</param>
    <param name="icudata.path"></param>
    <param name="cldr.version">24.0</param>
    <param name="tz.version">2013g</param>
    <param name="tz.default">Europe/Berlin</param>
    <param name="cpu.bits">64</param>
    <param name="cpu.big_endian">0</param>
    <param name="os.wchar_width">4</param>
    <param name="os.charset_family">0</param>
    <param name="os.host">x86_64-pc-linux-gnu</param>
    <param name="build.build">x86_64-pc-linux-gnu</param>
    <param name="build.cc">gcc</param>
    <param name="build.cxx">g++</param>
    <param name="uconfig.internal_digitlist">1</param>
    <param name="uconfig.have_parseallinput">1</param>
    <param name="uconfig.format_fastpaths_49">1</param>
 </icuSystemParams>

It still would not be able track the unicode version of a certain program that has been linked against a non standard unicode version, but tbh - thats the problem of that program.
Also I think we should not try to mess with ssh sessions. ssh itself can forward or receive at least the locale settings by using SendEnv and AcceptEnv, I did not found anything like that for icu/unicode settings. Once again I think its not our business - its an ssh config/capabilities problem.

@egmontkob Do we need to introduce an escape sequence, that gives programs the ability to report the unicode version/locale settings it emits to the terminal? This could also be used by ssh to change the emulator to the remote style. No more messed up emojis. Intriguing idea 😸

@egmontkob
Copy link

Nice idea! I don't think we'd implement something like this in VTE (currently we don't use an API that would provide us with different Unicode versions). I guess iTerm2 @gnachman might be more interested in this idea, it already has some config option to pick the Unicode version.

I'm not sure what escape sequence would be the best, but I recommend using the Unicode major version directly (i.e. no mapping like 1 = Unicode 8, 2 = Unicode 9 etc., but rather values 8 and 9 for them) (not sure if there's any chance for a width change in minor Unicode versions though), I recommend to specify that the emulator is supposed to pick the closest it supports (that's the one that likely leads to the smallest mismatch, e.g. if an app asks for Unicode 12 but the emulator supports 10 and 11 then pick 11). Or / in addition to this maybe add an asynchronous escape sequence where the emulator responds with a list of versions it supports. Also, a special value like 0 to return to the emulator's default.

On another note, take a look at the yellow paragraph added to Unicode 11.0's tr11, section 2 here. I'm not sure what it'll mean in practice. Looks to me that they release themselves from the responsibility of maintaining wcwidth and make it the realm of terminal emulators and apps, without a strict relation to the (various versions of the) Unicode database. Maybe something you should keep in mind while developing this extension, although I have no idea how. E.g. VTE has a config option (API) to toggle whether "ambiguous width" characters are narrow or wide, maybe this should fall to this category, and somehow handled under the same escape sequence? As far as I understand, this new note added to tr11 allows for further deviations along these lines. I guess we're all yet to see what it will actually mean.

@jerch
Copy link
Member

jerch commented Sep 25, 2018

rant
Oh dear, so lets just ignore that property and draw everything not legacy east asian halfwidth. Problem solved. Oh wait, every POSIX system out there has already adopted this for their libc wcwidth implementation. And we need font glyphs to draw something to the screen. Oops they also use this property? Blame them all...
They are stating "...this property was never intended to be used by modern terminal emulators..." - true (and clear since the first day of wcwidth), if they like it or not, it gets used for this. What is a consortium, that cares all day long for characters, language symbols and their relations to each other, good for? We specced here something but please dont use it for your stuff. Comon guys...
/rant

@egmontkob
Yeah they kinda feel not responsible, but the reality is different, otherwise we would not have this PR here lol. Not sure yet what to make out of this, with no reliable spec backing up the behavior it will just get messier. Exp. for a tool like xterm.js that should be easily portable without hard OS coupling. Hmm.
Anyway thx for your thoughts and insights.

@egmontkob
Copy link

@jerch I share your frustration here. What I totally don't get: If not for terminal emulators then what the hack is the width property for in Unicode??

As for wcwidth, see vteemoji too for another case where Unicode messes with us big time.

@gnachman
Copy link

See https://iterm2.com/documentation-escape-codes.html, search for Unicode Version. I guess I need to add support for new emoji, too. I'd be glad to have a more standard way to define unicode version. I like the stack-based approach I used for this one because it recovers reasonably well from an ssh session terminating.

@jerch
Copy link
Member

jerch commented Sep 26, 2018

@gnachman Thx, well you already have such an escape sequence, maybe we can adopt it to some degree. Can you elaborate on how this stack is supposed to work? Does this get reset automagically after a crashing ssh session or will the user have to call it explicitly?

As for wcwidth, see vteemoji too for another case where Unicode messes with us big time.

@egmontkob
Yupp thats the reason behind this PR any many other open issues. Also I was trying to cover the enhanced emoji capabilities (joiner, colored faces, flags etc.) with the introduction of grapheme support (#1478) but again ended at the that "doh we need some unicode version addressing feature first" since the grapheme rules made some big changes with the last releases. Now this idea/move to rely on certain unicode versions seams questionable.

@gnachman
Copy link

The idea is you'd write a wrapper around ssh that does a push and pop. If ssh dies, the pop still happens. Since each stack entry has a unique identifier, a pop can remove any entries added while you were ssh'ed. So:

ssh.sh:

#!/bin/bash
IDENTIFIER="$RANDOM$RANDOM"
it2universion push "$IDENTIFIER"
it2universion set 8
ssh $*
it2universion pop "$IDENTIFIER"

@jerch
Copy link
Member

jerch commented Sep 26, 2018

@gnachman Thats a pretty nice feature, very useful. It requires some config work by the users but tbh, this is already a more advanced ssh usage so the audiance might already be aware of those problems. And the script/alias could easily be tweaked into a version, where the unicode settings looks like an argument to ssh. Nice.
How do you deal with ambiguous chars in wcwidth, do you have something similar to allow changing the locale settings too? Is it even worth to bother with? (I think we have like 2 open issues about those chars not behaving correctly.)

@Tyriar
Copy link
Member

Tyriar commented Sep 26, 2018

Well I hope that we dont have to expose the unicode version thing and can get away with the trick above. If we have to expose it this option would be a list with supported versions e.g. ['5.2', '11'].

@jerch the trick above would be outside of xterm.js though as it depends on node access.

Will this width detection thing work with win-pty? How does it deal with different widths and the line ending?

👍 in winpty:

image

it might be sufficient to call icuinfo to grab the default settings for the pty machine:

If we could query and set that would be nice, this command doesn't work on macOS though.


Wow this is a mess, would it make sense to just have a setting like this which defaults to 9?

interface ITerminalOptions {
  unicodeVersion: 8 | 9; // or 11?
}

I don't know if we'd want to use the proprietary escape sequences route, it would be nice if we just behaved correctly most of the time (which I think is wide emoji as this point).

@mofux
Copy link
Contributor

mofux commented Sep 26, 2018

IMO we should definitely support setting the unicode version via a config (not sure if it can also work at runtime, but that would be best) - and then unicode detection can be handled by another module or addon (e.g. one that detects the iterm2 sequences), or even manually (for example, one could add a widget to the VSCode statusbar or add an action that allows to change the unicode version if the user detects weird behaviour).

@jerch
Copy link
Member

jerch commented Sep 26, 2018

@mofux: Doing this at runtime should be no problem, what not will work is changing data that already landed in the terminal buffer (after it is in the buffer we lost track how it got in there, so no chance to revert this).

@Tyriar: Yes the option could look like that. Im still unsure about the autodetection, it would need some "UnicodeGuesser" that acts on a lower level and yeah - outside of xterm.js. But once such an ominous unicode oracle got the version, xterm.js could set it correctly.

The escape sequence offers much more power to people, that often face such a problem as shown by the stack implementation of @gnachman. Imho it will not hurt anyone else (they can just ignore it) and it is really easy to implement, once we settled something workable in #1709.

@Tyriar
Copy link
Member

Tyriar commented Sep 26, 2018

and then unicode detection can be handled by another module or addon (e.g. one that detects the iterm2 sequences)

Love this idea, we expose a setting and we want to eventually make it possible to support custom escape sequences 👍


How much would supporting the 2 unicode versions add to package weight? If it's not too bad then we should just add a setting and let consumers deal with it for now. This would be a much better place than where we're at now, VS Code for example could default to the latter version and expose a setting to tweak if you want.

@jerch
Copy link
Member

jerch commented Sep 27, 2018

@Tyriar Currently we only would need an unicode aware wcwidth implementation, it basically adds the new table from @dnfield to the package if we can go with 2 versions. The tables could even be packed together, but this makes the code less readable.
If we need more versions up to 11, well it would add number of supported versions * average table size. I would not go for this in the first place, but if it ends up in tons of issue reports we might have to. Thats the point where future specific unicode version addons might come handy. (Also if we decide to support other unicode features the data part will explode, but thats a different story.)

@Tyriar
Copy link
Member

Tyriar commented Sep 27, 2018

I would not go for this in the first place, but if it ends up in tons of issue reports we might have to.

I suspect the main issue that people care about is the width of emoji, that's all I ever see people reporting.

@gnachman
Copy link

How do you deal with ambiguous chars in wcwidth, do you have something similar to allow changing the locale settings too? Is it even worth to bother with? (I think we have like 2 open issues about those chars not behaving correctly.)

I have a setting for this but it causes more trouble than it's worth. It turns out ambiguous characters should usually be narrow. Some people in east asian locales might prefer them wide, but most apps (vim, etc.) are not aware of this and render improperly.

The main source of problems is bash. It cares a lot about how wide your prompt is, and does horrible things if it guesses wrong.

Wow this is a mess

You got that right.

@jerch
Copy link
Member

jerch commented Sep 28, 2018

@gnachman Sorry for so many questions, your experience and insights will definitely help us to come up with something that will get the job done for most ppl out there. Thx alot.

To not just chitchat and whine about the unicode mess (the latter goes on my account) I suggest to move the discussion over to #1709. Lets get the ball rolling.

@jerch
Copy link
Member

jerch commented Oct 26, 2018

Closing this PR in favor of #1714. Many thx to all for the discussion and valuable background infos, esp. to @dnfield for starting it and providing the initial data and impl.

@jerch jerch closed this Oct 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problem with emojis/unicode (assumed double-width?)
6 participants