Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readline: fix character width calculation #13918

Closed
wants to merge 2 commits into from

Conversation

TimothyGu
Copy link
Member

@TimothyGu TimothyGu commented Jun 26, 2017

Fixes width calculation of non-spacing marks, commonly seen in Unicode Normalization Form D. Example: 'a\u0301' ('á'.normalize('NFD')), 'ру́сский язы́к' (Unicode doesn't have many precomposed accented Cyrillic letters).

Outdated information

Not sure where to add tests for this feature though. readline has the following tests:

  • test-readline-csi.js
  • test-readline-emit-keypress-events.js
  • test-readline-interface.js
  • test-readline.js
  • test-readline-keys.js
  • test-readline-reopen.js
  • test-readline-set-raw-mode.js
  • test-readline-undefined-columns.js
  • test-icu-stringwidth.js

None of them seem to fit this bug, which is the glue between getStringWidth and readline, Hence the WIP.


The second commit changes how widths of certain characters are determined:

  • Categorize all nonspacing marks (Mn) and enclosing marks (Me) as 0-width
  • Categorize all spacing marks (Mc) as non-0-width.
  • Do not treat all unassigned code points as 0-width; instead, let ICU select the default for that block.

These decisions are made, partially by following the behavior of GNOME Terminal. Testing on other terminals is of course welcome.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

readline

@nodejs-github-bot nodejs-github-bot added the readline Issues and PRs related to the built-in readline module. label Jun 26, 2017
@cjihrig
Copy link
Contributor

cjihrig commented Jun 26, 2017

None of them seem to fit this bug, which is the glue between getStringWidth and readline, Hence the WIP.

Why not just create a new test?

src/node_i18n.cc Outdated
@@ -603,9 +603,10 @@ static void ToASCII(const FunctionCallbackInfo<Value>& args) {
// consideration.
static int GetColumnWidth(UChar32 codepoint,
bool ambiguous_as_full_width = false) {
if (!u_isdefined(codepoint) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the u_isdefined() check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My terminal (GNOME Terminal) actually displays unassigned characters (in a weird box form) so they are more than 0-width. Indeed, UAX #11 specifies that

Unassigned code points in ranges intended for CJK ideographs are classified as Wide.

while

All other unassigned code points are by default classified as Neutral.

I removed the u_isdefined() check here so that these unassigned characters can use the generic ICU routine below, which does the right thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspected that was the case. Ok :)

Copy link
Member

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice... thank you!

@TimothyGu
Copy link
Member Author

@cjihrig

Why not just create a new test?

Err... why did I not think of that...

@TimothyGu TimothyGu changed the title WIP readline: properly handle 0-width characters readline: fix character width calculation Jun 27, 2017
@TimothyGu
Copy link
Member Author

Tests for 0-width characters are added too.

CI: https://ci.nodejs.org/job/node-test-pull-request/8836/

@Trott
Copy link
Member

Trott commented Jun 27, 2017

Stopped the Raspberry Pi devices in Ci because they had been running for 2.5 hours and had no new console messages for over an hour. Not sure if this will self-correct or if we need Build WG intervention....

@Trott
Copy link
Member

Trott commented Jun 27, 2017

Not looking like the Pi issue is self-correcting. Will have to either wait for the issue to get resolved before proceeding, or decide that this can land without the Pi run.
¯\(ツ)

@TimothyGu
Copy link
Member Author

Will have to either wait for the issue to get resolved before proceeding, or decide that this can land without the Pi run.

Given that the other machines are universally green (except for macOS's fickle nature), nor does this PR have any architecture-specific components, I'd say this can land w/o confirmation from RPi.

- Categorize all nonspacing marks (Mn) and enclosing marks (Me) as
  0-width
- Categorize all spacing marks (Mc) as non-0-width.
- Treat soft hyphens (a format character Cf) as non-0-width.
- Do not treat all unassigned code points as 0-width; instead, let ICU
  select the default for that character per UAX nodejs#11.
- Avoid getting the General_Category of a character multiple times as it
  is an intensive operation.

Refs: http://unicode.org/reports/tr11/
@jasnell
Copy link
Member

jasnell commented Jun 29, 2017

jasnell pushed a commit that referenced this pull request Jun 29, 2017
PR-URL: #13918
Reviewed-By: James M Snell <jasnell@gmail.com>
jasnell pushed a commit that referenced this pull request Jun 29, 2017
- Categorize all nonspacing marks (Mn) and enclosing marks (Me) as
  0-width
- Categorize all spacing marks (Mc) as non-0-width.
- Treat soft hyphens (a format character Cf) as non-0-width.
- Do not treat all unassigned code points as 0-width; instead, let ICU
  select the default for that character per UAX #11.
- Avoid getting the General_Category of a character multiple times as it
  is an intensive operation.

Refs: http://unicode.org/reports/tr11/
PR-URL: #13918
Reviewed-By: James M Snell <jasnell@gmail.com>
@jasnell
Copy link
Member

jasnell commented Jun 29, 2017

Landed in 01aeb38 and f4b5b70

@jasnell jasnell closed this Jun 29, 2017
addaleax pushed a commit that referenced this pull request Jun 29, 2017
PR-URL: #13918
Reviewed-By: James M Snell <jasnell@gmail.com>
addaleax pushed a commit that referenced this pull request Jun 29, 2017
- Categorize all nonspacing marks (Mn) and enclosing marks (Me) as
  0-width
- Categorize all spacing marks (Mc) as non-0-width.
- Treat soft hyphens (a format character Cf) as non-0-width.
- Do not treat all unassigned code points as 0-width; instead, let ICU
  select the default for that character per UAX #11.
- Avoid getting the General_Category of a character multiple times as it
  is an intensive operation.

Refs: http://unicode.org/reports/tr11/
PR-URL: #13918
Reviewed-By: James M Snell <jasnell@gmail.com>
@addaleax addaleax mentioned this pull request Jun 29, 2017
@TimothyGu TimothyGu deleted the readline-nsm branch June 30, 2017 06:23
addaleax pushed a commit that referenced this pull request Jul 11, 2017
PR-URL: #13918
Reviewed-By: James M Snell <jasnell@gmail.com>
addaleax pushed a commit that referenced this pull request Jul 11, 2017
- Categorize all nonspacing marks (Mn) and enclosing marks (Me) as
  0-width
- Categorize all spacing marks (Mc) as non-0-width.
- Treat soft hyphens (a format character Cf) as non-0-width.
- Do not treat all unassigned code points as 0-width; instead, let ICU
  select the default for that character per UAX #11.
- Avoid getting the General_Category of a character multiple times as it
  is an intensive operation.

Refs: http://unicode.org/reports/tr11/
PR-URL: #13918
Reviewed-By: James M Snell <jasnell@gmail.com>
addaleax pushed a commit that referenced this pull request Jul 18, 2017
PR-URL: #13918
Reviewed-By: James M Snell <jasnell@gmail.com>
addaleax pushed a commit that referenced this pull request Jul 18, 2017
- Categorize all nonspacing marks (Mn) and enclosing marks (Me) as
  0-width
- Categorize all spacing marks (Mc) as non-0-width.
- Treat soft hyphens (a format character Cf) as non-0-width.
- Do not treat all unassigned code points as 0-width; instead, let ICU
  select the default for that character per UAX #11.
- Avoid getting the General_Category of a character multiple times as it
  is an intensive operation.

Refs: http://unicode.org/reports/tr11/
PR-URL: #13918
Reviewed-By: James M Snell <jasnell@gmail.com>
@MylesBorins
Copy link
Contributor

Should this be backported to v6.x-staging? If yes please follow the guide and raise a backport PR, if no let me know or add the dont-land-on label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
readline Issues and PRs related to the built-in readline module.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants