-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
readline: use icu based string width calculation #9040
Conversation
Technically semver-minor because it allows now a single codepoint to be passed in as the argument to |
This change would make intl and non-intl builds behave differently, wouldn't it? I think I'd rather have a single semi-broken algorithm than two diverging implementations. As well, the non-intl path will be untested until nodejs/build#419 is resolved. |
Intl and non-intl builds already behave differently in many ways. This On Wednesday, October 12, 2016, Ben Noordhuis notifications@github.com
|
Oh? In what way?
"Everyone is doing it" is never a valid argument. V8 is not under our direct control but this is. |
Off the top of my head:
There's likely more. It's also not just about a more accurate implementation, performance is also significantly improved with the new algorithm. this PR:
vs. latest v6:
(the benchmark test benchmark/misc/stringwidth.js is not part of this PR currently, it's local on my machine) |
/** | ||
* Tries to remove all VT control characters. Use to estimate displayed | ||
* string width. May be buggy due to not running a real state machine | ||
*/ | ||
function stripVTControlCharacters(str) { | ||
str = str.replace(new RegExp(functionKeyCodeReAnywhere.source, 'g'), ''); | ||
return str.replace(new RegExp(metaKeyCodeReAnywhere.source, 'g'), ''); | ||
return str.replace(ansi, ''); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't need /g
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it uses /g
.
|
||
|
||
module.exports = { | ||
emitKeys, | ||
getStringWidth, | ||
isFullWidthCodePoint, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should initially assign no-ops with comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how come? that does not seem very practical.
stripVTControlCharacters | ||
}; | ||
|
||
if (process.binding('config').hasIntl) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not detectable from process.config
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using process.config
is unreliable because there are userland packages that override it. That's why process.binding('config')
was introduced.
return icu.getStringWidth(stripVTControlCharacters(str)); | ||
}; | ||
module.exports.isFullWidthCodePoint = function isFullWidthCodePoint(code) { | ||
if (typeof code !== 'number') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this remain isNaN()
... or rather, Number.isNaN()
?
// if the key.sequence is half of a surrogate pair, | ||
// refresh the line so the character is displayed appropriately. | ||
const ch = key.sequence.codePointAt(0); | ||
if (ch >= 0xd800 && ch <= 0xdfff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment about these values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the existing comment on line 128 and 129 not enough?
c133999
to
83c7a88
Compare
cc56bd3
to
5e5522c
Compare
New CI: https://ci.nodejs.org/job/node-test-pull-request/4611/ |
5e5522c
to
bf886f8
Compare
Only unrelated flaky failures in CI. @nodejs/collaborators would appreciate additional review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I still don't think this should land until we have a proper non-intl buildbot. So far, almost every intl-related change broke the non-intl build in one way or another.
// Adopted from https://github.com/chalk/ansi-regex/blob/master/index.js | ||
// License: MIT, authors: @sindresorhus, Qix-, and arjunmehta | ||
const ansi = | ||
/[\u001b\u009b][[()#;?]*(?:[0-9]{1,4}(?:;[0-9]{0,4})*)?[0-9A-ORZcf-nqry=><]/g; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This regex could use a comment explaining what it tries to match/capture. Also, four space indent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
four space indent would make the line longer than 80 chars, and splitting the regex into multiple lines would make it less readable. I'd prefer to leave the spacing as is.
// newer wide characters. wcwidth, on the other hand, uses a fixed | ||
// algorithm that does not take things like emoji into proper | ||
// consideration. | ||
static int GetColumnWidth(UChar32 codepoint, bool ambiguousFull = false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: s/ambiguousFull/ambiguous_full/ here and elsewhere. I don't like the name very much, it doesn't really convey what it does.
// algorithm that does not take things like emoji into proper | ||
// consideration. | ||
static int GetColumnWidth(UChar32 codepoint, bool ambiguousFull = false) { | ||
const int eaw = u_getIntPropertyValue(codepoint, UCHAR_EAST_ASIAN_WIDTH); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why UCHAR_EAST_ASIAN_WIDTH? This needs an explaining comment.
return 2; | ||
case U_EA_AMBIGUOUS: | ||
if (ambiguousFull) | ||
return 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the fall-through is intentional, can you add // fallthrough
comments?
if (u_iscntrl(codepoint) || | ||
u_hasBinaryProperty(codepoint, UCHAR_EMOJI_MODIFIER) || | ||
u_getCombiningClass(codepoint) > 0) | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put braces around the body, slightly easier to read.
} | ||
|
||
bool ambiguousFull = args[1]->BooleanValue(); | ||
bool expandEmojiSequence = args[2]->BooleanValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: snake_case
bool expandEmojiSequence = args[2]->BooleanValue(); | ||
|
||
TwoByteValue value(env->isolate(), args[0].As<String>()); | ||
UChar* str = reinterpret_cast<UChar*>(*value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reinterpret_cast shouldn't be necessary, should it?
n > 0 && p == 0x200d && // 0x200d == emoji sequence continuation | ||
(u_hasBinaryProperty(c, UCHAR_EMOJI_PRESENTATION) || | ||
u_hasBinaryProperty(c, UCHAR_EMOJI_MODIFIER))) | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces?
Updated to address the nits. |
@bnoordhuis ... added a nointl CI job we can run manually: https://ci.nodejs.org/job/node-test-commit-linux-nointl/1/ |
@jasnell: I don't think it should be added to the current job as-is. There's way too many workers involved. Do we need to test this on every linux flavor? I was hoping to have a small subset of workers for build permutation tests. |
Updated. Needs another review from @bnoordhuis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a comment and a style nit.
if (args[0]->IsNumber()) { | ||
args.GetReturnValue().Set( | ||
GetColumnWidth(args[0]->Uint32Value(), | ||
ambiguous_as_full_width)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you line up the argument?
return; | ||
} | ||
|
||
TwoByteValue value(env->isolate(), args[0].As<String>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TwoByteValue's constructor takes a Local<Value>
. It's arguably wrong to cast because there is no check that it's really a string so I'd just let the constructor deal with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, with a couple of minor comments
// The expand_emoji_sequence option allows the caller to skip this | ||
// check and count each code within an emoji sequence separately. | ||
if (!expand_emoji_sequence && | ||
n > 0 && p == 0x200d && // 0x200d == emoji sequence continuation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
U+200D
is a ZWJ
(zero width joiner), please use that term
// in advance if a particular sequence is going to be supported. | ||
// The expand_emoji_sequence option allows the caller to skip this | ||
// check and count each code within an emoji sequence separately. | ||
if (!expand_emoji_sequence && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a reasoable way of doing this calculation
return 2; | ||
case U_EA_AMBIGUOUS: | ||
if (ambiguous_as_full_width) { | ||
return 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth referencing something like http://www.unicode.org/reports/tr11/#Ambiguous here
5dcbf74
to
924eef3
Compare
Updated with the final nits addressed. New CI: https://ci.nodejs.org/job/node-test-pull-request/4655/ |
924eef3
to
518c711
Compare
had a compile nit pop up on windows... trying again https://ci.nodejs.org/job/node-test-pull-request/4664/ |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: nodejs#8075
518c711
to
876bb83
Compare
CI is green except for unrelated flaky failures. Landing |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>
Landed in 72547fe |
Shrinking it down ought to be fine. I'll do that this next week On Sunday, October 23, 2016, Johan Bergström notifications@github.com
|
@jasnell g++ spotted a bug:
I think you need a U16_NEXT call outside the loop. |
Hmm.. Ok, away from the laptop at the moment but will look at that shortly |
@bnoordhuis good catch - otherwise A |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>
/cc @srl295 currently passing on this being backported to v6.x. Do you think it should be considered? |
I'm not @srl295, of course, but I'll chime in to say that I see no pressing reason to backport this |
no pressing reason |
Checklist
make -j8 test
(UNIX), orvcbuild test nosign
(Windows) passesAffected core subsystem(s)
readline, internal
Description of change
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster.
This was originally part of a proposal to add a new unicode module, but has been split out.
Refs: #8075