-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of Reline::Unicode.get_mbchar_width
#632
Conversation
lib/reline/unicode.rb
Outdated
chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o } | ||
size = Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chunk_index
can be nil
and then line 67 will fail!
chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o } | |
size = Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index] | |
chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o } | |
size = chunk_index ? Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index] : 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the feedback. Can you provide an input character that chunk_index will be nil?
I think utf8_mbchar.ord
is always less than or equal to 0x10ffff
(maximum codepoint of unicode)
The last two values of CHUNK_LAST are: 0x10fffd
and 0x7fffffff
.
The max value of 32bit signed integer 0x7fffffff
is added at the end of CHUNK_LAST to prevent bsearch_index to return nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh - I missed the additional integer at the end!
I only saw the comment about how the table was generated and missed the "non-unicode" addition…
Thanks for clarification!
Sorry for another question: Am I right that country flags (like '🇵🇹') are not supported? |
In my opinion, Reline supports calculating flag width correctly.
But flag rendering fails in many terminal emulators with various reasons. Mac Terminal.appRenders with half-width visually but calculates column as full-width internally. Can't support a bug. iTerm2Flag width can be changed by experimental configuration Alacritty, VSCode TerminalDoes not support rendering flags. "🇵🇹" is rendered as "P"+"T" |
IMHO, emojis of flags should be treated as double-width characters if they are combined in the same way as half-width katakana characters.
However, this does not seem to be well documented in UAX#11 and is handled differently by different implementations. I think there is still room for further discussion. So, I think it would be better to discuss the flag width as a separate Issue from this PR. |
Ok. thanks for your answer! Let's have this PR merged asap - which will help a lot to make things better. And maybe we handle the issue with some special chars which are different handled in terminals later… Thanks! |
@ima1zumi do you have time to give this a look? I think I don't have enough background knowledge to give a deep review. |
@tompng Could you rebase it? |
Rebase done |
f.each_line do |line| | ||
next unless m = line.match(/^(\h+)(?:\.\.(\h+))?\s*;\s*(\w+)\s+#.+/) | ||
next unless /^(?<first>\h+)(?:\.\.(?<last>\h+))?\s*;\s*(?<type>\w+)\s+# +(?<category>[^ ]+)/ =~ line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
📝
0021..0023 ; Na # Po [3] EXCLAMATION MARK..NUMBER SIGN
^first ^last ^type ^category
https://www.unicode.org/Public/15.1.0/ucd/EastAsianWidth.txt
|
||
chunks = widths.each_with_index.chunk { |width, _idx| width || 1 } | ||
chunk_last_ords = chunks.map { |width, chunk| [chunk.last.last, width] } | ||
chunk_last_ords << [0x7fffffff, 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think chunk_last_ords
should be up to 0x10FFFFF
.
In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF
2.4 Code Points and Characters
https://www.unicode.org/versions/Unicode15.1.0/ch02.pdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this value(max value of int32) as an alternative of infinity. This way, we don't need to test 0x10ffff
not to raise no method error
.
I think it is not so obvious that chunk_index = EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o }
does not return nil when CHUNK_LAST.last == 0x10FFFF
and ord == 0x10FFFF
.
Of course we can change it to 0x10FFFF
and add a boundary value test. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I agree that using 0x7fffffff
is good, just as you suggested.
@@ -5,6 +5,18 @@ if ARGV.empty? | |||
exit 1 | |||
end | |||
|
|||
def unicode_width(type, category) | |||
return 0 if category == 'Mn' # Nonspacing Mark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason why the Nonspacing Mark should be set to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old implementation, Mark(Mn (nonspacing mark) + Mc (spacing combining mark) + Me (enclosing mark)
) are all set to 0.
MBCharWidthRE = /
...
| (?<width_0>^\p{M})
...
/
I think this is just a simple mistake of \p{Mn}
because Mark is not always zero width.
I think there are two choices and I choose the latter one.
# Choice 1, Don't change the behavior now. Leave it, or fix it in another pull request
return 0 if category =~ '/Mn|Mc|Me/'
# Choice 2, Fix it to nonspacing
return 0 if category == 'Mn'
Actual width calculated by the script below are:
Terminal emulator | Mn | Other marks(Mc, Me) |
---|---|---|
Terminal.app | 0 | >=1 |
Alacritty | 0 | mostly(97%) >=1, some(3%) are 0 |
iTerm, VSCode Terminal | 1 | >= 1 |
mn = (0..0x10ffff).filter_map{_1.chr('utf-8') rescue nil}.grep(/\p{Mn}/)
mn.map{
print "\e[H"+_1+"\e[6n"
STDIN.raw{STDIN.readpartial 10}
}.tally
Mn is not zero-width in some terminal emulators, but I prefer not changing the original intention in this pull request except bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your explanation. That makes sense to me now.
bin/generate_east_asian_width
Outdated
@@ -13,66 +25,32 @@ open(ARGV.first, 'rt') do |f| | |||
unicode_version = nil | |||
end | |||
|
|||
list = [] | |||
widths = [] | |||
letter_modifiers = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used.
letter_modifiers = [] |
elsif size == 1 && utf8_mbchar.size >= 2 | ||
second_char_ord = utf8_mbchar[1].ord | ||
# Halfwidth Dakuten Handakuten | ||
# Only these two character has Letter Modifier category and can be combined in a single grapheme cluster | ||
(second_char_ord == 0xFF9E || second_char_ord == 0xFF9F) ? 2 : 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
📝 This is a special case specifically for when there is a character with width 1 before U+FF9E or U+FF9F. In other words, it allows the width to be calculated correctly for something like ‘ガ’. However, it cannot calculate the width correctly for ‘が’.
There are many other combining characters, and I’m not sure why this character alone is being handled, but there doesn’t seem to be a particular reason to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -5,6 +5,18 @@ if ARGV.empty? | |||
exit 1 | |||
end | |||
|
|||
def unicode_width(type, category) | |||
return 0 if category == 'Mn' # Nonspacing Mark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your explanation. That makes sense to me now.
|
||
chunks = widths.each_with_index.chunk { |width, _idx| width || 1 } | ||
chunk_last_ords = chunks.map { |width, chunk| [chunk.last.last, width] } | ||
chunk_last_ords << [0x7fffffff, 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I agree that using 0x7fffffff
is good, just as you suggested.
Performance and Regression check
Tested with this benchmark
Result with yjit
Result without yjit
Implementation
Uses
bsearch_index
. Time complexity of bsearch_index is O(log(N)), N is total count of unicode characters.We can also choose less-memory O(1) lookup (shallow tree with bignum, https://gist.github.com/tompng/6be795d487e1a0105ada41e24f9528c4) but the generated file will be unreadable.
I think this bsearch_index is a good choice because:
unicode.rb
code simplicity, and readability of generatedeast_asian_width.rb
are goodBug fixes
Fixed these two type of chars. these are excluded from the performance/regression benchmark.
Nonspacing Mark
Reline returned 0 for
/\p{M}/
(Mark). I think it was a mistake of/\p{Mn}/
(Nonspacing Mark).Three Em Dash
Reline returned 3 for three em dash
"\u2e3b"
.Reline returned 1 for two em dash
"\u2e3a"
.It's defined as N(Neutral) and shuold be 1. Terminal.app, VSCode, iTerm, Alacrytty uses width=1 (but overflows because font is very wide)