Improve performance of `Reline::Unicode.get_mbchar_width` #632

tompng · 2024-01-04T15:29:09Z

Performance and Regression check

Tested with this benchmark

# Ignore some characters whose width Reline was calculating incorrectly
chars = (0..0x10ffff).filter_map{_1.chr('utf-8') rescue nil}.reject { _1 =~ /\p{M}/ && _1 !~ /\p{Mn}/ } - ["\u2e3b"];
measure
def Reline.ambiguous_width = 3 # Set to any number except 0, 1, 2 to make the checksum work better.
chars.map{Reline::Unicode.get_mbchar_width(_1)*_1.ord}.sum # checksum to ensure no regression
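
Outside IRB, where the `measure` command is not available, a comparable timing-plus-checksum run can be sketched with `Benchmark.realtime`. The `width_of` lambda here is a hypothetical stand-in, not Reline's method. To reproduce the real measurement, it would be replaced by `Reline::Unicode.get_mbchar_width` after `require 'reline'`.

```ruby
require 'benchmark'

# Hypothetical stand-in for Reline::Unicode.get_mbchar_width (an assumption
# for illustration only): ASCII is width 1, everything else is width 2.
width_of = ->(char) { char.ord < 0x80 ? 1 : 2 }

# Every code point that can be encoded as a UTF-8 string
# (code points that raise, such as surrogates, are skipped).
chars = (0..0x10ffff).filter_map { |cp| cp.chr('utf-8') rescue nil }

checksum = nil
elapsed = Benchmark.realtime do
  # Weighting each width by its code point makes the sum sensitive to a
  # width change on any single character.
  checksum = chars.sum { |c| width_of.call(c) * c.ord }
end

puts format('processing time: %.6fs #=> %d', elapsed, checksum)
```

Two implementations that print the same checksum agree on the weighted sum of all widths, which is what the regression check relies on.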

Result with yjit

master branch
processing time: 0.784064s #=> 924024865779
this branch
processing time: 0.404117s #=> 924024865779

Result without yjit

master branch
processing time: 0.900042s #=> 924024865779
this branch
processing time: 0.566599s #=> 924024865779

Implementation

Uses bsearch_index. The time complexity of bsearch_index is O(log N), where N is the number of table entries, which is bounded by the total count of Unicode characters.
We could also choose a less-memory O(1) lookup (shallow tree with bignum, https://gist.github.com/tompng/6be795d487e1a0105ada41e24f9528c4), but the generated file would be unreadable.

I think bsearch_index is a good choice because:

Most chars are ASCII, so multibyte width calculation is not so important
The balance of performance, unicode.rb code simplicity, and readability of the generated east_asian_width.rb is good
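
As a rough illustration of the lookup described above, here is a minimal sketch. The table values are a simplified toy, not the generated east_asian_width.rb, and `toy_width` is a hypothetical name. Only the constant names CHUNK_LAST and CHUNK_WIDTH and the `bsearch_index` call come from this pull request.

```ruby
# Toy table (assumed values for illustration). Each CHUNK_LAST entry is the
# last code point of a run of equal widths; CHUNK_WIDTH holds that run's width.
CHUNK_LAST  = [0x10ff, 0x115f, 0x2e7f, 0xa4cf, 0x7fffffff].freeze
CHUNK_WIDTH = [1,      2,      1,      2,      1].freeze

def toy_width(char)
  ord = char.ord
  # Find-minimum binary search: index of the first run whose last code
  # point is >= ord. The trailing 0x7fffffff guarantees a match.
  index = CHUNK_LAST.bsearch_index { |last| ord <= last }
  CHUNK_WIDTH[index]
end

p toy_width('a')        # falls in the first run
p toy_width("\u3042")   # HIRAGANA LETTER A, falls in the 0x2e80..0xa4cf run
```

Because only run boundaries are stored, the table stays small and human-readable, at the cost of a logarithmic search instead of a direct index.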

Bug fixes

Fixed these two types of chars. They are excluded from the performance/regression benchmark.

Nonspacing Mark

Reline returned 0 for /\p{M}/ (Mark). I think it was a mistake for /\p{Mn}/ (Nonspacing Mark).

# Chars that match /\p{M}/ but not /\p{Mn}/ (465 chars)
marks = (0..0x10ffff).filter_map{''<<_1 rescue nil}.select{_1 =~ /\p{M}/ && _1 !~ /\p{Mn}/}
# Measure actual width in terminal emulator by "\e[6n" (Device Status Report)
marks.count{ $><< "\ra#{_1}b\e[6n";STDIN.raw{STDIN.readpartial(10)[/\e\[\d+;(\d+)R/, 1]}.to_i - 1 == 2 }
# =>
# 0 means /\p{Mn}/ is correct, 465 means /\p{M}/ is correct.
# Terminal.app: 0
# iTerm2: 36
# Alacritty: 13
# VSCode Terminal: 14
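
The one-liner above packs the measurement and the parsing together. A minimal sketch of just the parsing step, which can be run without a terminal, might look like this. The helper names `cursor_column` and `mark_width` are hypothetical.

```ruby
# Extracts the 1-based column from a Cursor Position Report such as "\e[24;3R".
# Returns nil when the response does not contain a report.
def cursor_column(response)
  response[/\e\[\d+;(\d+)R/, 1]&.to_i
end

# The probe prints "\r", then "a", the mark, and "b", and then requests the
# cursor position. The cursor sits one column past the printed text, so the
# printed width is column - 1, and the mark's own width is that minus the
# two columns taken by "a" and "b".
def mark_width(response)
  column = cursor_column(response)
  column && column - 1 - 2
end

p mark_width("\e[5;3R") # cursor at column 3: the mark took no columns
p mark_width("\e[5;4R") # cursor at column 4: the mark took one column
```

The original snippet counts the marks for which the printed width is exactly 2, that is, the marks the terminal renders as zero-width.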

Three Em Dash

Reline returned 3 for three-em dash "\u2e3b".
Reline returned 1 for two-em dash "\u2e3a".
The three-em dash is defined as N (Neutral) and should be 1. Terminal.app, VSCode, iTerm, and Alacritty use width=1 (but the glyph overflows because the font is very wide).

mblumtritt · 2024-07-10T07:06:07Z

lib/reline/unicode.rb

+    chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o }
+    size = Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index]


The chunk_index can be nil and then line 67 will fail!

Suggested change

-    chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o }
-    size = Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index]
+    chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o }
+    size = chunk_index ? Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index] : 1

tompng · 2024-07-10

Thank you for the feedback. Can you provide an input character for which chunk_index will be nil?

I think utf8_mbchar.ord is always less than or equal to 0x10ffff (the maximum Unicode code point).

The last two values of CHUNK_LAST are 0x10fffd and 0x7fffffff.
The maximum value of a 32-bit signed integer, 0x7fffffff, is added at the end of CHUNK_LAST to prevent bsearch_index from returning nil.
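
To make the role of the trailing 0x7fffffff concrete, here is a small sketch with a hypothetical three-entry table (not the real CHUNK_LAST). It compares the search result with and without the extra entry for a code point above the last real boundary.

```ruby
# Hypothetical toy boundaries, for illustration only.
without_sentinel = [0x7f, 0xffff, 0x10fffd]
with_sentinel    = without_sentinel + [0x7fffffff]

ord = 0x10ffff # above 0x10fffd, but still a valid code point

# In find-minimum mode, bsearch_index returns the index of the first element
# for which the block is true, or nil when it is false for every element.
missing = without_sentinel.bsearch_index { |last| ord <= last }
found   = with_sentinel.bsearch_index { |last| ord <= last }

p missing # => nil
p found   # => 3

# A code point equal to a boundary matches that boundary's own index.
p without_sentinel.bsearch_index { |last| 0x10fffd <= last } # => 2
```

With the extra entry in place, the lookup result can be used as an array index without a nil check.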

mblumtritt · 2024-07-10

Oh - I missed the additional integer at the end!

I only saw the comment about how the table was generated and missed the "non-unicode" addition…

Thanks for the clarification!

mblumtritt · 2024-08-16T14:24:57Z

Sorry for another question: Am I right that country flags (like '🇵🇹') are not supported?

tompng · 2024-08-16T15:20:54Z

> Am I right that country flags (like '🇵🇹') are not supported?

In my opinion, Reline calculates flag width correctly.
Flags are defined as N (Neutral, i.e. narrow or half-width), a width=1 character, in Unicode EastAsianWidth.txt.

# EastAsianWidth.txt
1F1E6..1F1FF   ; N  # So    [26] REGIONAL INDICATOR SYMBOL LETTER A..REGIONAL INDICATOR SYMBOL LETTER Z

But flag rendering fails in many terminal emulators for various reasons.

Mac Terminal.app

Renders with half-width visually but calculates column as full-width internally. Can't support a bug.

iTerm2

Flag width can be changed by the experimental configuration "Flag emoji render full-width: yes/no", and the default is full-width.
The width of some half-width emojis (example: ⛴️) is also configurable.
There is no plan to detect this kind of configuration value from Reline.

Alacritty, VSCode Terminal

Does not support rendering flags. "🇵🇹" is rendered as "P"+"T"

elfham · 2024-08-17T03:16:37Z

IMHO, emojis of flags should be treated as double-width characters if they are combined in the same way as half-width katakana characters.

% irb
irb(main):001> Reline::Unicode.get_mbchar_width("ｶ")
=> 1
irb(main):002> Reline::Unicode.get_mbchar_width("ｶﾞ")
=> 2
irb(main):003>

However, this does not seem to be well documented in UAX#11 and is handled differently by different implementations.

I think there is still room for further discussion.

So, I think it would be better to discuss the flag width as a separate Issue from this PR.

mblumtritt · 2024-08-17T14:27:20Z

Ok. thanks for your answer!

Let's have this PR merged asap, which will help a lot to make things better. And maybe we can handle the issue with some special chars that terminals handle differently later…

Thanks!

st0012 · 2024-08-17T14:35:36Z

@ima1zumi do you have time to give this a look? I think I don't have enough background knowledge to give a deep review.

ima1zumi · 2024-08-19T13:57:03Z

@tompng Could you rebase it?

tompng · 2024-08-19T14:29:21Z

Rebase done

ima1zumi · 2024-08-21T14:22:10Z

bin/generate_east_asian_width

  f.each_line do |line|
-    next unless m = line.match(/^(\h+)(?:\.\.(\h+))?\s*;\s*(\w+)\s+#.+/)
+    next unless /^(?<first>\h+)(?:\.\.(?<last>\h+))?\s*;\s*(?<type>\w+)\s+# +(?<category>[^ ]+)/ =~ line


📝

0021..0023     ; Na # Po     [3] EXCLAMATION MARK..NUMBER SIGN
^first^last      ^type^category

https://www.unicode.org/Public/15.1.0/ucd/EastAsianWidth.txt
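
As a quick check of what the new pattern captures, here is a minimal sketch that applies the regex from the diff to sample lines. The `parse_line` helper and the hash it returns are hypothetical. The sketch uses `match` with a constant rather than the `regex =~ line` form from the diff, which assigns local variables only when the regex literal is on the left-hand side.

```ruby
# Pattern copied from the diff above.
LINE_PATTERN = /^(?<first>\h+)(?:\.\.(?<last>\h+))?\s*;\s*(?<type>\w+)\s+# +(?<category>[^ ]+)/

# Returns nil for comment or blank lines, otherwise the parsed fields.
def parse_line(line)
  m = LINE_PATTERN.match(line)
  return nil unless m

  first = m[:first].to_i(16)
  last = (m[:last] || m[:first]).to_i(16) # single code point lines have no "..last"
  { range: first..last, type: m[:type], category: m[:category] }
end

p parse_line('0021..0023     ; Na # Po     [3] EXCLAMATION MARK..NUMBER SIGN')
p parse_line('1F1E6..1F1FF   ; N  # So    [26] REGIONAL INDICATOR SYMBOL LETTER A..REGIONAL INDICATOR SYMBOL LETTER Z')
p parse_line('# a comment line')
```

The `category` capture is the addition here: it is what lets the generator tell Nonspacing Marks (Mn) apart from other characters.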

ima1zumi · 2024-08-26T17:31:02Z

bin/generate_east_asian_width

+
+  chunks = widths.each_with_index.chunk { |width, _idx| width || 1 }
+  chunk_last_ords = chunks.map { |width, chunk| [chunk.last.last, width] }
+  chunk_last_ords << [0x7fffffff, 1]


I think chunk_last_ords should be up to 0x10FFFF.

In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF
2.4 Code Points and Characters
https://www.unicode.org/versions/Unicode15.1.0/ch02.pdf

tompng · 2024-08-26

I added this value (the max value of int32) as an alternative to infinity. This way, we don't need to test that 0x10ffff does not raise NoMethodError.
I think it is not so obvious that chunk_index = EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o } does not return nil when CHUNK_LAST.last == 0x10FFFF and ord == 0x10FFFF.
Of course, we can change it to 0x10FFFF and add a boundary value test. What do you think?

ima1zumi · 2024-08-29T17:33:13Z

Ah, I see. I agree that using 0x7fffffff is good, just as you suggested.
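
For readers unfamiliar with `Enumerable#chunk`, here is a small sketch of what the three quoted generator lines produce. The `widths` array is a made-up eight-entry example, where the array index plays the role of the code point and nil stands for a code point without an entry.

```ruby
# Hypothetical per-code-point widths; index == code point, nil == no entry.
widths = [1, 1, nil, 2, 2, 2, 0, 1]

# Group consecutive code points that share a width (nil counts as 1).
chunks = widths.each_with_index.chunk { |width, _idx| width || 1 }

# Keep only the last code point of each run, paired with the run's width.
chunk_last_ords = chunks.map { |width, chunk| [chunk.last.last, width] }

# Sentinel entry so a lookup always finds a run.
chunk_last_ords << [0x7fffffff, 1]

p chunk_last_ords
# => [[2, 1], [5, 2], [6, 0], [7, 1], [2147483647, 1]]
```

Splitting the pairs with `map(&:first)` and `map(&:last)` would then give two parallel arrays of the kind the lookup code searches and indexes.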

ima1zumi · 2024-08-26T17:34:28Z

bin/generate_east_asian_width

@@ -5,6 +5,18 @@ if ARGV.empty?
  exit 1
 end

+def unicode_width(type, category)
+  return 0 if category == 'Mn' # Nonspacing Mark


Is there any reason why the Nonspacing Mark should be set to 0?

tompng · 2024-08-26

In the old implementation, all Marks (Mn (nonspacing mark) + Mc (spacing combining mark) + Me (enclosing mark)) are set to 0.

MBCharWidthRE = / ... | (?<width_0>^\p{M}) ... /

I think this is just a simple mistake for \p{Mn}, because a Mark is not always zero-width.
I think there are two choices, and I chose the latter.

# Choice 1, Don't change the behavior now. Leave it, or fix it in another pull request
return 0 if category =~ /Mn|Mc|Me/
# Choice 2, Fix it to nonspacing
return 0 if category == 'Mn'

The actual widths measured by the script below are:

| Terminal emulator | Mn | Other marks (Mc, Me) |
|---|---|---|
| Terminal.app | 0 | >=1 |
| Alacritty | 0 | mostly (97%) >=1, some (3%) are 0 |
| iTerm, VSCode Terminal | 1 | >=1 |

mn = (0..0x10ffff).filter_map{_1.chr('utf-8') rescue nil}.grep(/\p{Mn}/)
mn.map{
  print "\e[H"+_1+"\e[6n"
  STDIN.raw{STDIN.readpartial 10}
}.tally

Mn is not zero-width in some terminal emulators, but I prefer not to change the original intent in this pull request, apart from fixing the bug.

ima1zumi · 2024-08-29T17:20:59Z

Thanks for your explanation. That makes sense to me now.
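
The difference between the old `\p{M}` rule and the new `\p{Mn}` rule can be seen on one sample character per Mark subcategory. This is only an illustrative sketch: the three sample characters are chosen here and do not come from the pull request.

```ruby
samples = {
  "\u0301" => 'COMBINING ACUTE ACCENT',     # Mn (nonspacing mark)
  "\u0903" => 'DEVANAGARI SIGN VISARGA',    # Mc (spacing combining mark)
  "\u20DD" => 'COMBINING ENCLOSING CIRCLE'  # Me (enclosing mark)
}

samples.each do |char, name|
  old_zero = char.match?(/\p{M}/)  # old rule: any Mark was treated as width 0
  new_zero = char.match?(/\p{Mn}/) # new rule: only Nonspacing Marks are width 0
  puts format('U+%04X %-27s old_zero=%s new_zero=%s', char.ord, name, old_zero, new_zero)
end
```

Only the Mc and Me samples change behavior: they matched the old rule but not the new one, so they now get their width from the East Asian Width table instead of being forced to 0.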

ima1zumi · 2024-08-26T16:49:47Z

bin/generate_east_asian_width

@@ -13,66 +25,32 @@ open(ARGV.first, 'rt') do |f|
    unicode_version = nil
  end

-  list = []
+  widths = []
+  letter_modifiers = []


This is not used.

Suggested change

-  letter_modifiers = []

ima1zumi · 2024-08-26T17:43:05Z

lib/reline/unicode.rb

+    elsif size == 1 && utf8_mbchar.size >= 2
+      second_char_ord = utf8_mbchar[1].ord
+      # Halfwidth Dakuten Handakuten
+      # Only these two character has Letter Modifier category and can be combined in a single grapheme cluster
+      (second_char_ord == 0xFF9E || second_char_ord == 0xFF9F) ? 2 : 1


📝 This is a special case specifically for when there is a character with width 1 before U+FF9E or U+FF9F. In other words, it allows the width to be calculated correctly for something like ‘ｶﾞ’. However, it cannot calculate the width correctly for ‘かﾞ’.
There are many other combining characters, and I’m not sure why this character alone is being handled, but there doesn’t seem to be a particular reason to remove it.
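
The special case and its limitation can be sketched in isolation. In this sketch, `cluster_width` and its `base_width` argument are hypothetical: `base_width` stands in for the table lookup of the first character, which the real method performs itself.

```ruby
HALFWIDTH_VOICED_MARKS = [0xFF9E, 0xFF9F].freeze # halfwidth dakuten, handakuten

# Mirrors the quoted branch: only a width-1 first character followed by a
# halfwidth (han)dakuten is widened to 2. Everything else keeps base_width.
def cluster_width(cluster, base_width)
  return base_width unless base_width == 1 && cluster.size >= 2

  HALFWIDTH_VOICED_MARKS.include?(cluster[1].ord) ? 2 : 1
end

p cluster_width("\uFF76", 1)       # halfwidth KA alone
p cluster_width("\uFF76\uFF9E", 1) # halfwidth KA + halfwidth dakuten
p cluster_width("\u304B\uFF9E", 2) # hiragana KA + halfwidth dakuten: mark not counted
```

The last call shows the case mentioned in the review: when the first character is already width 2, the branch is skipped and the trailing mark adds nothing.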

ima1zumi

LGTM!

ima1zumi · 2024-08-29T17:20:59Z

bin/generate_east_asian_width

@@ -5,6 +5,18 @@ if ARGV.empty?
  exit 1
 end

+def unicode_width(type, category)
+  return 0 if category == 'Mn' # Nonspacing Mark


Thanks for your explanation. That makes sense to me now.

ima1zumi · 2024-08-29T17:33:13Z

bin/generate_east_asian_width

+
+  chunks = widths.each_with_index.chunk { |width, _idx| width || 1 }
+  chunk_last_ords = chunks.map { |width, chunk| [chunk.last.last, width] }
+  chunk_last_ords << [0x7fffffff, 1]


Ah, I see. I agree that using 0x7fffffff is good, just as you suggested.

(ruby/reline#632) ruby/reline@0851e93640

mblumtritt reviewed Jul 10, 2024

tompng force-pushed the faster_mbwidth branch from 0ceb65d to 11a1f84 on August 19, 2024 14:18

ima1zumi reviewed Aug 26, 2024

Calculate mbchar width with bsearch

c77d709

tompng force-pushed the faster_mbwidth branch from 11a1f84 to c77d709 on August 26, 2024 19:05

ima1zumi approved these changes Aug 29, 2024

ima1zumi merged commit 0851e93 into ruby:master Aug 29, 2024
40 checks passed

matzbot pushed a commit to ruby/ruby that referenced this pull request Aug 29, 2024

[ruby/reline] Calculate mbchar width with bsearch

b74e0c5

(ruby/reline#632) ruby/reline@0851e93640

tompng deleted the faster_mbwidth branch August 29, 2024 17:45


