Rework Emoji support …

- Enable by default
- Only reduce Emoji string width for RGI Emoji (configurable)
- VS16 turns Emoji characters of width 1 into full-width #27
- Add option to make Text Presentation Emoji full-width
janlelis · Nov 11, 2024 · 42fced8
1 parent cf36c21
commit 42fced8
Showing 6 changed files with 258 additions and 71 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,14 @@
 # CHANGELOG
 
+## 3.0.0 (unreleased)
+
+Rework Emoji support:
+
+- Only reduce Emoji width of RGI Emoji (configurable)
+- VS16 turns Emoji characters of width 1 into full-width
+- Add option to make Text Presentation Emoji full-width
+- Emoji widths are now enabled by default
+
 ## 2.6.0
 
 - Unicode 16

diff --git a/Gemfile b/Gemfile
@@ -2,5 +2,4 @@ source "https://rubygems.org"
 
 gemspec
 
-gem "unicode-emoji"
 gem "irb"
diff --git a/README.md b/README.md
@@ -1,14 +1,18 @@
-## Unicode::DisplayWidth [![[version]](https://badge.fury.io/rb/unicode-display_width.svg)](https://badge.fury.io/rb/unicode-display_width) [<img src="https://github.com/janlelis/unicode-display_width/workflows/Test/badge.svg" />](https://github.com/janlelis/unicode-display_width/actions?query=workflow%3ATest)
+# Unicode::DisplayWidth [![[version]](https://badge.fury.io/rb/unicode-display_width.svg)](https://badge.fury.io/rb/unicode-display_width) [<img src="https://github.com/janlelis/unicode-display_width/workflows/Test/badge.svg" />](https://github.com/janlelis/unicode-display_width/actions?query=workflow%3ATest)
 
 Determines the monospace display width of a string in Ruby. Useful for all kinds of terminal-based applications. Implementation based on [EastAsianWidth.txt](https://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt) and other data, 100% in Ruby. It does not rely on the OS vendor (like [wcwidth()](https://github.com/janlelis/wcswidth-ruby)) to provide an up-to-date method for measuring string width.
 
 Unicode version: **16.0.0** (September 2024)
 
-## Version 2.4.2 — Performance Updates
+## Gem Version 3.0 — Improved Emoji Support
+
+**Emoji support is now enabled by default.** See below for description and configuration possibilities.
+
+## Gem Version 2.4.2 — Performance Updates
 
 **If you use this gem, you should really upgrade to 2.4.2 or newer. It's often 100x faster, sometimes even 1000x and more!**
 
-This is possible because the gem now detects if you use very basic (and common) characters, like ASCII characters. Furthermore, the character width lookup code has been optimized, so even when full-width characters are involved, the gem is much faster now.
+This is possible because the gem now detects if you use very basic (and common) characters, like ASCII characters. Furthermore, the character width lookup code has been optimized, so even when full-width or ambiguous characters are involved, the gem is much faster now.
 
 ## Introduction to Character Widths
 
@@ -20,7 +24,8 @@ Further at the top means higher precedence. Please expect changes to this algori
 
 Width  | Characters                   | Comment
 -------|------------------------------|--------------------------------------------------
-X      | (user defined)               | Overwrites any other values
+?      | (user defined)               | Overwrites any other values
+?      | Emoji                        | See "How this Library Handles Emoji Width" below
 -1     | `"\b"`                       | Backspace (total width never below 0)
 0      | `"\0"`, `"\x05"`, `"\a"`, `"\n"`, `"\v"`, `"\f"`, `"\r"`, `"\x0E"`, `"\x0F"` | [C0 control codes](https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C0_.28ASCII_and_derivatives.29) which do not change horizontal width
 1      | `"\u{00AD}"`                 | SOFT HYPHEN
@@ -46,16 +51,14 @@ Or add to your Gemfile:
 
 ## Usage
 
-### Classic API
-
 ```ruby
 require 'unicode/display_width'
 
 Unicode::DisplayWidth.of("⚀") # => 1
 Unicode::DisplayWidth.of("一") # => 2
 ```
 
-#### Ambiguous Characters
+### Ambiguous Characters
 
 The second parameter defines the value returned by characters defined as ambiguous:
 
@@ -64,7 +67,7 @@ Unicode::DisplayWidth.of("·", 1) # => 1
 Unicode::DisplayWidth.of("·", 2) # => 2
 ```
 
-#### Custom Overwrites
+### Custom Overwrites
 
 You can overwrite how to handle specific code points by passing a hash (or even a proc) as third parameter:
 
@@ -75,23 +78,53 @@ Unicode::DisplayWidth.of("a\tb", 1, "\t".ord => 10)) # => tab counted as 10, so
 Please note that using overwrites disables some performance optimizations of this gem.
 
 
-#### Emoji Support
+### Emoji Options
 
-Emoji width support is included, but it must be activated manually. It will adjust the string's size for modifier and zero-width joiner sequences. You also need to add the [unicode-emoji](https://github.com/janlelis/unicode-emoji) gem to your Gemfile:
+The RGI Emoji set is automatically detected to adjust the width of the string. This can be disabled by passing `emoji: false` as fourth parameter:
 
 ```ruby
-gem 'unicode-display_width'
-gem 'unicode-emoji'
+Unicode::DisplayWidth.of "🤾🏽‍♀️" # => 2
+Unicode::DisplayWidth.of "🤾🏽‍♀️", 1, {}, emoji: false # => 5
 ```
 
-Enable the emoji string width adjustments by passing `emoji: true` as fourth parameter:
+Disabling Emoji support yields wrong results, as illustrated in the example above, but increases performance of display width calculation by ~30%.
+
+You can configure Emoji options by passing a Hash like this:
 
 ```ruby
-Unicode::DisplayWidth.of "🤾🏽‍♀️" # => 5
-Unicode::DisplayWidth.of "🤾🏽‍♀️", 1, {}, emoji: true # => 2
+Unicode::DisplayWidth.of "string", 1, {}, emoji: { wide_text_presentation: false, sequences: :rgi_fqe }
 ```
 
-#### Usage with String Extension
+#### How this Library Handles Emoji Width
+
+There are many Emoji which get constructed by combining other Emoji in a sequence. This makes measuring the width complicated, since terminals might either display the combined Emoji or the separate parts of the Emoji individually.
+
+**Char Width** = No special handling, uses mechanism from table above
+
+Emoji Type  | Width / Comment | Configuration Options
+------------|-----------------|----------------------
+Basic/Single Emoji character without Variation Selector or with VS15 (Text) | Char Width | Use option `wide_text_presentation: true` to force textual Emoji to always be of width 2
+Basic/Single Emoji character with VS16 (Emoji) | 2 | -
+Emoji Sequence | Recommended Emoji sequences: 2, above rules otherwise | Option `sequences:` explained below
+
+The `sequences:` option configures which types of Emoji sequences are considered to have a width of 2. Other sequences are treated as non-combined Emoji, so the widths of all partial Emoji add up (e.g. width of one basic Emoji + one skin tone modifier + another basic Emoji).
+
+The value passed to the `sequences:` option can be one of:
+
+- `:none`: No width adjustments for Emoji sequences: all partial Emoji treated separately
+- `:rgi_fqe` (default): All fully-qualified RGI Emoji sequences are considered to have a width of 2
+- `:rgi_mqe`: All fully- and minimally-qualified RGI Emoji sequences are considered to have a width of 2
+- `:rgi_uqe`: All RGI Emoji sequences, regardless of qualification status are considered to have a width of 2
+- `:all`: All possible/well-formed Emoji sequences are considered to have a width of 2
+
+*RGI Emoji:* Emoji Recommended for General Interchange
+
+*Qualification:* Whether an Emoji sequence has all required VS16 codepoints
+
+See [emoji-test.txt](https://www.unicode.org/Public/emoji/16.0/emoji-test.txt), the [unicode-emoji gem](https://github.com/janlelis/unicode-emoji) and [UTS-51](https://www.unicode.org/reports/tr51/#def_qualified_emoji_character) for more details about qualified and unqualified Emoji sequences.
+
+
+### Usage with String Extension
 
 ```ruby
 require 'unicode/display_width/string_ext'
@@ -110,11 +143,11 @@ require 'unicode/display_width'
 display_width = Unicode::DisplayWidth.new(
   # ambiguous: 1,
   overwrite: { "A".ord => 100 },
-  emoji: true,
+  emoji: { wide_text_presentation: true },
 )
 
 display_width.of "⚀" # => 1
-display_width.of "🤾🏽‍♀️" # => 2
+display_width.of "⏱" # => 2
 display_width.of "A" # => 100
 ```
 

diff --git a/lib/unicode/display_width.rb b/lib/unicode/display_width.rb
@@ -1,5 +1,7 @@
 # frozen_string_literal: true
 
+require "unicode/emoji"
+
 require_relative "display_width/constants"
 require_relative "display_width/index"
 
@@ -8,28 +10,61 @@ class DisplayWidth
     INITIAL_DEPTH = 0x10000
     ASCII_NON_ZERO_REGEX = /[\0\x05\a\b\n\v\f\r\x0E\x0F]/
     FIRST_4096 = decompress_index(INDEX[0][0], 1)
+    DEFAULT_EMOJI_OPTIONS = {
+      sequences: :rgi_fqe,
+      wide_text_presentation: false,
+    }
+    EMOJI_SEQUENCES_REGEX_MAPPING = {
+      rgi_fqe: :REGEX,
+      rgi_mqe: :REGEX_INCLUDE_MQE,
+      rgi_uqe: :REGEX_INCLUDE_MQE_UQE,
+      all: :REGEX_WELL_FORMED,
+    }
+    EMOJI_NOT_POSSIBLE = /\A[#*0-9]\z/
 
     def self.of(string, ambiguous = 1, overwrite = {}, options = {})
-      if overwrite.empty?
-        # Optimization for ASCII-only strings without certain control symbols
-        if string.ascii_only?
-          if string.match?(ASCII_NON_ZERO_REGEX)
-            res = string.gsub(ASCII_NON_ZERO_REGEX, "").size - string.count("\b")
-            res < 0 ? 0 : res
-          else
-            string.size
-          end
-        else
-          width_no_overwrite(string, ambiguous, options)
+      if !overwrite.empty?
+        return width_frame(string, options) do |string|
+          width_all_features(string, ambiguous, overwrite)
         end
+      end
+
+      if !string.ascii_only?
+        return width_frame(string, options) do |string|
+          width_no_overwrite(string, ambiguous)
+        end
+      end
+
+      # Optimization for ASCII-only strings without certain control symbols
+      if string.match?(ASCII_NON_ZERO_REGEX)
+        res = string.gsub(ASCII_NON_ZERO_REGEX, "").size - string.count("\b")
+        return res < 0 ? 0 : res
+      end
+
+      # Pure ASCII
+      string.size
+    end
+
+    def self.width_frame(string, options)
+      # Retrieve Emoji width
+      if options[:emoji] == false
+        res = 0
       else
-        width_all_features(string, ambiguous, overwrite, options)
+        emoji_options = ( options[:emoji] == true || !options ) ?
+          DEFAULT_EMOJI_OPTIONS :
+          options[:emoji]
+        res, string = emoji_width(string, **emoji_options)
       end
+
+      # Get general width
+      res += yield(string)
+
+      # Return result + prevent negative lengths
+      res < 0 ? 0 : res
     end
 
-    def self.width_no_overwrite(string, ambiguous, options = {})
-      # Sum of all chars widths
-      res = string.codepoints.sum{ |codepoint|
+    def self.width_no_overwrite(string, ambiguous, _ = {})
+      string.codepoints.sum{ |codepoint|
         if codepoint > 15 && codepoint < 161 # very common
           next 1
         elsif codepoint < 0x1001
@@ -45,18 +80,11 @@ def self.width_no_overwrite(string, ambiguous, options = {})
 
         width == :A ? ambiguous : (width || 1)
       }
-
-      # Substract emoji error
-      res -= emoji_extra_width_of(string, ambiguous) if options[:emoji]
-
-      # Return result + prevent negative lengths
-      res < 0 ? 0 : res
     end
 
     # Same as .width_no_overwrite - but with applying overwrites for each char
-    def self.width_all_features(string, ambiguous, overwrite, options)
-      # Sum of all chars widths
-      res = string.codepoints.sum{ |codepoint|
+    def self.width_all_features(string, ambiguous, overwrite)
+      string.codepoints.sum{ |codepoint|
         next overwrite[codepoint] if overwrite[codepoint]
 
         if codepoint > 15 && codepoint < 161 # very common
@@ -74,31 +102,54 @@ def self.width_all_features(string, ambiguous, overwrite, options)
 
         width == :A ? ambiguous : (width || 1)
       }
+    end
 
-      # Substract emoji error
-      res -= emoji_extra_width_of(string, ambiguous, overwrite) if options[:emoji]
 
-      # Return result + prevent negative lengths
-      res < 0 ? 0 : res
-    end
+    def self.emoji_width(string, sequences: :rgi_fqe, wide_text_presentation: false)
+      adjustments = 0
 
+      if regex = EMOJI_SEQUENCES_REGEX_MAPPING[sequences]
+        emoji_sequence_regex = Unicode::Emoji.const_get(regex)
+      else # sequences == :none
+        emoji_sequence_regex = /$^/
+      end
 
-    def self.emoji_extra_width_of(string, ambiguous = 1, overwrite = {}, _ = {})
-      require "unicode/emoji"
+      # For each string possibly an emoji
+      no_emoji_string = string.encode("utf-8").gsub(Unicode::Emoji::REGEX_POSSIBLE){ |emoji_candidate|
+        # Skip notorious false positives
+        if EMOJI_NOT_POSSIBLE.match?(emoji_candidate)
+          emoji_candidate
 
-      extra_width = 0
-      modifier_regex = /[#{ Unicode::Emoji::EMOJI_MODIFIERS.pack("U*") }]/
-      zwj_regex = /(?<=#{ [Unicode::Emoji::ZWJ].pack("U") })./
+        # Check if we have a combined Emoji with width 2
+        elsif emoji_candidate == emoji_candidate[emoji_sequence_regex]
+          adjustments += 2
+          ""
 
-      string.scan(Unicode::Emoji::REGEX){ |emoji|
-        extra_width += 2 * emoji.scan(modifier_regex).size
+        # We are dealing with a default text presentation emoji or a well-formed sequence not matching the above Emoji set
+        else
+          # Ensure all explicit VS16 sequences have width 2
+          emoji_candidate.gsub!(Unicode::Emoji::REGEX_BASIC){ |basic_emoji|
+            if basic_emoji.size == 2 # VS16 present
+              adjustments += 2
+              ""
+            else
+              basic_emoji
+            end
+          }
+
+          # Apply wide_text_presentation option if present
+          if wide_text_presentation
+            emoji_candidate.gsub!(Unicode::Emoji::REGEX_TEXT){ |text_emoji|
+              adjustments += 2
+              ""
+            }
+          end
 
-        emoji.scan(zwj_regex){ |zwj_succ|
-          extra_width += self.of(zwj_succ, ambiguous, overwrite)
-        }
+          emoji_candidate
+        end
       }
 
-      extra_width
+      [adjustments, no_emoji_string]
     end
 
     def initialize(ambiguous: 1, overwrite: {}, emoji: false)
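
The `width_frame` method in this diff turns the `:emoji` option into keyword arguments for `emoji_width`. Note that `!options` is always false for a Hash, and that splatting `nil` with `**` raises a `TypeError` on Ruby versions before 3.4. The following is a minimal sketch of one way to normalize the option, assuming a missing `:emoji` key should behave like `true` (as the changelog entry "enabled by default" suggests). `resolve_emoji_options` is a hypothetical helper and not part of the gem.

```ruby
# Defaults copied from the diff above.
DEFAULT_EMOJI_OPTIONS = {
  sequences: :rgi_fqe,
  wide_text_presentation: false,
}.freeze

# Hypothetical helper (not part of the gem): returns nil when Emoji handling
# is disabled, otherwise a Hash that is safe to splat into keyword arguments.
def resolve_emoji_options(options)
  emoji = options[:emoji]

  case emoji
  when false
    nil                                # Emoji handling explicitly disabled
  when true, nil
    DEFAULT_EMOJI_OPTIONS              # enabled explicitly or by default
  else
    DEFAULT_EMOJI_OPTIONS.merge(emoji) # partial Hash: fill in missing keys
  end
end

p resolve_emoji_options({})
p resolve_emoji_options(emoji: { wide_text_presentation: true })
```

Merging a partial Hash into the defaults is a design choice of this sketch. In the diff, the keyword defaults of `emoji_width` play the same role.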