Rework Emoji support …
- Enable by default
- Only reduce Emoji string width for RGI Emoji (configurable)
- VS16 turns Emoji characters of width 1 into full-width #27
- Add option to make Text Presentation Emoji full-width
janlelis committed Nov 11, 2024
1 parent cf36c21 commit 42fced8
Showing 6 changed files with 258 additions and 71 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# CHANGELOG

## 3.0.0 (unreleased)

Rework Emoji support:

- Only reduce Emoji width of RGI Emoji (configurable)
- VS16 turns Emoji characters of width 1 into full-width
- Add option to make Text Presentation Emoji full-width
- Emoji widths are now enabled by default

## 2.6.0

- Unicode 16
1 change: 0 additions & 1 deletion Gemfile
@@ -2,5 +2,4 @@ source "https://rubygems.org"

gemspec

gem "unicode-emoji"
gem "irb"
69 changes: 51 additions & 18 deletions README.md
@@ -1,14 +1,18 @@
## Unicode::DisplayWidth [![[version]](https://badge.fury.io/rb/unicode-display_width.svg)](https://badge.fury.io/rb/unicode-display_width) [<img src="https://github.com/janlelis/unicode-display_width/workflows/Test/badge.svg" />](https://github.com/janlelis/unicode-display_width/actions?query=workflow%3ATest)
# Unicode::DisplayWidth [![[version]](https://badge.fury.io/rb/unicode-display_width.svg)](https://badge.fury.io/rb/unicode-display_width) [<img src="https://github.com/janlelis/unicode-display_width/workflows/Test/badge.svg" />](https://github.com/janlelis/unicode-display_width/actions?query=workflow%3ATest)

Determines the monospace display width of a string in Ruby. Useful for all kinds of terminal-based applications. Implementation based on [EastAsianWidth.txt](https://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt) and other data, 100% in Ruby. It does not rely on the OS vendor (like [wcwidth()](https://github.com/janlelis/wcswidth-ruby)) to provide an up-to-date method for measuring string width.

Unicode version: **16.0.0** (September 2024)

## Version 2.4.2 — Performance Updates
## Gem Version 3.0 — Improved Emoji Support

**Emoji support is now enabled by default.** See below for description and configuration possibilities.

## Gem Version 2.4.2 — Performance Updates

**If you use this gem, you should really upgrade to 2.4.2 or newer. It's often 100x faster, sometimes even 1000x and more!**

This is possible because the gem now detects if you use very basic (and common) characters, like ASCII characters. Furthermore, the character width lookup code has been optimized, so even when full-width characters are involved, the gem is much faster now.
This is possible because the gem now detects if you use very basic (and common) characters, like ASCII characters. Furthermore, the character width lookup code has been optimized, so even when full-width or ambiguous characters are involved, the gem is much faster now.
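
The ASCII detection described above can be sketched as follows. This is a minimal, self-contained illustration modeled on the fast path visible in this commit's `lib/unicode/display_width.rb`; the helper name `ascii_fast_width` is hypothetical and not part of the gem's API, and returning `nil` for non-ASCII input is an assumption made here so the sketch stays standalone.

```ruby
# Control characters that do not take up one column (same set as in the gem's source)
ASCII_NON_ZERO_REGEX = /[\0\x05\a\b\n\v\f\r\x0E\x0F]/

# Hypothetical helper: returns the width for ASCII-only strings,
# or nil to signal that the full per-codepoint lookup is needed
def ascii_fast_width(string)
  return nil unless string.ascii_only?

  if string.match?(ASCII_NON_ZERO_REGEX)
    # Drop zero-width control characters, then let each backspace take one column away
    width = string.gsub(ASCII_NON_ZERO_REGEX, "").size - string.count("\b")
    return width < 0 ? 0 : width
  end

  # Pure printable ASCII: width equals the number of characters
  string.size
end

ascii_fast_width("hello") # => 5
ascii_fast_width("a\nb")  # => 2
ascii_fast_width("ab\b")  # => 1
```

Because `String#ascii_only?` and `String#size` are cheap, the common case never touches the width index at all.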

## Introduction to Character Widths

@@ -20,7 +24,8 @@ Further at the top means higher precedence. Please expect changes to this algori

Width | Characters | Comment
-------|------------------------------|--------------------------------------------------
X | (user defined) | Overwrites any other values
? | (user defined) | Overwrites any other values
? | Emoji | See "How this Library Handles Emoji Width" below
-1 | `"\b"` | Backspace (total width never below 0)
0 | `"\0"`, `"\x05"`, `"\a"`, `"\n"`, `"\v"`, `"\f"`, `"\r"`, `"\x0E"`, `"\x0F"` | [C0 control codes](https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C0_.28ASCII_and_derivatives.29) which do not change horizontal width
1 | `"\u{00AD}"` | SOFT HYPHEN
@@ -46,16 +51,14 @@ Or add to your Gemfile:

## Usage

### Classic API

```ruby
require 'unicode/display_width'

Unicode::DisplayWidth.of("") # => 1
Unicode::DisplayWidth.of("") # => 2
```

#### Ambiguous Characters
### Ambiguous Characters

The second parameter defines the value returned by characters defined as ambiguous:

@@ -64,7 +67,7 @@ Unicode::DisplayWidth.of("·", 1) # => 1
Unicode::DisplayWidth.of("·", 2) # => 2
```

#### Custom Overwrites
### Custom Overwrites

You can overwrite how to handle specific code points by passing a hash (or even a proc) as third parameter:

@@ -75,23 +78,53 @@ Unicode::DisplayWidth.of("a\tb", 1, "\t".ord => 10)) # => tab counted as 10, so
Please note that using overwrites disables some performance optimizations of this gem.


#### Emoji Support
### Emoji Options

Emoji width support is included, but it must be activated manually. It will adjust the string's size for modifier and zero-width joiner sequences. You also need to add the [unicode-emoji](https://github.com/janlelis/unicode-emoji) gem to your Gemfile:
The RGI Emoji set is automatically detected to adjust the width of the string. This can be disabled by passing `emoji: false` as fourth parameter:

```ruby
gem 'unicode-display_width'
gem 'unicode-emoji'
Unicode::DisplayWidth.of "🤾🏽‍♀️" # => 2
Unicode::DisplayWidth.of "🤾🏽‍♀️", 1, {}, emoji: false # => 5
```

Enable the emoji string width adjustments by passing `emoji: true` as fourth parameter:
Disabling Emoji support yields wrong results, as illustrated in the example above, but increases performance of display width calculation by ~30%.

You can configure Emoji options by passing a Hash like this:

```ruby
Unicode::DisplayWidth.of "🤾🏽‍♀️" # => 5
Unicode::DisplayWidth.of "🤾🏽‍♀️", 1, {}, emoji: true # => 2
Unicode::DisplayWidth.of "string", 1, {}, emoji: { wide_text_presentation: false, sequences: :rgi_fqe }
```

#### Usage with String Extension
#### How this Library Handles Emoji Width

There are many Emoji which get constructed by combining other Emoji in a sequence. This makes measuring the width complicated, since terminals might either display the combined Emoji or the separate parts of the Emoji individually.

**Char Width** = No special handling, uses mechanism from table above

Emoji Type | Width / Comment | Configuration Options
------------|-----------------|----------------------
Basic/Single Emoji character without Variation Selector or with VS15 (Text) | Char Width | Use option `wide_text_presentation: true` to force textual Emoji to always be of width 2
Basic/Single Emoji character with VS16 (Emoji) | 2 | -
Emoji Sequence | Recommended Emoji sequences: 2, above rules otherwise | Option `sequences:` explained below
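
The VS16 rule from the table can be illustrated with a small self-contained sketch. It follows the strip-and-count approach used by `emoji_width` in this commit (matched Emoji are removed from the string and a fixed width of 2 is added to a running total), but `VS16_PAIR` is a deliberately simplified stand-in for `Unicode::Emoji::REGEX_BASIC`, and `strip_and_count` is a hypothetical name, not a method of the gem.

```ruby
VS16 = "\u{FE0F}"

# Simplified assumption: any single character directly followed by VS16
VS16_PAIR = /.#{VS16}/

# Returns [width adjustment, string without the VS16 pairs];
# the remaining string would then go through the regular char width lookup
def strip_and_count(string)
  adjustments = 0
  rest = string.gsub(VS16_PAIR) do
    adjustments += 2 # an explicit Emoji presentation always counts as 2
    ""
  end
  [adjustments, rest]
end

strip_and_count("a\u{2764}\u{FE0F}b") # => [2, "ab"]
```

Removing the matched Emoji before the general lookup avoids counting its code points a second time.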

The `sequences:` option configures which types of Emoji sequences are considered to have a width of 2. Other sequences are treated as non-combined Emoji, so the widths of all partial Emoji add up (e.g. width of one basic Emoji + one skin tone modifier + another basic Emoji).

The value passed to the `sequences:` option can be one of:

- `:none`: No width adjustments for Emoji sequences: all partial Emoji treated separately
- `:rgi_fqe` (default): All fully-qualified RGI Emoji sequences are considered to have a width of 2
- `:rgi_mqe`: All fully- and minimally-qualified RGI Emoji sequences are considered to have a width of 2
- `:rgi_uqe`: All RGI Emoji sequences, regardless of qualification status, are considered to have a width of 2
- `:all`: All possible/well-formed Emoji sequences are considered to have a width of 2

*RGI Emoji:* Emoji Recommended for General Interchange

*Qualification:* Whether an Emoji sequence has all required VS16 codepoints

See [emoji-test.txt](https://www.unicode.org/Public/emoji/16.0/emoji-test.txt), the [unicode-emoji gem](https://github.com/janlelis/unicode-emoji) and [UTS-51](https://www.unicode.org/reports/tr51/#def_qualified_emoji_character) for more details about qualified and unqualified Emoji sequences.
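
To make the difference between a recognized and an unrecognized sequence concrete, here is a worked sketch for "🤾🏽‍♀️", the example used under "Emoji Options". The per-code-point widths in `PART_WIDTHS` are assumptions for illustration (with ambiguous width 1), and `sketch_width` is a hypothetical helper that does not use the gem. It only shows why the same string measures 2 when treated as one Emoji and 5 when its parts are added up.

```ruby
# Assumed widths of the individual code points (illustration only)
PART_WIDTHS = {
  0x1F93E => 2, # PERSON PLAYING HANDBALL
  0x1F3FD => 2, # EMOJI MODIFIER FITZPATRICK TYPE-4
  0x200D  => 0, # ZERO WIDTH JOINER
  0x2640  => 1, # FEMALE SIGN
  0xFE0F  => 0, # VARIATION SELECTOR-16
}

HANDBALL = [0x1F93E, 0x1F3FD, 0x200D, 0x2640, 0xFE0F].pack("U*")

# Hypothetical helper: a recognized sequence counts as 2,
# anything else is the sum of its parts
def sketch_width(string, recognized_sequences)
  return 2 if recognized_sequences.include?(string)

  string.codepoints.sum { |codepoint| PART_WIDTHS.fetch(codepoint) }
end

sketch_width(HANDBALL, [HANDBALL]) # => 2 (sequence is in the accepted set)
sketch_width(HANDBALL, [])         # => 5 (2 + 2 + 0 + 1 + 0)
```

In the gem itself the accepted set is chosen through `sequences:`; the explicit list passed here merely stands in for that choice.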


### Usage with String Extension

```ruby
require 'unicode/display_width/string_ext'
@@ -110,11 +143,11 @@ require 'unicode/display_width'
display_width = Unicode::DisplayWidth.new(
# ambiguous: 1,
overwrite: { "A".ord => 100 },
emoji: true,
emoji: { wide_text_presentation: true },
)

display_width.of "" # => 1
display_width.of "🤾🏽‍♀️" # => 2
display_width.of "" # => 2
display_width.of "A" # => 100
```

131 changes: 91 additions & 40 deletions lib/unicode/display_width.rb
@@ -1,5 +1,7 @@
# frozen_string_literal: true

require "unicode/emoji"

require_relative "display_width/constants"
require_relative "display_width/index"

@@ -8,28 +10,61 @@ class DisplayWidth
INITIAL_DEPTH = 0x10000
ASCII_NON_ZERO_REGEX = /[\0\x05\a\b\n\v\f\r\x0E\x0F]/
FIRST_4096 = decompress_index(INDEX[0][0], 1)
DEFAULT_EMOJI_OPTIONS = {
sequences: :rgi_fqe,
wide_text_presentation: false,
}
EMOJI_SEQUENCES_REGEX_MAPPING = {
rgi_fqe: :REGEX,
rgi_mqe: :REGEX_INCLUDE_MQE,
rgi_uqe: :REGEX_INCLUDE_MQE_UQE,
all: :REGEX_WELL_FORMED,
}
EMOJI_NOT_POSSIBLE = /\A[#*0-9]\z/

def self.of(string, ambiguous = 1, overwrite = {}, options = {})
if overwrite.empty?
# Optimization for ASCII-only strings without certain control symbols
if string.ascii_only?
if string.match?(ASCII_NON_ZERO_REGEX)
res = string.gsub(ASCII_NON_ZERO_REGEX, "").size - string.count("\b")
res < 0 ? 0 : res
else
string.size
end
else
width_no_overwrite(string, ambiguous, options)
if !overwrite.empty?
return width_frame(string, options) do |string|
width_all_features(string, ambiguous, overwrite)
end
end

if !string.ascii_only?
return width_frame(string, options) do |string|
width_no_overwrite(string, ambiguous)
end
end

# Optimization for ASCII-only strings without certain control symbols
if string.match?(ASCII_NON_ZERO_REGEX)
res = string.gsub(ASCII_NON_ZERO_REGEX, "").size - string.count("\b")
return res < 0 ? 0 : res
end

# Pure ASCII
string.size
end

def self.width_frame(string, options)
# Retrieve Emoji width
if options[:emoji] == false
res = 0
else
width_all_features(string, ambiguous, overwrite, options)
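# true and nil (option not given) both fall back to the defaults;
# the original `!options` check was always false because a Hash is truthy
emoji_options = ( options[:emoji] == true || !options[:emoji] ) ?
DEFAULT_EMOJI_OPTIONS :
options[:emoji]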
res, string = emoji_width(string, **emoji_options)
end

# Get general width
res += yield(string)

# Return result + prevent negative lengths
res < 0 ? 0 : res
end

def self.width_no_overwrite(string, ambiguous, options = {})
# Sum of all chars widths
res = string.codepoints.sum{ |codepoint|
def self.width_no_overwrite(string, ambiguous, _ = {})
string.codepoints.sum{ |codepoint|
if codepoint > 15 && codepoint < 161 # very common
next 1
elsif codepoint < 0x1001
@@ -45,18 +80,11 @@ def self.width_no_overwrite(string, ambiguous, options = {})

width == :A ? ambiguous : (width || 1)
}

# Subtract emoji error
res -= emoji_extra_width_of(string, ambiguous) if options[:emoji]

# Return result + prevent negative lengths
res < 0 ? 0 : res
end

# Same as .width_no_overwrite - but with applying overwrites for each char
def self.width_all_features(string, ambiguous, overwrite, options)
# Sum of all chars widths
res = string.codepoints.sum{ |codepoint|
def self.width_all_features(string, ambiguous, overwrite)
string.codepoints.sum{ |codepoint|
next overwrite[codepoint] if overwrite[codepoint]

if codepoint > 15 && codepoint < 161 # very common
@@ -74,31 +102,54 @@ def self.width_all_features(string, ambiguous, overwrite, options)

width == :A ? ambiguous : (width || 1)
}
end

# Subtract emoji error
res -= emoji_extra_width_of(string, ambiguous, overwrite) if options[:emoji]

# Return result + prevent negative lengths
res < 0 ? 0 : res
end
def self.emoji_width(string, sequences: :rgi_fqe, wide_text_presentation: false)
adjustments = 0

if regex = EMOJI_SEQUENCES_REGEX_MAPPING[sequences]
emoji_sequence_regex = Unicode::Emoji.const_get(regex)
else # sequences == :none
emoji_sequence_regex = /$^/
end

def self.emoji_extra_width_of(string, ambiguous = 1, overwrite = {}, _ = {})
require "unicode/emoji"
# For each string possibly an emoji
no_emoji_string = string.encode("utf-8").gsub(Unicode::Emoji::REGEX_POSSIBLE){ |emoji_candidate|
# Skip notorious false positives
if EMOJI_NOT_POSSIBLE.match?(emoji_candidate)
emoji_candidate

extra_width = 0
modifier_regex = /[#{ Unicode::Emoji::EMOJI_MODIFIERS.pack("U*") }]/
zwj_regex = /(?<=#{ [Unicode::Emoji::ZWJ].pack("U") })./
# Check if we have a combined Emoji with width 2
elsif emoji_candidate == emoji_candidate[emoji_sequence_regex]
adjustments += 2
""

string.scan(Unicode::Emoji::REGEX){ |emoji|
extra_width += 2 * emoji.scan(modifier_regex).size
# We are dealing with a default text presentation emoji or a well-formed sequence not matching the above Emoji set
else
# Ensure all explicit VS16 sequences have width 2
emoji_candidate.gsub!(Unicode::Emoji::REGEX_BASIC){ |basic_emoji|
if basic_emoji.size == 2 # VS16 present
adjustments += 2
""
else
basic_emoji
end
}

# Apply wide_text_presentation option if present
if wide_text_presentation
emoji_candidate.gsub!(Unicode::Emoji::REGEX_TEXT){ |text_emoji|
adjustments += 2
""
}
end

emoji.scan(zwj_regex){ |zwj_succ|
extra_width += self.of(zwj_succ, ambiguous, overwrite)
}
emoji_candidate
end
}

extra_width
[adjustments, no_emoji_string]
end

def initialize(ambiguous: 1, overwrite: {}, emoji: false)