incorrect handling of CJK ambiguous width characters #6560

Closed
ctrlcctrlv opened this issue Aug 18, 2023 · 6 comments

@ctrlcctrlv
Contributor

ctrlcctrlv commented Aug 18, 2023

Describe the bug
UAX №11 defines East Asian Width, or CJK Width.

The spec reads:

Ambiguous width characters are all those characters that can occur as fullwidth characters in any of a number of East Asian legacy character encodings. They have a “resolved” width of either narrow or wide depending on the context of their use. If they are not used in the context of the specific legacy encoding to which they belong, their width resolves to narrow. Otherwise, it resolves to fullwidth or halfwidth. The term context as used here includes extra information such as explicit markup, knowledge of the source code page, font information, or language and script identification. For example:

  • Greek characters resolve to narrow when used with a standard Greek font, because there is no East Asian legacy context.
  • Private-use character codes and the replacement character have ambiguous width, because they may stand in for characters of any width.
  • Ambiguous quotation marks are generally resolved to wide when they enclose and are adjacent to a wide character, and to narrow otherwise.

The East_Asian_Width property does not preserve canonical equivalence, because the base characters of canonical decompositions almost always have a different East_Asian_Width than the precomposed characters. East Asian Width is designed for use with legacy character sets so the property value is not designed to respect canonical equivalence.

Modern Rendering Practice. Modern practice is evolving toward rendering ever more of the ambiguous characters with proportionally spaced, narrow forms that rotate with the direction of writing, making a distinction within the legacy character set. In other words, context information beyond the choice of font or source character set is employed to resolve the width of the character. This annex does not attempt to track such changes in practice; therefore, the set of characters with mappings to legacy character sets that have been assigned ambiguous width constitute a superset of the set of such characters that may be rendered as wide characters in a given context. In particular, an application might find it useful to treat characters from alphabetic scripts as narrow by default. Conversely, many of the symbols in the Unicode Standard have no mappings to legacy character sets, yet they may be rendered as “wide” characters if they appear in an East Asian context. An implementation might therefore elect to treat them as ambiguous even though they are classified as neutral here.

5 Recommendations

When mapping Unicode to East Asian legacy character encodings

  • Wide Unicode characters always map to fullwidth characters.
  • Narrow (and neutral) Unicode characters always map to halfwidth characters.
  • Halfwidth Unicode characters always map to halfwidth characters.
  • Ambiguous Unicode characters always map to fullwidth characters.

Emphasis mine.
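
A minimal sketch of the Section 5 mapping quoted above, using Python's standard unicodedata module. The names LEGACY_FORM and legacy_form are made up for illustration, and the entry for the Fullwidth class ("F") is an assumption, since the quoted bullets do not list it separately.

```python
import unicodedata

# East_Asian_Width class -> form in an East Asian legacy encoding,
# following the four bullets quoted from UAX #11 section 5.
LEGACY_FORM = {
    "W": "fullwidth",   # Wide
    "Na": "halfwidth",  # Narrow
    "N": "halfwidth",   # Neutral
    "H": "halfwidth",   # Halfwidth
    "A": "fullwidth",   # Ambiguous
    "F": "fullwidth",   # Fullwidth (assumption: not in the quoted list)
}


def legacy_form(ch: str) -> str:
    """Return the legacy form suggested for a single character."""
    return LEGACY_FORM[unicodedata.east_asian_width(ch)]


# U+2605 BLACK STAR is classified as Ambiguous, so it maps to fullwidth here.
star_form = legacy_form("\u2605")
```

This only covers the mapping to legacy encodings; how a terminal should resolve ambiguous characters in general is a separate question.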

To Reproduce
Steps to reproduce the behavior:

  1. Type コピペ★
  2. See error
    [screenshot]
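
For reference, a small sketch of how the string from step 1 is classified, and how many columns it takes depending on whether ambiguous characters count as one cell or two. The helper name columns is hypothetical, and the sketch ignores combining and control characters.

```python
import unicodedata


def columns(text: str, ambiguous_is_wide: bool) -> int:
    """Count terminal columns, treating W/F as 2 cells and the rest as 1."""
    total = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and ambiguous_is_wide):
            total += 2
        else:
            total += 1
    return total


sample = "\u30b3\u30d4\u30da\u2605"  # the four characters typed in step 1
classes = [unicodedata.east_asian_width(ch) for ch in sample]  # ['W', 'W', 'W', 'A']
narrow_total = columns(sample, ambiguous_is_wide=False)  # 7
wide_total = columns(sample, ambiguous_is_wide=True)     # 8
```

The one-column difference is the cell that the star either does or does not get.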

Environment details

kitty 0.28.1 (877d8d7008) created by Kovid Goyal
Linux debu.tanuki.agency 6.4.10-zen2-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Sun, 13 Aug 2023 01:33:59 +0000 x86_64
Arch Linux 6.4.10-zen2-1-zen (/dev/tty)

DISTRIB_ID="Arch"
DISTRIB_RELEASE="rolling"
DISTRIB_DESCRIPTION="Arch Linux"
Running under: Wayland
Frozen: False
Paths:
  kitty: /usr/bin/kitty
  base dir: /usr/lib/kitty
  extensions dir: /usr/lib/kitty/kitty
  system shell: /bin/bash
Loaded config files:
  /home/fred/.config/kitty/kitty.conf

Config options different from defaults:
bold_font             IBM Plex Sans Mono Bold
bold_italic_font      IBM Plex Sans Mono Bold Italic
cursor_blink_interval 0.0
font_family           IBM Plex Sans Mono
font_size             16.0
force_ltr             True
italic_font           IBM Plex Sans Mono Italic

Important environment variables seen by the kitty process:
	PATH                                /opt/google-cloud-cli/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/home/fred/.dotnet/tools:/var/lib/flatpak/exports/bin:/usr/lib/jvm/default/bin:/usr/lib32/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/rustup/bin:/var/lib/snapd/snap/bin
	LANG                                en_US.UTF-8
	SHELL                               /bin/bash
	GLFW_IM_MODULE                      ibus
	DISPLAY                             :1
	WAYLAND_DISPLAY                     wayland-0
	USER                                fred
	XCURSOR_SIZE                        24
	XDG_CACHE_HOME                      /home/fred/.cache
	XDG_CONFIG_DIRS                     /home/fred/.config/kdedefaults:/etc/xdg
	XDG_CONFIG_HOME                     /home/fred/.config
	XDG_CURRENT_DESKTOP                 KDE
	XDG_DATA_DIRS                       /home/fred/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share:/var/lib/snapd/desktop
	XDG_DATA_HOME                       /home/fred/.local/share
	XDG_RUNTIME_DIR                     /run/user/1000
	XDG_SEAT                            seat0
	XDG_SEAT_PATH                       /org/freedesktop/DisplayManager/Seat0
	XDG_SESSION_CLASS                   user
	XDG_SESSION_DESKTOP                 KDE
	XDG_SESSION_ID                      936
	XDG_SESSION_PATH                    /org/freedesktop/DisplayManager/Session1
	XDG_SESSION_TYPE                    wayland
	XDG_STATE_HOME                      /home/fred/.local/state
	XDG_VTNR                            2

Additional context

  • Reproduces with kitty --config NONE.
@ctrlcctrlv ctrlcctrlv added the bug label Aug 18, 2023
@ctrlcctrlv
Contributor Author

ctrlcctrlv commented Aug 18, 2023

I suggest the following default rules:

An ambiguous-width character is treated as EAW fullwidth when it meets any of these criteria:

  • it is surrounded by two EAW fullwidth characters;
  • it is line-initial or line-terminal; or
  • it is surrounded by whitespace, in which case the whitespace is checked against these criteria recursively.
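
A rough sketch of one way to read these rules, in Python. The function names are hypothetical, and the handling of the recursive whitespace case is my interpretation: the span around the character grows by one cell on each side while both neighbours are whitespace, and the same tests are applied again. Neighbouring ambiguous characters are not resolved first, and combining characters are ignored.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """True for East_Asian_Width classes W (Wide) and F (Fullwidth)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def ambiguous_resolves_wide(line: str, i: int) -> bool:
    """Apply the three suggested criteria to the character at line[i]."""
    lo = hi = i
    while True:
        if lo == 0 or hi == len(line) - 1:
            return True  # line-initial or line-terminal
        left, right = line[lo - 1], line[hi + 1]
        if is_wide(left) and is_wide(right):
            return True  # surrounded by two wide characters
        if left.isspace() and right.isspace():
            lo, hi = lo - 1, hi + 1  # re-check with the whitespace included
            continue
        return False


def line_width(line: str) -> int:
    """Column count for one line under the rules above."""
    total = 0
    for i, ch in enumerate(line):
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            total += 2
        elif eaw == "A" and ambiguous_resolves_wide(line, i):
            total += 2
        else:
            total += 1
    return total
```

Under this reading, the star in コピペ★ is line-terminal and gets two cells, while the star in a★b stays narrow.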

@kovidgoyal
Owner

As far as I know, no terminal programs follow these rules. Changing this in kitty will break things for anyone actually using these characters. Ideally, developers of several major TUI programs should agree to this before it is implemented in kitty. Currently, as far as I know, kitty users have reported no actual issues with ambiguous width characters. Making this change will cause issues whenever the program running in the terminal no longer agrees with kitty on what the width should be.

As such, I am not particularly keen to implement this. If you can point to some other terminal emulators or, better, major terminal programs that have implemented or plan to implement it, I will reconsider.

@ctrlcctrlv
Contributor Author

mlterm follows these rules:
[screenshot]

@ctrlcctrlv
Contributor Author

@kovidgoyal
Owner

There is no way wcwidth can implement the algorithm you describe since it returns widths of characters in isolation. One would need wcswidth for that.
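
To illustrate the point with a sketch: the POSIX wcwidth takes a single wide character, while wcswidth takes a string. The char_width function below is only a stand-in written for this example (it is not libc's wcwidth); it shows that a per-character lookup gives the same answer for the star regardless of its neighbours.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Stand-in for a per-character lookup: it sees only ch itself."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def naive_string_width(text: str) -> int:
    """Summing per-character widths cannot depend on neighbouring characters."""
    return sum(char_width(ch) for ch in text)


between_wide = "\u3042\u2605\u3042"  # star between two wide characters
between_narrow = "a\u2605a"          # star between two narrow characters

# The lookup receives the same single code point in both cases,
# so it has no way to return different widths for the two contexts.
star_between_wide = char_width(between_wide[1])
star_between_narrow = char_width(between_narrow[1])
```

A context-dependent rule would need a function that receives the whole string (or at least the neighbours), which is the wcswidth shape.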

@ctrlcctrlv
Contributor Author

I did not name the wcwidth-cjk repo.

WerWolv added a commit to WerWolv/ImHex that referenced this issue May 1, 2024
- Better argument parsing
- Allow processing all language folders at the same time
- Allow an optional reference language when translating
- Save translations on KeyboardInterrupt
- Fixes a ooold input issues by importing readline
(kovidgoyal/kitty#6560)
- Add untranslate mode to remove translations by a key regex

---------

Co-authored-by: Nik <werwolv98@gmail.com>