Move HiStrCellsCount etc implementation to backend side #2378

unxed · 2024-09-07T14:09:59Z

int HiStrCellsCount(const wchar_t *Str);
int HiFindRealPos(const wchar_t *Str, int Pos, BOOL ShowAmp);
int HiFindNextVisualPos(const wchar_t *Str, int Pos, int Direct);
FARString &HiText2Str(FARString &strDest, const wchar_t *Str);
...
size_t StrCellsCount(const wchar_t *pwz, size_t nw)
size_t StrZCellsCount(const wchar_t *pwz)
size_t StrSizeOfCells(const wchar_t *pwz, size_t n, size_t &ng, bool round_up)
size_t StrSizeOfCell(const wchar_t *pwz, size_t n)
static struct TruncReplacement
void StrCellsTruncateLeft(wchar_t *pwz, size_t &n, size_t ng)
void StrCellsTruncateRight(wchar_t *pwz, size_t &n, size_t ng)
void StrCellsTruncateCenter(wchar_t *pwz, size_t &n, size_t ng)

All these functions rely on theoretically predicted visual sizes of grapheme clusters. However, our predictions do not account for the fact that adjacent grapheme clusters may display differently. In Unicode, there is no concept of character width outside of the string context in which it appears. Therefore, our approach cannot work perfectly even in theory. Additionally, the specific font affects the width, and in the case of a terminal, we don’t even have a rough idea of what the font is.

Instead, I propose a different approach. In the wx backend, we can ask wx itself for the pixel width needed to render a string, and then divide it by the pixel width of a cell, obtaining the minimum number of cells required to display the string.
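
The division step needs to round up, since a string that spills even one pixel into the next cell still occupies that cell. A minimal sketch of the arithmetic, assuming the pixel width has already been obtained from wx (for example via a text-extent query) and that the cell width is known; `cells_for_pixel_width` is a hypothetical name, not an existing far2l function:

```python
def cells_for_pixel_width(text_width_px, cell_width_px):
    """Return the minimum number of cells that can hold text_width_px pixels."""
    if cell_width_px <= 0:
        raise ValueError("cell width must be positive")
    # Ceiling division using integers only: -(-a // b) == ceil(a / b) for b > 0
    return -(-text_width_px // cell_width_px)
```

For example, a 17-pixel string on an 8-pixel-wide cell grid needs 3 cells, not 2.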

In the tty backend, we can output the string outside the terminal window's space (above it, for instance), and by measuring the cursor's position after output, determine the number of screen cells occupied from the terminal's perspective. Performance can be significantly improved here by excluding this operation for strings that contain only characters guaranteed to have a width of 1 in all terminals and rendering libraries, such as Latin-1 characters, box-drawing characters, or Cyrillic characters.

In both cases, it is possible to retain the ability to revert to the current algorithm if performance issues arise.

UPD: guaranteed 1-cell-sized chars for fast size calculation:

0x0020-0x00AC
0x00AE-0x02FF
0x0370-0x0482
0x048A-0x0590
0x05BE-0x05FF
0x2500-0x257F
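
A rough sketch of how the fast path could look, with the ranges copied verbatim from the list above. The names `SINGLE_CELL_RANGES`, `is_single_cell` and `fast_cells_count` are hypothetical. Note that the first range as written also spans the control characters U+007F-U+009F, which a real implementation would probably want to exclude:

```python
# Inclusive code point ranges, taken as-is from the list above.
SINGLE_CELL_RANGES = (
    (0x0020, 0x00AC),
    (0x00AE, 0x02FF),
    (0x0370, 0x0482),
    (0x048A, 0x0590),
    (0x05BE, 0x05FF),
    (0x2500, 0x257F),
)


def is_single_cell(ch):
    """True if the code point falls into one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SINGLE_CELL_RANGES)


def fast_cells_count(text):
    """Return the cell count if it can be computed without asking the backend.

    Returns None when the string contains anything outside the listed ranges,
    meaning the caller has to fall back to the slow backend measurement.
    """
    if all(is_single_cell(ch) for ch in text):
        return len(text)
    return None
```

Returning `None` rather than a guess keeps the decision to call the backend explicit at the call site.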

unxed · 2024-09-07T18:45:39Z

So far, here are the ideas:

Write a string length (in screen cells) measurement function for both backends (in wx through rendering string into some buffer and measuring the resulting width; in console by outputting off-screen and measuring the cursor offset in screen cells by corresponding ESC sequences) and add it to the backend API.
Write a grapheme cluster splitter.
Refactor all code that currently uses IsCharFullWidth/IsCharPrefix/IsCharSuffix/IsCharXxxfix to use these two functions.
Add BiDi support through FriBidi; its amd64 binary is just 40 KB in size.

In theory, this approach should provide perfect rendering in any terminal capable of at least splitting into grapheme clusters in the same way we do, which of course requires up-to-date data from unicode.org, but at least it doesn’t depend on the font being used.

unxed · 2024-09-08T20:41:44Z

Here is a sample script that uses just terminal interaction without any unicode libs to
— split a string into grapheme clusters
— measure length of a string in terminal screen cells

segmentation.py

import sys
import os

# Works instantly in Konsole, WezTerm, far2l terminal.
# Not too fast yet in GNOME Terminal and kovidgoyal's kitty.

def get_cursor_position():
    """Requests the current cursor position in the terminal."""
    sys.stdout.write('\033[6n')
    sys.stdout.flush()

    response = ""
    while True:
        ch = sys.stdin.read(1)
        response += ch
        if ch == 'R':  # Response ends with 'R'
            break

    try:
        _, position = response.split('[')
        row, col = map(int, position[:-1].split(';'))
        return row, col
    except ValueError:
        return -1, -1

def split_into_graphemes(text):
    """Splits a string into grapheme clusters by watching how far the cursor moves after each code point.

    Assumes the text fits on the current terminal line (no wrapping).
    """
    grapheme_clusters = []
    grapheme_widths = []

    if os.name == 'posix':
        # Disable line buffering and echo so the terminal's reply can be read directly
        os.system('stty -icanon -echo')

    try:
        _, prev_col = get_cursor_position()

        for char in text:
            sys.stdout.write(char)
            sys.stdout.flush()

            _, col = get_cursor_position()

            # If the cursor moved, the code point starts a new cluster.
            # The very first code point always starts one, even if it has zero width.
            if col != prev_col or not grapheme_clusters:
                grapheme_clusters.append(char)
                grapheme_widths.append(col - prev_col)  # Save the cluster width
                prev_col = col
            else:
                # No movement: the code point combines with the current cluster
                grapheme_clusters[-1] += char
    finally:
        # Restore the terminal even if reading the cursor position fails
        if os.name == 'posix':
            os.system('stty sane')

    return grapheme_clusters, grapheme_widths

def main():
    text = "a1🙂❤️ё-á🇷🇺😊"
    clusters, widths = split_into_graphemes(text)

    # Post-processing clusters and widths to combine VARIATION SELECTOR code points
    # Needed for correct operation in Konsole and kovidgoyal's kitty
    processed_clusters = []
    processed_widths = []
    i = 0
    while i < len(clusters):
        # Check for VARIATION SELECTOR
        if i + 1 < len(clusters) and len(clusters[i+1]) == 1 and 0xFE00 <= ord(clusters[i+1]) <= 0xFE0F:
            processed_clusters.append(clusters[i] + clusters[i+1])
            processed_widths.append(widths[i] + widths[i+1])  # Sum the widths
            i += 2
        else:
            processed_clusters.append(clusters[i])
            processed_widths.append(widths[i])
            i += 1

    clusters = processed_clusters
    widths = processed_widths

    print("\n\nGrapheme clusters and their widths:")

    total_width = 0

    for cluster, width in zip(clusters, widths):
        separator = " - " if width == 1 else "- "  # Choose separator
        #cluster_width = len(cluster)  # Calculate the width of the cluster string in characters
        #print(f'"{cluster}"{separator}{width} {cluster_width}')
        print(f'"{cluster}"{separator}{width}')
        total_width += width

    print(f"\nString length in screen cells: {total_width}")  # Output the total sum


if __name__ == "__main__":
    main()

unxed · 2024-09-08T21:53:48Z

Here is a sample script that uses just terminal interaction without any unicode libs

NB! Far3 uses just the same logic:

FarGroup/FarManager@272cee2
FarGroup/FarManager@e39af49

Search for IsWidePreciseExpensive or GetWidthPreciseExpensive.

unxed · 2024-09-10T09:04:53Z

This was actually invented here: https://github.com/magiblot/tvision/blob/master/source/platform/winwidth.cpp

This link was posted to the FAR forum a long time ago. They ignored it then, but now they finally did it.

Source: https://t.me/FarManager/15298

unxed · 2024-09-10T09:36:30Z

magiblot/tvision@6f2acd9

unxed · 2024-09-15T10:51:15Z

Some reasonable remarks:

I'm afraid that solution is only feasible on Windows. In order to do that in a Unix terminal:

Turbo Vision uses the alternate screen buffer, which results in scrollback being disabled in most terminal emulators. You would have to print the characters you want to measure in the same screen area where your application is being drawn. For example: if you tried to print an emoji, then measure the cursor movement, then move the cursor back to its initial position, and then overwrite the emoji with the characters that used to be in that part of the screen, it is very likely that the terminal emulator would display the emoji on screen for some time.

Even if the characters being measured didn't appear on screen, or if that wasn't an issue, the performance of the whole process would be very poor.

Even if the above weren't an issue, the input stream used for reading the terminal state is the same that's used for reading user input. Therefore, you would have to either ignore user input while measuring text width, or write code that is able to keep the input events that are received while measuring text width. And then you would also have to consider the risk of waiting forever for an answer from the terminal...

So, in my opinion, you would end up with a poor experience for both the user and the programmer.

magiblot/tvision#51 (comment)
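
The last remark (shared input stream, risk of waiting forever) could be approached roughly like this. This is only a sketch under assumptions: POSIX, raw bytes read from a file descriptor, and the terminal replying to `ESC [ 6 n` with `ESC [ row ; col R`. The names `extract_cpr` and `read_cpr` and the default timeout are hypothetical. Bytes that are not part of the reply are handed back to the caller instead of being dropped, so user input typed during the measurement is preserved:

```python
import os
import re
import select
import time

# Cursor position report: ESC [ <row> ; <col> R
CPR_RE = re.compile(rb'\x1b\[(\d+);(\d+)R')


def extract_cpr(buffer):
    """Split a byte buffer into ((row, col), other_bytes).

    Returns (None, buffer) if no complete reply is present yet.
    """
    m = CPR_RE.search(buffer)
    if m is None:
        return None, buffer
    position = (int(m.group(1)), int(m.group(2)))
    return position, buffer[:m.start()] + buffer[m.end():]


def read_cpr(fd, timeout=0.2):
    """Read from fd until a reply arrives or `timeout` seconds have passed.

    Returns (position_or_None, bytes_that_were_not_part_of_the_reply).
    """
    deadline = time.monotonic() + timeout
    buffer = b''
    while True:
        position, rest = extract_cpr(buffer)
        if position is not None:
            return position, rest
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None, buffer
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            return None, buffer  # timed out: the terminal never answered
        chunk = os.read(fd, 64)
        if not chunk:
            return None, buffer  # end of input
        buffer += chunk
```

On timeout the caller can fall back to the current prediction-based algorithm; the leftover bytes would then be fed to the normal input parser. This does not address the other two remarks (visible flicker and overall performance).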

unxed · 2024-11-25T21:28:30Z

iTerm2 has an ESC sequence for specifying the Unicode version used for character widths, with detection:
magiblot/tvision#51 (comment)

unxed mentioned this issue Sep 9, 2024

Unicode issues left — metabug #2157

Open

unxed mentioned this issue Sep 10, 2024

Konsole: overly wide Unicode characters mess up intended layouts magiblot/tvision#51

Closed
