Move HiStrCellsCount etc implementation to backend side #2378

Open
unxed opened this issue Sep 7, 2024 · 7 comments

@unxed
Contributor

unxed commented Sep 7, 2024

int HiStrCellsCount(const wchar_t *Str);
int HiFindRealPos(const wchar_t *Str, int Pos, BOOL ShowAmp);
int HiFindNextVisualPos(const wchar_t *Str, int Pos, int Direct);
FARString &HiText2Str(FARString &strDest, const wchar_t *Str);
...
size_t StrCellsCount(const wchar_t *pwz, size_t nw)
size_t StrZCellsCount(const wchar_t *pwz)
size_t StrSizeOfCells(const wchar_t *pwz, size_t n, size_t &ng, bool round_up)
size_t StrSizeOfCell(const wchar_t *pwz, size_t n)
static struct TruncReplacement
void StrCellsTruncateLeft(wchar_t *pwz, size_t &n, size_t ng)
void StrCellsTruncateRight(wchar_t *pwz, size_t &n, size_t ng)
void StrCellsTruncateCenter(wchar_t *pwz, size_t &n, size_t ng)

All these functions rely on theoretically predicted visual sizes of grapheme clusters. However, our predictions do not account for the fact that adjacent grapheme clusters may display differently. In Unicode, there is no concept of character width outside of the string context in which it appears. Therefore, our approach cannot work perfectly even in theory. Additionally, the specific font affects the width, and in the case of a terminal, we don’t even have a rough idea of what the font is.

Instead, I propose a different approach. In the wx backend, we can ask wx itself for the pixel width needed to render a string, and then divide it by the pixel width of a cell, obtaining the minimum number of cells required to display the string.
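
The arithmetic of that proposal can be sketched as follows. This covers only the division step: the function name and its parameters are hypothetical, and the pixel width itself would have to come from wx's own text measurement, which is not shown here.

```python
def cells_for_pixel_width(text_width_px: int, cell_width_px: int) -> int:
    """Minimum number of whole cells needed to cover text_width_px pixels."""
    if cell_width_px <= 0:
        raise ValueError("cell width must be positive")
    # Ceiling division without floating point: -(-a // b)
    return -(-text_width_px // cell_width_px)
```

For example, a string rendered 17 pixels wide with 8-pixel cells needs 3 cells, because a partially covered cell still has to be reserved.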

In the tty backend, we can output the string outside the terminal window's space (above it, for instance), and by measuring the cursor's position after output, determine the number of screen cells occupied from the terminal's perspective. Performance can be significantly improved here by excluding this operation for strings that contain only characters guaranteed to have a width of 1 in all terminals and rendering libraries, such as Latin-1 characters, box-drawing characters, or Cyrillic characters.

In both cases, it is possible to retain the ability to revert to the current algorithm if performance issues arise.
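
A minimal sketch of such a fallback, assuming the backend exposes some measurement callable. `backend_measure` and both function names are hypothetical, and `predicted_cells` is only a stand-in built on Python's `unicodedata`, not far2l's actual algorithm.

```python
import unicodedata

def predicted_cells(text: str) -> int:
    """Stand-in for the current table-based estimate (not far2l's actual code)."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are assumed to occupy no cell of their own
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

def measure_cells(text: str, backend_measure=None) -> int:
    """Prefer the backend's measurement; fall back to the prediction."""
    if backend_measure is not None:
        try:
            cells = backend_measure(text)
            if cells is not None and cells >= 0:
                return cells
        except (OSError, TimeoutError):
            pass  # backend unavailable or too slow: use the prediction instead
    return predicted_cells(text)
```

Passing `backend_measure=None` keeps today's behavior, so the new path could be switched off without touching the callers.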

UPD: guaranteed 1-cell-sized chars for fast size calculation:

0x0020-0x00AC
0x00AE-0x02FF
0x0370-0x0482
0x048A-0x0590
0x05BE-0x05FF
0x2500-0x257F
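
A sketch of how that fast path might look, using the ranges listed above verbatim. The function names are hypothetical, and returning `None` is just one possible way to signal that the slower backend measurement is needed.

```python
# Inclusive code point ranges copied from the list above.
SINGLE_CELL_RANGES = (
    (0x0020, 0x00AC),
    (0x00AE, 0x02FF),
    (0x0370, 0x0482),
    (0x048A, 0x0590),
    (0x05BE, 0x05FF),
    (0x2500, 0x257F),
)

def is_single_cell(ch: str) -> bool:
    """True if the character falls into one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SINGLE_CELL_RANGES)

def fast_cells_count(text: str):
    """Return len(text) if every character is in a listed range, else None."""
    if all(is_single_cell(ch) for ch in text):
        return len(text)
    return None  # the caller has to measure through the backend
```

With this, a string such as "Hello, мир" is counted without any terminal round trip, while any string containing an emoji falls through to the real measurement.
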
@unxed
Contributor Author

unxed commented Sep 7, 2024

So far, here are the ideas:

  • Write a function that measures string length in screen cells for both backends (in wx, by rendering the string into a buffer and measuring the resulting width; in the console, by printing it off-screen and measuring the cursor offset in screen cells with the corresponding ESC sequences), and add it to the backend API.

  • Write a grapheme cluster splitter.

  • Refactor all code that currently uses IsCharFullWidth/IsCharPrefix/IsCharSuffix/IsCharXxxfix to use these two functions.

  • Add BiDi support through FriBidi; its amd64 binary is just 40 KB in size.

In theory, this approach should provide perfect rendering in any terminal capable of at least splitting into grapheme clusters in the same way we do, which of course requires up-to-date data from unicode.org, but at least it doesn’t depend on the font being used.

@unxed
Contributor Author

unxed commented Sep 8, 2024

Here is a sample script that uses only terminal interaction, without any Unicode libraries, to:

  • split a string into grapheme clusters
  • measure the length of a string in terminal screen cells

segmentation.py

import sys
import os

# Works instantly in Konsole, WezTerm, far2l terminal.
# Not too fast yet in GNOME Terminal and kovidgoyal's kitty.

def get_cursor_position():
    """Requests the current cursor position in the terminal."""
    sys.stdout.write('\033[6n')
    sys.stdout.flush()

    response = ""
    while True:
        ch = sys.stdin.read(1)
        response += ch
        if ch == 'R':  # Response ends with 'R'
            break

    try:
        _, position = response.split('[')
        row, col = map(int, position[:-1].split(';'))
        return row, col
    except ValueError:
        return -1, -1

def split_into_graphemes(text):
    """Splits a string into grapheme clusters by watching how the cursor moves after each code point."""
    grapheme_clusters = []
    grapheme_widths = []
    current_cluster = ""
    
    if os.name == 'posix':
        os.system('stty -icanon -echo')

    _, start_col = get_cursor_position()
    prev_col = start_col

    for char in text:
        sys.stdout.write(char)
        sys.stdout.flush()

        _, col = get_cursor_position()

        # The cursor stays in place when the terminal merges the character into the current cluster
        cursor_moved = col != start_col

        # If the cursor moved, assume the end of the previous cluster
        if cursor_moved:
            if current_cluster:
                grapheme_clusters.append(current_cluster)

            grapheme_widths.append(col - prev_col)  # Save the cluster width
            prev_col = col

            current_cluster = char
            start_col = col  # Update the starting column for the new cluster
        else:
            current_cluster += char

    if current_cluster:
        # The width of this cluster was already recorded when its first character moved the cursor
        grapheme_clusters.append(current_cluster)

    if os.name == 'posix':
        os.system('stty sane')

    return grapheme_clusters, grapheme_widths

def main():
    text = "a1🙂❤️ё-á🇷🇺😊"
    clusters, widths = split_into_graphemes(text)

    # Post-processing clusters and widths to combine VARIATION SELECTOR code points
    # Needed for correct operation in Konsole and kovidgoyal's kitty
    processed_clusters = []
    processed_widths = []
    i = 0
    while i < len(clusters):
        # Check for VARIATION SELECTOR
        if i + 1 < len(clusters) and len(clusters[i+1]) == 1 and 0xFE00 <= ord(clusters[i+1]) <= 0xFE0F:
            processed_clusters.append(clusters[i] + clusters[i+1])
            processed_widths.append(widths[i] + widths[i+1])  # Sum the widths
            i += 2
        else:
            processed_clusters.append(clusters[i])
            processed_widths.append(widths[i])
            i += 1

    clusters = processed_clusters
    widths = processed_widths

    print("\n\nGrapheme clusters and their widths:")

    total_width = 0

    for cluster, width in zip(clusters, widths):
        separator = " - " if width == 1 else "- "  # Choose separator
        #cluster_width = len(cluster)  # Calculate the width of the cluster string in characters
        #print(f'"{cluster}"{separator}{width} {cluster_width}')
        print(f'"{cluster}"{separator}{width}')
        total_width += width

    print(f"\nString length in screen cells: {total_width}")  # Output the total sum


if __name__ == "__main__":
    main()

@unxed
Contributor Author

unxed commented Sep 8, 2024

Here is a sample script that uses just terminal interaction without any unicode libs

NB: Far3 uses the same logic:

FarGroup/FarManager@272cee2
FarGroup/FarManager@e39af49

Search for IsWidePreciseExpensive or GetWidthPreciseExpensive.

@unxed
Contributor Author

unxed commented Sep 10, 2024

This was actually invented here: https://github.com/magiblot/tvision/blob/master/source/platform/winwidth.cpp

This link was posted to the FAR forum a long time ago. They ignored it then, but now they finally did it.

Source: https://t.me/FarManager/15298

@unxed
Contributor Author

unxed commented Sep 10, 2024

magiblot/tvision@6f2acd9

Screenshot_20240910_113426_Firefox

@unxed
Contributor Author

unxed commented Sep 15, 2024

Some reasonable remarks:

I'm afraid that solution is only feasible on Windows. In order to do that in a Unix terminal:

  • Turbo Vision uses the alternate screen buffer, which results in scrollback being disabled in most terminal emulators. You would have to print the characters you want to measure in the same screen area where your application is being drawn. For example: if you tried to print an emoji, then measure the cursor movement, then move the cursor back to its initial position, and then overwrite the emoji with the characters that used to be in that part of the screen, it is very likely that the terminal emulator would display the emoji on screen for some time.
  • Even if the characters being measured didn't appear on screen, or if that wasn't an issue, the performance of the whole process would be very poor.
  • Even if the above weren't an issue, the input stream used for reading the terminal state is the same that's used for reading user input. Therefore, you would have to either ignore user input while measuring text width, or write code that is able to keep the input events that are received while measuring text width. And then you would also have to consider the risk of waiting forever for an answer from the terminal...

So, in my opinion, you would end up with a poor experience for both the user and the programmer.

magiblot/tvision#51 (comment)
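
Two of the concerns above, the shared input stream and the risk of waiting forever, can at least be bounded. The following is a minimal sketch, assuming a POSIX system where `select` works on the terminal's file descriptor. All names are hypothetical; it shows only the reading side and does not address the flicker or performance objections.

```python
import os
import re
import select
import time

# Cursor Position Report sent by the terminal: ESC [ row ; col R
CPR_PATTERN = re.compile(rb'\x1b\[(\d+);(\d+)R')

def extract_cursor_report(buffer):
    """Return ((row, col), leftover), or (None, buffer) if no report is present.

    Bytes around the report are treated as user input and handed back.
    """
    match = CPR_PATTERN.search(buffer)
    if match is None:
        return None, buffer
    position = (int(match.group(1)), int(match.group(2)))
    leftover = buffer[:match.start()] + buffer[match.end():]
    return position, leftover

def read_cursor_report(fd, timeout=0.2):
    """Wait up to `timeout` seconds for a cursor report on file descriptor `fd`.

    Returns (position, pending_input); position is None on timeout or EOF.
    """
    deadline = time.monotonic() + timeout
    buffer = b''
    while True:
        position, leftover = extract_cursor_report(buffer)
        if position is not None:
            return position, leftover
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None, buffer
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            return None, buffer  # the terminal did not answer in time
        chunk = os.read(fd, 64)
        if not chunk:
            return None, buffer  # end of input
        buffer += chunk
```

The caller would feed `pending_input` back into its normal key handling, so keystrokes typed during a measurement are delayed rather than lost.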

@unxed
Contributor Author

unxed commented Nov 25, 2024

iTerm2 has an ESC sequence for specifying the Unicode version used for characters, with detection:
magiblot/tvision#51 (comment)
