
Proposal: Select by character or display indices #2724

Closed
eraserhd opened this issue Feb 5, 2019 · 8 comments

Comments

@eraserhd
Contributor

eraserhd commented Feb 5, 2019

Expansions such as %val{selection_desc} expose columns as byte indices within the line, and commands such as select take selections in terms of byte indices within the line.

This makes integrations hard. In my experience, byte-index columns aren't useful outside Kakoune. So far I have Unicode issues in parinfer-rust that I will have to fix by converting coordinates to characters, and Unicode issues in my new selection tool built on the same approach.

The selection tool is written in Clojure and therefore uses Java strings, so reading the file produces UTF-16 characters. The UTF-8 byte index can't be used directly for conversion, and if a character has multiple possible UTF-8 encodings, I'd have to guess which one occurred in order to interpret the bytes.

Further, converting these requires extra context from the buffer. If I have a $kak_selection that starts at a column greater than 1, I need the entire text of the starting line in addition to the text of the selection itself. We don't have access to buffer text outside of selections unless we run some keys in draft mode and save the value to an option.
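To illustrate why the full line text is needed, here is a minimal sketch (Python, with a hypothetical helper name): converting a 1-based UTF-8 byte column into a codepoint column requires decoding everything on the line before that column.

```python
def byte_col_to_char_col(line: str, byte_col: int) -> int:
    """Convert a 1-based UTF-8 byte column into a 1-based codepoint column.

    Requires the full line text up to the column -- the conversion
    cannot be done from the selection text alone.
    """
    raw = line.encode("utf-8")
    # Decode the bytes preceding the column and count the codepoints.
    return len(raw[: byte_col - 1].decode("utf-8")) + 1

# "é" is two bytes in UTF-8, so byte column 3 is codepoint column 2.
byte_col_to_char_col("école", 3)  # -> 2
```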

So, I propose that byte indices not be exposed for scripting. Instead, all expansions and commands would use character indices. The user could still split diacritics and other compound characters, but not multibyte character encodings.

No matter how it is sliced, this would be a fair amount of work, and there are a couple of ways to do it. One is to audit everything at the user-interface layer (commands and expansions) and ensure they provide the right data. Another is to encapsulate this at the buffer layer and not expose byte indices to the rest of Kakoune.

IMHO, both add internal complexity, and both reduce interface complexity. The latter contains the complexity better, but is harder. It would be nice to figure out how to break it down into smaller steps.

(Note: It would be possible to make parallel expansions that use char indices - some already exist - and add options to commands like select. I'm proposing the big change here because I think it'll be better for Kakoune.)

@Screwtapello
Contributor

This affects kak-lsp too: kakoune-lsp/kakoune-lsp#98

"character" is kind of a hazy concept in the world of Unicode, and the official documentation tries to avoid it. Some common alternatives include:

  • bytes in a particular encoding, as Kakoune does, so U+0041 LATIN CAPITAL LETTER A is one byte, U+0107 LATIN SMALL LETTER C WITH ACUTE is two, and U+1F404 COW is four.
  • UTF-16 code units, as Java and Win32 do, so U+0041 is one unit, U+0107 is also one unit, and U+1F404 is two.
  • UTF-32 code units, or Unicode code points, so U+0041, U+0107 and U+1F404 are one code point each, but U+0063 LATIN SMALL LETTER C followed by U+0301 COMBINING ACUTE is two code-points, even though it looks the same as U+0107.
  • Grapheme clusters, so U+0041, U+0107, U+0063 U+0301, and U+1F404 are all one cluster each
  • Character cells, so U+0041, U+0107, and U+0063 U+0301 are each one cell, but U+1F404 is two, and so is U+FF21 FULLWIDTH LATIN CAPITAL LETTER A, while U+0301 on its own is zero.
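The differences between these counting schemes can be seen directly. A quick Python sketch (Python strings are sequences of codepoints, so `len` counts codepoints, and encoding gives the byte or code-unit counts):

```python
s = "\u0107"        # U+0107 LATIN SMALL LETTER C WITH ACUTE, precomposed
t = "c\u0301"       # U+0063 + U+0301 combining acute -- looks identical
cow = "\U0001F404"  # U+1F404 COW

print(len(s.encode("utf-8")))             # UTF-8 bytes: 2
print(len(s.encode("utf-16-le")) // 2)    # UTF-16 code units: 1
print(len(s))                             # codepoints: 1
print(len(t))                             # codepoints: 2

print(len(cow.encode("utf-8")))           # UTF-8 bytes: 4
print(len(cow.encode("utf-16-le")) // 2)  # UTF-16 code units: 2 (surrogate pair)
print(len(cow))                           # codepoints: 1
```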

Kakoune's "UTF-8 bytes" coordinates are great for other tools that use UTF-8 internally, but not great for tools that use other encodings.

"code units" coordinates would likewise be great for tools that use UTF-16 internally, but not great for tools that use UTF-8. It also wouldn't be great for tools that use UTF-32, but the only difference would be astral plane characters (U+10000 and above) which aren't that common so maybe it would still be worthwhile.

"codepoints" is a nice encoding-agnostic system, but for selections it has the same problems as the current "UTF-8 bytes": since codepoints are not fixed-size in most encodings, you still need the original buffer content to convert codepoint-coordinates into UTF-8 bytes or code-units that you program can work with. Also, if these selection coordinates are visible in the UI, it would be weird that ć is one unit wide while is two, even though they look identical.

"grapheme clusters" is encoding-agnostic, but has the same "needs the whole buffer content" issues as UTF-8 bytes, code units, and codepoints. It also requires a copy of the Unicode character database to determine the edge of each cluster, which can make for Fun Times when (for example) Kakoune is using data from libc, and a plugin is using data from the JRE, and the two data sources disagree. It does make more sense in the UI, though.

"character cells" is basically like "grapheme clusters", except that it's not strictly-speaking defined by the Unicode standard. Unicode defines particular character cell widths for some characters, but not all characters, so implementors have to guess. See the comment at the top of Markus Kuhn's wcwidth.c, and issues like jquast/wcwidth#8.

@mawww
Owner

mawww commented Feb 11, 2019

Hello,

So first, one thing to keep in mind is that Kakoune does not require the buffer content to be Unicode; it will interpret it as UTF-8, but should tolerate non-UTF-8 contents (although there are no strong guarantees about what it will do with them).

That means that giving anything other than byte offsets is going to be ambiguous on non-UTF-8 buffer contents, as there is no well-defined method that I know of for handling invalid UTF-8 text.

I think we can all agree that, in retrospect, the choice of UTF-16 by Java and Windows was a mistake: UTF-16 lost, UTF-8 won. I would be tempted to say that tools using another encoding are the ones that need fixing. If your language of choice's strings enforce UTF-16 (or any encoding, including UTF-8), maybe using strings to store the buffer contents is not a good idea and you should be using a byte array instead... Unfortunately, that does not change the status quo; the language server protocol is going to be using UTF-16 for the time being.

Internally, Kakoune actually uses 3 different horizontal coordinates: bytes, characters, and columns.

  • bytes are what we've been discussing so far.
  • characters is misnamed and actually means codepoints.
  • columns is display related, using whatever libc wcwidth returns to compute a column position.

We could relatively easily expose any of those, but as described by @Screwtapello, none is entirely satisfying by itself.

One additional complication is timestamping. Kakoune accepts input that does not match the current buffer state (say, for a ranges-highlighter input) and updates those coordinates using the buffer changes vector (which tracks modifications made to the buffer). But those changes only give information at the byte level; supporting anything else would mean storing not only the count of bytes added/removed, but also the count of columns and/or codepoints... I am not really looking forward to that.
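The replay idea can be sketched as follows (a hypothetical simplification in Python, not Kakoune's actual data structures): each recorded change carries only a byte position and a byte delta, which is exactly why byte coordinates are the only ones that can be updated cheaply.

```python
def replay(byte_offset: int, changes: list[tuple[int, int]]) -> int:
    """Update a byte offset through a list of (position, byte_delta) changes.

    Works because deltas are recorded in bytes; replaying codepoint or
    column coordinates would require storing those counts per change too.
    """
    for pos, delta in changes:
        if pos <= byte_offset:
            # Shift the offset, clamping so a deletion spanning it
            # leaves it at the deletion point.
            byte_offset = max(pos, byte_offset + delta)
    return byte_offset

# Insert 3 bytes at offset 5, then delete 2 bytes at offset 1:
replay(10, [(5, 3), (1, -2)])  # -> 11
```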

All that to say, I have no real solution to this problem. I am unconvinced that any of the alternatives to the status quo is significantly better, and I think the status quo is the most robust solution, because it's the only one that is unambiguous. But that means we cannot solve the mismatched-encoding problem from within Kakoune, and other tools need to be fixed to stop enforcing an encoding on arbitrary bytes (because frankly, there is no technical reason to do that; LSP uses UTF-16 because the designers were lazy and exposed a VSCode implementation detail to the world).

@eraserhd
Contributor Author

  1. Is there an orthogonal way to expose these coordinates? Like, parsing could accept 1.5b, 1.5p, or 1.5c for example, and expansions could be selection_bytes_coordinate, selection_codepoints_coordinate, and so forth?

  2. Would you accept a patch renaming characters to codepoints in the source?

@eraserhd
Contributor Author

Oh, I skipped part of my reaction, which is: alright, I get that we'll want byte coordinates for binary-file reasons.

@eraserhd
Contributor Author

eraserhd commented Feb 12, 2019

OK... updated proposal:

  1. Rename characters to codepoints internally and in exposed expansions (breaking change).
  2. Add %val{selection_desc_in_codepoints}
  3. Add -codepoints option to select.

2 and 3 are what I minimally need so that select-nrepl does not have to convert UTF-8 byte offsets to codepoints and back (we have exposed the cursor, but not the anchor, in codepoints). 1 is for consistency.

To fully fix parinfer-rust's Unicode issues, I'd also need to add %val{selection_desc_in_columns} and a -columns option to select, though the above will already make things much better.

@eraserhd eraserhd changed the title Proposal: Don't expose byte indices as columns Proposal: Select by character or display indices Oct 28, 2019
@eraserhd
Contributor Author

I've changed the title to be more current. I've started working on this. Currently, it looks like this:

Column values given to :select can be suffixed with b (bytes), c (characters), or d (display). The suffix can be omitted, in which case it is bytes. E.g.:

:select -timestamp 42 3.3b,4.1c 4.7d,4.9d 5.7,6.9
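The suffix syntax described above could be parsed along these lines (a Python sketch with hypothetical names; the actual implementation would be in Kakoune's C++ command parsing):

```python
import re

# line.column with an optional unit suffix: b (bytes), c (characters),
# or d (display); no suffix means bytes.
COORD = re.compile(r"^(\d+)\.(\d+)([bcd]?)$")

def parse_coord(text: str) -> tuple[int, int, str]:
    """Parse 'line.column[suffix]' into (line, column, unit)."""
    m = COORD.match(text)
    if not m:
        raise ValueError(f"bad coordinate: {text!r}")
    line, col, unit = m.groups()
    return int(line), int(col), unit or "b"

parse_coord("4.1c")  # -> (4, 1, 'c')
parse_coord("5.7")   # -> (5, 7, 'b')
```

Because the unit travels with each coordinate, a tool can mix units within one :select call, rather than committing the whole command to a single unit via an option.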

Using suffixes instead of a -codepoints option is useful for parinfer-rust, which only modifies the main selection. Using an option would require parinfer-rust and similar utilities to translate the other coordinates somehow (defeating this feature's usefulness), or new options to :select which allow replacing only some selections.

The biggest complication to implementing this is when -timestamp is not the latest timestamp, since the byte locations need to be resolved from display or character columns against a previous version of the buffer. How useful is this case?

@Screwtapello
Contributor

c (characters)

I hope you mean "codepoints" here.

How useful is this case?

I feel like it should be pretty important for external, asynchronous processes like parinfer-rust and kak-lsp, since the user might have done more stuff between the message being sent and the response received. Does select -timestamp share a code-path with <z> to restore selections from a register? In that case, applying selections made much earlier would be very important.

@eraserhd
Contributor Author

I hope you mean "codepoints" here.

Yes

eraserhd added a commit to eraserhd/kakoune that referenced this issue Nov 10, 2019
@mawww mawww closed this as completed in e964b68 Nov 12, 2019