
Proposal: Select by character or display indices #2724

Closed
eraserhd opened this issue Feb 5, 2019 · 8 comments

Comments

@eraserhd
Contributor

eraserhd commented Feb 5, 2019

Expansions such as %val{selection_desc} expose columns as byte indices within the line, and commands such as select take selections in terms of byte indices within the line.

This makes integrations hard. In my experience, byte-index columns aren't useful outside Kakoune. So far I have Unicode issues in parinfer-rust that I will have to fix by converting coordinates to characters, and Unicode issues in my new selection tool built on the same approach.

The selection tool is written in Clojure and therefore uses Java strings, so reading the file produces UTF-16 characters. The UTF-8 byte index can't be used directly for conversion, and if a character has multiple possible UTF-8 encodings, I'd have to guess which one occurred in order to interpret the bytes.

Further, converting these requires extra context from the buffer. If I have a $kak_selection that starts at a column greater than 1, I need the entire text of the starting line in addition to the text of the selection itself. We don't have access to buffer text outside of selections unless we run some keys in draft mode and save the value to an option.
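To illustrate why the full line text is needed, here is a minimal sketch (Python, with a hypothetical helper name): converting a 1-based UTF-8 byte column into a codepoint column requires decoding everything on the line before that column.

```python
def byte_col_to_char_col(line: str, byte_col: int) -> int:
    """Convert a 1-based UTF-8 byte column into a 1-based codepoint column.

    Requires the full line text up to the column -- the conversion
    cannot be done from the selection text alone.
    """
    raw = line.encode("utf-8")
    # Decode the bytes preceding the column and count the codepoints.
    return len(raw[: byte_col - 1].decode("utf-8")) + 1

# "é" is two bytes in UTF-8, so byte column 3 is codepoint column 2.
byte_col_to_char_col("école", 3)  # -> 2
```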

So, I propose that byte indices not be exposed for scripting. Instead, all expansions and commands would use character indices. The user could still split diacritics and other compound characters, but not multibyte character encodings.

No matter how it is sliced, this would be a fair amount of work, and there are a couple of ways to do it. One is to audit everything at the user-interface layer (commands and expansions) and ensure they provide the right data. Another is to encapsulate this at the buffer layer and not expose byte indices to the rest of Kakoune.

IMHO, both add internal complexity, and both reduce interface complexity. The latter contains the complexity better, but is harder. It would be nice to figure out how to break it down into smaller steps.

(Note: It would be possible to make parallel expansions that use char indices - some already exist - and add options to commands like select. I'm proposing the big change here because I think it'll be better for Kakoune.)

@Screwtapello
Contributor

This affects kak-lsp too: kakoune-lsp/kakoune-lsp#98

"character" is kind of a hazy concept in the world of Unicode, and the official documentation tries to avoid it. Some common alternatives include:

  • bytes in a particular encoding, as Kakoune does, so U+0041 LATIN CAPITAL LETTER A is one byte, U+0107 LATIN SMALL LETTER C WITH ACUTE is two, and U+1F404 COW is four.
  • UTF-16 code units, as Java and Win32 do, so U+0041 is one unit, U+0107 is also one unit, and U+1F404 is two.
  • UTF-32 code units, or Unicode code points, so U+0041, U+0107 and U+1F404 are one code point each, but U+0063 LATIN SMALL LETTER C followed by U+0301 COMBINING ACUTE is two code-points, even though it looks the same as U+0107.
  • Grapheme clusters, so U+0041, U+0107, U+0063 U+0301, and U+1F404 are all one cluster each
  • Character cells, so U+0041, U+0107, and U+0063 U+0301 are each one cell, but U+1F404 is two, and so is U+FF21 FULLWIDTH LATIN CAPITAL LETTER A, while U+0301 on its own is zero.
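The differences between these counting schemes can be seen directly. A quick Python sketch (Python strings are sequences of codepoints, so `len` counts codepoints, and encoding gives the byte or code-unit counts):

```python
s = "\u0107"        # U+0107 LATIN SMALL LETTER C WITH ACUTE, precomposed
t = "c\u0301"       # U+0063 + U+0301 combining acute -- looks identical
cow = "\U0001F404"  # U+1F404 COW

print(len(s.encode("utf-8")))             # UTF-8 bytes: 2
print(len(s.encode("utf-16-le")) // 2)    # UTF-16 code units: 1
print(len(s))                             # codepoints: 1
print(len(t))                             # codepoints: 2

print(len(cow.encode("utf-8")))           # UTF-8 bytes: 4
print(len(cow.encode("utf-16-le")) // 2)  # UTF-16 code units: 2 (surrogate pair)
print(len(cow))                           # codepoints: 1
```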

Kakoune's "UTF-8 bytes" coordinates are great for other tools that use UTF-8 internally, but not great for tools that use other encodings.

"code units" coordinates would likewise be great for tools that use UTF-16 internally, but not great for tools that use UTF-8. It also wouldn't be great for tools that use UTF-32, but the only difference would be astral plane characters (U+10000 and above) which aren't that common so maybe it would still be worthwhile.

"codepoints" is a nice encoding-agnostic system, but for selections it has the same problems as the current "UTF-8 bytes": since codepoints are not fixed-size in most encodings, you still need the original buffer content to convert codepoint-coordinates into UTF-8 bytes or code-units that you program can work with. Also, if these selection coordinates are visible in the UI, it would be weird that ć is one unit wide while is two, even though they look identical.

"grapheme clusters" is encoding-agnostic, but has the same "needs the whole buffer content" issues as UTF-8 bytes, code units, and codepoints. It also requires a copy of the Unicode character database to determine the edge of each cluster, which can make for Fun Times when (for example) Kakoune is using data from libc, and a plugin is using data from the JRE, and the two data sources disagree. It does make more sense in the UI, though.

"character cells" is basically like "grapheme clusters", except that it's not strictly-speaking defined by the Unicode standard. Unicode defines particular character cell widths for some characters, but not all characters, so implementors have to guess. See the comment at the top of Markus Kuhn's wcwidth.c, and issues like jquast/wcwidth#8.

@mawww
Owner

mawww commented Feb 11, 2019

Hello,

So first, one thing to keep in mind is that Kakoune does not require the buffer content to be Unicode; it will interpret it as UTF-8, but should tolerate non-UTF-8 contents (although there are no strong guarantees about what it will do with them).

That means that giving anything other than byte offsets is going to be ambiguous on non-UTF-8 buffer contents, as there is no well-defined method that I know of for handling invalid UTF-8 text.

I think we can all agree that, in retrospect, the choice of UTF-16 by Java and Windows was a mistake: UTF-16 lost, UTF-8 won. I would be tempted to say that tools using another encoding are the ones that need fixing. If your language of choice's strings enforce UTF-16 (or any encoding, including UTF-8), maybe using strings to store the buffer contents is not a good idea and you should be using a byte array instead... Unfortunately, that does not change the status quo; the language server protocol is going to be using UTF-16 for the time being.

Internally, Kakoune actually uses 3 different horizontal coordinates: bytes, characters, and columns.

  • bytes are what we've been discussing so far.
  • characters is misnamed and actually means codepoints.
  • columns is display related, using whatever libc wcwidth returns to compute a column position.

We could relatively easily expose any of those, but as described by @Screwtapello, none is entirely satisfying by itself.

One additional complication is timestamping. Kakoune accepts input that does not match the current buffer state (say, for a ranges-highlighter input) and updates those coordinates using the buffer changes vector (which tracks modifications made to the buffer). But those changes only give information at the byte level; supporting anything else would mean storing not only the count of bytes added/removed, but also the count of columns and/or codepoints... I am not really looking forward to that.
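The replay idea can be sketched as follows (a hypothetical simplification in Python, not Kakoune's actual data structures): each recorded change carries only a byte position and a byte delta, which is exactly why byte coordinates are the only ones that can be updated cheaply.

```python
def replay(byte_offset: int, changes: list[tuple[int, int]]) -> int:
    """Update a byte offset through a list of (position, byte_delta) changes.

    Works because deltas are recorded in bytes; replaying codepoint or
    column coordinates would require storing those counts per change too.
    """
    for pos, delta in changes:
        if pos <= byte_offset:
            # Shift the offset, clamping so a deletion spanning it
            # leaves it at the deletion point.
            byte_offset = max(pos, byte_offset + delta)
    return byte_offset

# Insert 3 bytes at offset 5, then delete 2 bytes at offset 1:
replay(10, [(5, 3), (1, -2)])  # -> 11
```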

All that to say, I have no real solution to this problem. I am unconvinced that any of the alternatives to the status quo is significantly better, and I think the status quo is the most robust solution, because it's the only one that is unambiguous. But that means we cannot solve the mismatched-encoding problem from within Kakoune, and other tools need to be fixed to stop enforcing an encoding on arbitrary bytes (because frankly, there is no technical reason to do that; LSP uses UTF-16 because the designers were lazy and exposed a VSCode implementation detail to the world).

@eraserhd
Contributor Author

  1. Is there an orthogonal way to expose these coordinates? Like, parsing could accept 1.5b, 1.5p, or 1.5c for example, and expansions could be selection_bytes_coordinate, selection_codepoints_coordinate, and so forth?

  2. Would you accept a patch renaming characters to codepoints in the source?

@eraserhd
Contributor Author

Oh, I skipped part of my reaction, which is: alright, I get that we'll want byte coordinates for binary-file reasons.

@eraserhd
Contributor Author

eraserhd commented Feb 12, 2019

OK... updated proposal:

  1. Rename characters to codepoints internally and in exposed expansions (breaking change).
  2. Add %val{selection_desc_in_codepoints}
  3. Add -codepoints option to select.

2 and 3 are what I minimally need so that select-nrepl does not have to convert UTF-8 byte offsets to codepoints and back (we have exposed the cursor, but not the anchor, in codepoints). 1 is for consistency.

To fully fix parinfer-rust's Unicode issues, I'd also need to add %val{selection_desc_in_columns} and a -columns option to select, though the above will already make things much better.

@eraserhd eraserhd changed the title Proposal: Don't expose byte indices as columns Proposal: Select by character or display indices Oct 28, 2019
@eraserhd
Contributor Author

I've changed the title to be more current. I've started working on this. Currently, it looks like this:

Column values given to :select can be suffixed with b (bytes), c (characters), or d (display). The suffix can be omitted, in which case it is bytes. E.g.:

:select -timestamp 42 3.3b,4.1c 4.7d,4.9d 5.7,6.9
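The suffix syntax described above could be parsed along these lines (a Python sketch with hypothetical names; the actual implementation would be in Kakoune's C++ command parsing):

```python
import re

# line.column with an optional unit suffix: b (bytes), c (characters),
# or d (display); no suffix means bytes.
COORD = re.compile(r"^(\d+)\.(\d+)([bcd]?)$")

def parse_coord(text: str) -> tuple[int, int, str]:
    """Parse 'line.column[suffix]' into (line, column, unit)."""
    m = COORD.match(text)
    if not m:
        raise ValueError(f"bad coordinate: {text!r}")
    line, col, unit = m.groups()
    return int(line), int(col), unit or "b"

parse_coord("4.1c")  # -> (4, 1, 'c')
parse_coord("5.7")   # -> (5, 7, 'b')
```

Because the unit travels with each coordinate, a tool can mix units within one :select call, rather than committing the whole command to a single unit via an option.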

Using suffixes instead of a -codepoints option is useful for parinfer-rust, which only modifies the main selection. Using an option would require parinfer-rust and similar utilities to translate the other coordinates somehow (defeating this feature's usefulness), or new options to :select which allow replacing only some selections.

The biggest complication to implementing this is when -timestamp is not the latest timestamp, since the byte locations need to be resolved from display or character columns against a previous version of the buffer. How useful is this case?

@Screwtapello
Contributor

c (characters)

I hope you mean "codepoints" here.

How useful is this case?

I feel like it should be pretty important for external, asynchronous processes like parinfer-rust and kak-lsp, since the user might have done more stuff between the message being sent and the response received. Does select -timestamp share a code-path with <z> to restore selections from a register? In that case, applying selections made much earlier would be very important.

@eraserhd
Contributor Author

I hope you mean "codepoints" here.

Yes

eraserhd added a commit to eraserhd/kakoune that referenced this issue Nov 10, 2019
@mawww mawww closed this as completed in e964b68 Nov 12, 2019