Fix utf-16 to utf-32 character offsets. #117

deathaxe · 2020-05-02T20:57:32Z

Closes #116.

Python uses utf-32 for strings, but the language-server-protocol uses utf-16 to encode strings. Hence higher level unicode code points are represented by two utf-16 code units. Thus the position.character needs to be adjusted in case of the presence of such unicode characters.

see: #116, microsoft/language-server-protocol#376

This PR therefore introduces two methods to the Documents class to handle conversion between utf-16 and utf-32 column offsets for general use.

The new methods are used in the offset_at_position(), word_at_position() and _apply_incremental_change() methods.

Notes:

The conversion needs information about the line's content. Hence it feels natural to add the functionality to the Document.
This change is unfortunately not transparent. Means a real server needs to proxy all positions through

a) Document.rowcol_to_position() in order to create a Position with correct utf-16 code units to be sent to the client.

b) Document.position_to_rowcol() to receive a corrected (row, col) tuple.
The methods convert between tuple <-> Position to make the difference clear and because row and column are needed explicitly in the current use cases anyway. A new type could be invented but it would need to make the difference obvious.
The ''.join(..) statement in offset_at_position() is replaced as creating a maybe huge string just to count characters while the length of each line is a stored attribute seems too expensive.

Description (e.g. "Related to ...", etc.)

Please replace this description with a concise description of this Pull Request.

Code review checklist (for code reviewer to complete)

Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR)
Title summarizes what is changing
Commit messages are meaningful (see this for details)
Tests have been included and/or updated, as appropriate
Docstrings have been included and/or updated, as appropriate
Standalone docs have been updated accordingly
CONTRIBUTORS.md was updated, as appropriate
Changelog has been updated, as needed (see CHANGELOG.md)

Python uses utf-32 for strings, but the language-server-protocol uses utf-16 to encode strings. Hence higher level unicode code points are represented by two utf-16 code units. Thus the `position.character` needs to be adjusted in case of the presence of such unicode characters. see: #116 Notes: 1. This commit patches the methods which handle positions at the moment. A general function to convert character offsets might be useful. 2. The `''.join(..)` statement in `offset_at_position()` is replaced as creating a maybe huge string just to count characters while the length of each line is a stored attribute seems too expensive.

Illustrates how different utf-16 character offsets result in the same utf-32 character offset.

rwols · 2020-05-03T08:30:33Z

The problems don't end here. For instance, the consumers of this library receive raw LSP positions:

https://github.com/muffinmad/anakin-language-server/blob/65a23097277ee7999f4ae824f11f034921121b64/anakinls/server.py#L226-L229

This means consumers are receiving LSP-space positions (UTF-16 col offsets), and are assuming they're getting "Python" / "Jedi" space positions (UTF-32 col offsets).

This commit introduces two methods to the `Documents` class to handle conversion between utf-16 and utf-32 column offsets for general use. The new methods are used in the offset_at_position(), word_at_position() and _apply_incremental_change() methods. Notes: 1. The conversion needs information about the line's content. Hence it feels natural to add the functionality to the Document. 2. This change is unfortunately not transparent. Means a real server needs to proxy all positions through a) Document.rowcol_to_position() in order to create a `Position` with correct utf-16 code units to be sent to the client. b) Document.position_to_rowcol() to receive a corrected (row, col) tuple. 3. The methods convert between tuple <-> Position to make the difference clear and because row and column are needed explicitly in the current use cases anyway. A new type could be invented but it would need to make the difference obvious.

danixeee · 2020-05-04T12:51:30Z

This code was initially copied from pyls and I wasn't aware of this. Thanks for these fixes. Could you please update the changelog and you can also add yourself to contributors doc if you wish.

deathaxe · 2020-05-04T16:43:12Z

They are not aware of it, too. To be honest, I was trying to find an approach which transparently converts those column offsets in order to avoid bothering server developers with that detail.

Do you see any chance to do something like that?

One further question arose yesterday. What If a server needs to to conversion after an expensive job, which was computed in a thread and the document was modified in the meanwhile?

I have a feeling of my solution to not be the best choise then, but I am not too sure about it. Maybe should add global functions or make them a method of Position and passing a string which is used to compute the column offsets.

Maybe we can workout a proper pythonic solution?

You may read that PR as an initial point of discussion.

danixeee · 2020-05-06T17:27:35Z

@deathaxe I will test it again, but it seemed to be working fine.

What If a server needs to to conversion after an expensive job, which was computed in a thread and the document was modified in the meanwhile?

I am not sure how we will solve this, but it doesn't have to be in this PR. Do you have an example when this could happen?

It makes sense to me to have methods / global functions for this conversion which will be called on each request whose parameter extends TextDocumentPositionParams and possibly some other like DidChangeTextDocumentParams. That way other developers won't have to bother with this, as you said. If you wish, you can give it a try in this PR, otherwise, I think this is fine enough for now.

What do you think?

deathaxe · 2020-05-07T16:40:45Z

The code of this PR makes sure to correctly sync the document incrementally, which might be one of the most importent things to start with. Hence I'd be fine to merge it without further changes to keep the scope limited to this one aspect.

Just want to make sure not to add some unluckily choosen methods which might cause trouble in the long term as I just started to play with this library last weekend.

Do you have an example when this could happen?

Imagine the _valideate_json() in your example server was a time expensive asynchronous task which is executed in a separate thread. A language server could have received another didChange notification befor it finishes which updates the document by adding unicode characters to a line which would also be part of the _validate_json() command's results.

The Position or Range parts of the Diagnostic being sent to the client needs to be adjusted to utf-16 code units as well. As the document changed the conversion would run against the wrong piece of content.

Ok, we can argue the Diagnostics not to be valid in that situation anymore anyway and drop them, but I am not sure whether there are situations in which that could be important to run against an up to date string.

I thought about that possibility theroretically when diving into anakin-languages-server.

deathaxe · 2020-05-07T17:13:16Z

We may need some optimization to avoid splitting code too often, when running batch calls to the new methods.

Where would you place global functions?

pygls/workspace.py

danixeee · 2020-05-08T11:38:21Z

Imagine the _valideate_json() in your example server was a time expensive asynchronous task which is executed in a separate thread. A language server could have received another didChange notification befor it finishes which updates the document by adding unicode characters to a line which would also be part of the _validate_json() command's results.

I see, that makes sense, but I think that core operations (validation, code completions, etc.) should be fast, otherwise your LS would be useless. For custom commands I think developer should decide what to do and if pygls has some barriers, we will find the solution. Thinking about all possible problems in advance is hard and maybe not even needed, but we should stay flexible as much as we can. For e.g., in this project even LanguageServerProtocol class is overrided which gives more control if you need it.

Where would you place global functions?

Maybe you can place them in workspace.py?

Needed to extract those objects. Example: row, col = position (start_row, start_col), (end_row, end_col) = range

At some point the utf16 conversion should become transparent. Therefore the column offsets are adjusted without type conversion and by creating new objects with the adjusted values. No longer convert in-place.

danixeee · 2020-05-12T09:52:05Z

@deathaxe Can you just update the changelog and rebase/merge with master?

…6-columns

pygls/workspace.py

danixeee · 2020-05-13T12:01:23Z

@deathaxe Everything works great, can you just address few typos in docstrings and I will approve/merge.

deathaxe added 2 commits May 2, 2020 22:55

Improve test case

12a8ad9

Illustrates how different utf-16 character offsets result in the same utf-32 character offset.

danixeee self-requested a review May 4, 2020 12:18

danixeee assigned deathaxe May 4, 2020

danixeee added the bug Something isn't working label May 4, 2020

Reduce number of splitlines() calls

f621edf

rwols reviewed May 7, 2020

View reviewed changes

pygls/workspace.py Show resolved Hide resolved

deathaxe added 3 commits May 9, 2020 22:30

Global functions to convert character offsets

1259fd6

Add iterator support to Range and Position classes

fd9a04f

Needed to extract those objects. Example: row, col = position (start_row, start_col), (end_row, end_col) = range

Replace position_to_rowcol, rowcol_to_position

7f455dd

At some point the utf16 conversion should become transparent. Therefore the column offsets are adjusted without type conversion and by creating new objects with the adjusted values. No longer convert in-place.

deathaxe added 2 commits May 12, 2020 18:18

Merge remote-tracking branch 'openlawlibrary/master' into pr/fix/utf1…

eaa741c

…6-columns

Update changelog

b82c7f2

danixeee reviewed May 13, 2020

View reviewed changes

Fix typos

e0509e9

danixeee approved these changes May 14, 2020

View reviewed changes

danixeee merged commit 517f9dc into openlawlibrary:master May 14, 2020

deathaxe deleted the pr/fix/utf16-columns branch May 14, 2020 19:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix utf-16 to utf-32 character offsets. #117

Fix utf-16 to utf-32 character offsets. #117

deathaxe commented May 2, 2020 •

edited by danixeee

Loading

rwols commented May 3, 2020 •

edited

Loading

danixeee commented May 4, 2020

deathaxe commented May 4, 2020

danixeee commented May 6, 2020

deathaxe commented May 7, 2020

deathaxe commented May 7, 2020

danixeee commented May 8, 2020

danixeee commented May 12, 2020

danixeee commented May 13, 2020

Fix utf-16 to utf-32 character offsets. #117

Fix utf-16 to utf-32 character offsets. #117

Conversation

deathaxe commented May 2, 2020 • edited by danixeee Loading

Description (e.g. "Related to ...", etc.)

Code review checklist (for code reviewer to complete)

rwols commented May 3, 2020 • edited Loading

danixeee commented May 4, 2020

deathaxe commented May 4, 2020

danixeee commented May 6, 2020

deathaxe commented May 7, 2020

deathaxe commented May 7, 2020

danixeee commented May 8, 2020

danixeee commented May 12, 2020

danixeee commented May 13, 2020

deathaxe commented May 2, 2020 •

edited by danixeee

Loading

rwols commented May 3, 2020 •

edited

Loading