Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix utf-16 to utf-32 character offsets. #117

Merged
merged 10 commits into from
May 14, 2020
Merged

Fix utf-16 to utf-32 character offsets. #117

merged 10 commits into from
May 14, 2020

Conversation

deathaxe
Copy link
Contributor

@deathaxe deathaxe commented May 2, 2020

Closes #116.

Python uses utf-32 for strings, but the language-server-protocol uses utf-16 to encode strings. Hence higher level unicode code points are represented by two utf-16 code units. Thus the position.character needs to be adjusted in case of the presence of such unicode characters.

see: #116, microsoft/language-server-protocol#376

This PR therefore introduces two methods to the Documents class to handle conversion between utf-16 and utf-32 column offsets for general use.

The new methods are used in the offset_at_position(), word_at_position() and _apply_incremental_change() methods.

Notes:

  1. The conversion needs information about the line's content. Hence it feels natural to add the functionality to the Document.

  2. This change is unfortunately not transparent. Means a real server needs to proxy all positions through

    a) Document.rowcol_to_position() in order to create a Position with correct utf-16 code units to be sent to the client.

    b) Document.position_to_rowcol() to receive a corrected (row, col) tuple.

  3. The methods convert between tuple <-> Position to make the difference clear and because row and column are needed explicitly in the current use cases anyway. A new type could be invented but it would need to make the difference obvious.

  4. The ''.join(..) statement in offset_at_position() is replaced as creating a maybe huge string just to count characters while the length of each line is a stored attribute seems too expensive.

Description (e.g. "Related to ...", etc.)

Please replace this description with a concise description of this Pull Request.

Code review checklist (for code reviewer to complete)

  • Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR)
  • Title summarizes what is changing
  • Commit messages are meaningful (see this for details)
  • Tests have been included and/or updated, as appropriate
  • Docstrings have been included and/or updated, as appropriate
  • Standalone docs have been updated accordingly
  • CONTRIBUTORS.md was updated, as appropriate
  • Changelog has been updated, as needed (see CHANGELOG.md)

deathaxe added 2 commits May 2, 2020 22:55
Python uses utf-32 for strings, but the language-server-protocol uses
utf-16 to encode strings. Hence higher level unicode code points are
represented by two utf-16 code units. Thus the `position.character` needs to be adjusted in case of the presence of such unicode characters.

see: #116

Notes:

1. This commit patches the methods which handle positions at the moment.
   A general function to convert character offsets might be useful.

2. The `''.join(..)` statement in `offset_at_position()` is replaced as
   creating a maybe huge string just to count characters while the
   length of each line is a stored attribute seems too expensive.
Illustrates how different utf-16 character offsets result in the same
utf-32 character offset.
@rwols
Copy link

rwols commented May 3, 2020

The problems don't end here. For instance, the consumers of this library receive raw LSP positions:

https://github.com/muffinmad/anakin-language-server/blob/65a23097277ee7999f4ae824f11f034921121b64/anakinls/server.py#L226-L229

This means consumers are receiving LSP-space positions (UTF-16 col offsets), and are assuming they're getting "Python" / "Jedi" space positions (UTF-32 col offsets).

This commit introduces two methods to the `Documents` class to handle
conversion between utf-16 and utf-32 column offsets for general use.

The new methods are used in the offset_at_position(), word_at_position()
and _apply_incremental_change() methods.

Notes:

1. The conversion needs information about the line's content. Hence it
   feels natural to add the functionality to the Document.

2. This change is unfortunately not transparent. Means a real server
   needs to proxy all positions through

   a) Document.rowcol_to_position() in order to create a `Position` with
      correct utf-16 code units to be sent to the client.

   b) Document.position_to_rowcol() to receive a corrected (row, col)
      tuple.

3. The methods convert between tuple <-> Position to make the difference
   clear and because row and column are needed explicitly in the current
   use cases anyway. A new type could be invented but it would need to
   make the difference obvious.
@danixeee danixeee self-requested a review May 4, 2020 12:18
@danixeee danixeee added the bug Something isn't working label May 4, 2020
@danixeee
Copy link
Contributor

danixeee commented May 4, 2020

This code was initially copied from pyls and I wasn't aware of this. Thanks for these fixes. Could you please update the changelog and you can also add yourself to contributors doc if you wish.

@deathaxe
Copy link
Contributor Author

deathaxe commented May 4, 2020

They are not aware of it, too. To be honest, I was trying to find an approach which transparently converts those column offsets in order to avoid bothering server developers with that detail.

Do you see any chance to do something like that?

One further question arose yesterday. What If a server needs to to conversion after an expensive job, which was computed in a thread and the document was modified in the meanwhile?

I have a feeling of my solution to not be the best choise then, but I am not too sure about it. Maybe should add global functions or make them a method of Position and passing a string which is used to compute the column offsets.

Maybe we can workout a proper pythonic solution?

You may read that PR as an initial point of discussion.

@danixeee
Copy link
Contributor

danixeee commented May 6, 2020

@deathaxe I will test it again, but it seemed to be working fine.

What If a server needs to to conversion after an expensive job, which was computed in a thread and the document was modified in the meanwhile?

I am not sure how we will solve this, but it doesn't have to be in this PR. Do you have an example when this could happen?

It makes sense to me to have methods / global functions for this conversion which will be called on each request whose parameter extends TextDocumentPositionParams and possibly some other like DidChangeTextDocumentParams. That way other developers won't have to bother with this, as you said. If you wish, you can give it a try in this PR, otherwise, I think this is fine enough for now.

What do you think?

@deathaxe
Copy link
Contributor Author

deathaxe commented May 7, 2020

The code of this PR makes sure to correctly sync the document incrementally, which might be one of the most importent things to start with. Hence I'd be fine to merge it without further changes to keep the scope limited to this one aspect.

Just want to make sure not to add some unluckily choosen methods which might cause trouble in the long term as I just started to play with this library last weekend.

Do you have an example when this could happen?

Imagine the _valideate_json() in your example server was a time expensive asynchronous task which is executed in a separate thread. A language server could have received another didChange notification befor it finishes which updates the document by adding unicode characters to a line which would also be part of the _validate_json() command's results.

The Position or Range parts of the Diagnostic being sent to the client needs to be adjusted to utf-16 code units as well. As the document changed the conversion would run against the wrong piece of content.

Ok, we can argue the Diagnostics not to be valid in that situation anymore anyway and drop them, but I am not sure whether there are situations in which that could be important to run against an up to date string.

I thought about that possibility theroretically when diving into anakin-languages-server.

@deathaxe
Copy link
Contributor Author

deathaxe commented May 7, 2020

We may need some optimization to avoid splitting code too often, when running batch calls to the new methods.

Where would you place global functions?

@danixeee
Copy link
Contributor

danixeee commented May 8, 2020

Imagine the _valideate_json() in your example server was a time expensive asynchronous task which is executed in a separate thread. A language server could have received another didChange notification befor it finishes which updates the document by adding unicode characters to a line which would also be part of the _validate_json() command's results.

I see, that makes sense, but I think that core operations (validation, code completions, etc.) should be fast, otherwise your LS would be useless. For custom commands I think developer should decide what to do and if pygls has some barriers, we will find the solution. Thinking about all possible problems in advance is hard and maybe not even needed, but we should stay flexible as much as we can. For e.g., in this project even LanguageServerProtocol class is overrided which gives more control if you need it.

Where would you place global functions?

Maybe you can place them in workspace.py?

deathaxe added 3 commits May 9, 2020 22:30
Needed to extract those objects.

Example:

   row, col = position

   (start_row, start_col), (end_row, end_col) = range
At some point the utf16 conversion should become transparent. Therefore
the column offsets are adjusted without type conversion and by creating
new objects with the adjusted values. No longer convert in-place.
@danixeee
Copy link
Contributor

@deathaxe Can you just update the changelog and rebase/merge with master?

pygls/workspace.py Outdated Show resolved Hide resolved
pygls/workspace.py Outdated Show resolved Hide resolved
pygls/workspace.py Outdated Show resolved Hide resolved
pygls/workspace.py Outdated Show resolved Hide resolved
pygls/workspace.py Outdated Show resolved Hide resolved
pygls/workspace.py Outdated Show resolved Hide resolved
pygls/workspace.py Outdated Show resolved Hide resolved
pygls/workspace.py Outdated Show resolved Hide resolved
pygls/workspace.py Outdated Show resolved Hide resolved
@danixeee
Copy link
Contributor

@deathaxe Everything works great, can you just address few typos in docstrings and I will approve/merge.

@danixeee danixeee merged commit 517f9dc into openlawlibrary:master May 14, 2020
@deathaxe deathaxe deleted the pr/fix/utf16-columns branch May 14, 2020 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make sure you handle UTF-16 character offsets
3 participants