-
-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix utf-16 to utf-32 character offsets. #117
Fix utf-16 to utf-32 character offsets. #117
Conversation
Python uses utf-32 for strings, but the language-server-protocol uses utf-16 to encode strings. Hence higher level unicode code points are represented by two utf-16 code units. Thus the `position.character` needs to be adjusted in case of the presence of such unicode characters. see: #116 Notes: 1. This commit patches the methods which handle positions at the moment. A general function to convert character offsets might be useful. 2. The `''.join(..)` statement in `offset_at_position()` is replaced as creating a maybe huge string just to count characters while the length of each line is a stored attribute seems too expensive.
Illustrates how different utf-16 character offsets result in the same utf-32 character offset.
The problems don't end here. For instance, the consumers of this library receive raw LSP positions: This means consumers are receiving LSP-space positions (UTF-16 col offsets), and are assuming they're getting "Python" / "Jedi" space positions (UTF-32 col offsets). |
This commit introduces two methods to the `Documents` class to handle conversion between utf-16 and utf-32 column offsets for general use. The new methods are used in the offset_at_position(), word_at_position() and _apply_incremental_change() methods. Notes: 1. The conversion needs information about the line's content. Hence it feels natural to add the functionality to the Document. 2. This change is unfortunately not transparent. Means a real server needs to proxy all positions through a) Document.rowcol_to_position() in order to create a `Position` with correct utf-16 code units to be sent to the client. b) Document.position_to_rowcol() to receive a corrected (row, col) tuple. 3. The methods convert between tuple <-> Position to make the difference clear and because row and column are needed explicitly in the current use cases anyway. A new type could be invented but it would need to make the difference obvious.
This code was initially copied from pyls and I wasn't aware of this. Thanks for these fixes. Could you please update the changelog and you can also add yourself to contributors doc if you wish. |
They are not aware of it, too. To be honest, I was trying to find an approach which transparently converts those column offsets in order to avoid bothering server developers with that detail. Do you see any chance to do something like that? One further question arose yesterday. What If a server needs to to conversion after an expensive job, which was computed in a thread and the document was modified in the meanwhile? I have a feeling of my solution to not be the best choise then, but I am not too sure about it. Maybe should add global functions or make them a method of Position and passing a Maybe we can workout a proper pythonic solution? You may read that PR as an initial point of discussion. |
@deathaxe I will test it again, but it seemed to be working fine.
I am not sure how we will solve this, but it doesn't have to be in this PR. Do you have an example when this could happen? It makes sense to me to have methods / global functions for this conversion which will be called on each request whose parameter extends TextDocumentPositionParams and possibly some other like DidChangeTextDocumentParams. That way other developers won't have to bother with this, as you said. If you wish, you can give it a try in this PR, otherwise, I think this is fine enough for now. What do you think? |
The code of this PR makes sure to correctly sync the document incrementally, which might be one of the most importent things to start with. Hence I'd be fine to merge it without further changes to keep the scope limited to this one aspect. Just want to make sure not to add some unluckily choosen methods which might cause trouble in the long term as I just started to play with this library last weekend.
Imagine the The Ok, we can argue the Diagnostics not to be valid in that situation anymore anyway and drop them, but I am not sure whether there are situations in which that could be important to run against an up to date string. I thought about that possibility theroretically when diving into anakin-languages-server. |
We may need some optimization to avoid splitting code too often, when running batch calls to the new methods. Where would you place global functions? |
I see, that makes sense, but I think that core operations (validation, code completions, etc.) should be fast, otherwise your LS would be useless. For custom commands I think developer should decide what to do and if pygls has some barriers, we will find the solution. Thinking about all possible problems in advance is hard and maybe not even needed, but we should stay flexible as much as we can. For e.g., in this project even LanguageServerProtocol class is overrided which gives more control if you need it.
Maybe you can place them in |
Needed to extract those objects. Example: row, col = position (start_row, start_col), (end_row, end_col) = range
At some point the utf16 conversion should become transparent. Therefore the column offsets are adjusted without type conversion and by creating new objects with the adjusted values. No longer convert in-place.
@deathaxe Can you just update the changelog and rebase/merge with master? |
@deathaxe Everything works great, can you just address few typos in docstrings and I will approve/merge. |
Closes #116.
Python uses utf-32 for strings, but the language-server-protocol uses utf-16 to encode strings. Hence higher level unicode code points are represented by two utf-16 code units. Thus the
position.character
needs to be adjusted in case of the presence of such unicode characters.see: #116, microsoft/language-server-protocol#376
This PR therefore introduces two methods to the
Documents
class to handle conversion between utf-16 and utf-32 column offsets for general use.The new methods are used in the
offset_at_position()
,word_at_position()
and_apply_incremental_change()
methods.Notes:
The conversion needs information about the line's content. Hence it feels natural to add the functionality to the Document.
This change is unfortunately not transparent. Means a real server needs to proxy all positions through
a)
Document.rowcol_to_position()
in order to create aPosition
with correct utf-16 code units to be sent to the client.b)
Document.position_to_rowcol()
to receive a corrected (row, col) tuple.The methods convert between tuple <-> Position to make the difference clear and because row and column are needed explicitly in the current use cases anyway. A new type could be invented but it would need to make the difference obvious.
The
''.join(..)
statement inoffset_at_position()
is replaced as creating a maybe huge string just to count characters while the length of each line is a stored attribute seems too expensive.Description (e.g. "Related to ...", etc.)
Please replace this description with a concise description of this Pull Request.
Code review checklist (for code reviewer to complete)