Skip to content

Fixes non UTF-8 surrogateescapes #612

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jul 11, 2017
Merged

Conversation

tylerjharden
Copy link
Contributor

@tylerjharden tylerjharden commented Jun 30, 2017

Surrogate escapes in Unicode (non UTF-8 encoding) will be properly escaped with backslashes when encountered, versus breaking the transport layer.

This addresses my issue here: #611

Surrogate escapes in Unicode (non UTF-8 encoding) will be properly escaped with backslashes when encountered, versus breaking the transport layer.
Copy link
Contributor

@honzakral honzakral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't had a chance to run it but this looks like it would swallow all encode errors other than surrogates and also return unicode where bytes are called for.

A change in a sensitive spot like this really requires tests.

Fixes to re-raise exceptions with different reasons
Removes erroneous bytes decode where bytes are desired
Tests that a surrogate escape sequence is properly escaped
with backslashes to produce valid UTF-8.
@tylerjharden
Copy link
Contributor Author

@honzakral It passes existing tests, and I have added a unit test surrounding the case it is designed to fix. Works properly, and the same fix is currently used in a production application before passing into elasticsearch-py.

I also removed the erroneous decode and ensured irrelevant errors were reraised.

@tylerjharden
Copy link
Contributor Author

Noticing there are some further differences in the Py2/Py3 tests, as this is a non-issue in Python2. Will investigate further.

Use a Unicode Surrogate that properly escapes in both Python2 and Python3
@tylerjharden
Copy link
Contributor Author

tylerjharden commented Jul 7, 2017

@honzakral I would really appreciate any time or effort you can put into looking into this issue, there is a fundamental flaw in the difference between how Python 2 and Python 3 handle Unicode Surrogates. I have included Python REPL output from Python 2.7.13 and Python 3.6.1 to outline this issue. I am somewhat unclear how I can write a passing testin both versions of the language simultaneously, it is either one or the other with this issue:

Python 2

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'.decode('utf-8')
u'\u4f60\u597d\\uda6a'
>>> '\u4f60\u597d\\uda6a'.encode('utf-8')
'\\u4f60\\u597d\\uda6a'
>>> '\u4f60\u597d\uda6a'.encode('utf-8')
'\\u4f60\\u597d\\uda6a'
>>> u'\u4f60\u597d\uda6a'.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xa9\xaa'
>>> u'\u4f60\u597d\udced\udca9\udcaa'.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xb3\xad\xed\xb2\xa9\xed\xb2\xaa'
>>>

Python 3

Python 3.6.1 (default, Jun 12 2017, 14:15:31)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.22.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'.decode()
'\u4f60\u597d\\uda6a'
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'.decode('utf-8')
'\u4f60\u597d\\uda6a'
>>> '\uda6a'.encode('utf-8', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\uda6a' in position 0: surrogates not allowed
>>> '\u4f60\u597d\uda6a'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\uda6a' in position 2: surrogates not allowed
>>> '\u4f60\u597d\uda6a'.encode('utf-8', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\uda6a' in position 2: surrogates not allowed
>>> '\u4f60\u597d\uda6a'.encode('utf-8', 'backslashreplace')
b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
>>> '\u4f60\u597d\uda6a'.encode('utf-8', 'ignore')
b'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> '\u4f60\u597d\uda6a'.encode('utf-8', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\uda6a' in position 2: surrogates not allowed
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xa9\xaa'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6: invalid continuation byte
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xa9\xaa'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6: invalid continuation byte
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xa9\xaa'.decode('utf-8', 'surrogateescape')
'\u4f60\u597d\udced\udca9\udcaa'
>>>

Updating test to pass once surrogatepass is used
This replicates behavior between Python 2 and Python 3
@tylerjharden
Copy link
Contributor Author

@honzakral This is now passing tests, and is utilizing surrogatepass instead of backslashreplace which will replicate the Python 2 functionality in Python 3 when the error is encountered. The failing test is a fluke, completely unrelated to my changes.

Copy link
Contributor Author

@tylerjharden tylerjharden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have made changes as requested

@tylerjharden
Copy link
Contributor Author

Sorry for the noise, but this is now passing and ready to be properly reviewed.

cc @honzakral @fxdgear

Copy link

@bll-z bll-z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

Since `surrogatepass` will only ever explicitly occur when there are surrogate bytes encountered, there is no need to let the error throw and catch it, also uses single-quotes for consistency.
@honzakral honzakral merged commit d6fb953 into elastic:master Jul 11, 2017
rciorba added a commit to rciorba/elasticsearch-py that referenced this pull request Mar 2, 2018
Fixes non UTF-8 surrogateescapes 

Surrogate escapes in Unicode (non UTF-8 encoding) will be properly escaped with backslashes when encountered, versus breaking the transport layer.
Fixes elastic#611
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants