Fixes non UTF-8 surrogateescapes #612

tylerjharden · 2017-06-30T15:15:17Z

Surrogate escapes in Unicode (non UTF-8 encoding) will be properly escaped with backslashes when encountered, versus breaking the transport layer.

This addresses my issue here: #611
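For readers who have not hit this, the failure mode can be sketched in a few lines of Python 3. This is only an illustration of the underlying encode error, not code from the pull request; the variable names are made up for the example.

```python
# A str containing a lone (unpaired) surrogate code point, U+DA6A.
text = "\u4f60\u597d\uda6a"

try:
    text.encode("utf-8")  # strict error handling is the default
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# exc.reason names the cause; exc.start is the index of the offending character.
print(failure.reason, failure.start)
```

Any code path that encodes such a string with the default handler raises before a request can be sent.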


honzakral

I haven't had a chance to run it, but this looks like it would swallow all encode errors other than surrogates, and also return unicode where bytes are called for.

A change in a sensitive spot like this really requires tests.
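The two concerns raised here can be illustrated with a small Python 3 sketch. `encode_body` and its `encoding` parameter are hypothetical names chosen for this example, not the library's actual code; the point is only that the fallback should fire for surrogate errors alone and that the return value should always be bytes.

```python
def encode_body(body, encoding="utf-8"):
    """Hypothetical helper: always return bytes for a request body."""
    if isinstance(body, bytes):
        return body
    try:
        return body.encode(encoding)
    except UnicodeEncodeError as exc:
        # Only handle the surrogate case; every other encode error propagates.
        if exc.reason != "surrogates not allowed":
            raise
        # Return bytes, not str: no .decode() on the result.
        return body.encode(encoding, "backslashreplace")


print(encode_body("\u4f60\u597d\uda6a"))
```

The `encoding` parameter exists mainly so the re-raise branch can be exercised, for example with `encode_body("\xe9", encoding="ascii")`.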


tylerjharden · 2017-06-30T15:54:20Z

@honzakral It passes the existing tests, and I have added a unit test covering the case it is designed to fix. It works properly, and the same fix is currently used in a production application before data is passed into elasticsearch-py.

I also removed the erroneous decode and ensured irrelevant errors were reraised.

tylerjharden · 2017-06-30T17:31:31Z

I'm noticing some further differences in the Py2/Py3 tests, since this is a non-issue in Python 2. Will investigate further.


tylerjharden · 2017-07-07T15:23:26Z

@honzakral I would really appreciate any time or effort you can put into looking into this issue. There is a fundamental difference in how Python 2 and Python 3 handle Unicode surrogates. I have included Python REPL output from Python 2.7.13 and Python 3.6.1 to outline the issue. I am somewhat unclear on how to write a test that passes in both versions of the language simultaneously; with this issue it is either one or the other:

Python 2

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'.decode('utf-8')
u'\u4f60\u597d\\uda6a'
>>> '\u4f60\u597d\\uda6a'.encode('utf-8')
'\\u4f60\\u597d\\uda6a'
>>> '\u4f60\u597d\uda6a'.encode('utf-8')
'\\u4f60\\u597d\\uda6a'
>>> u'\u4f60\u597d\uda6a'.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xa9\xaa'
>>> u'\u4f60\u597d\udced\udca9\udcaa'.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xb3\xad\xed\xb2\xa9\xed\xb2\xaa'
>>>

Python 3

Python 3.6.1 (default, Jun 12 2017, 14:15:31)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.22.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'.decode()
'\u4f60\u597d\\uda6a'
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'.decode('utf-8')
'\u4f60\u597d\\uda6a'
>>> '\uda6a'.encode('utf-8', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\uda6a' in position 0: surrogates not allowed
>>> '\u4f60\u597d\uda6a'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\uda6a' in position 2: surrogates not allowed
>>> '\u4f60\u597d\uda6a'.encode('utf-8', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\uda6a' in position 2: surrogates not allowed
>>> '\u4f60\u597d\uda6a'.encode('utf-8', 'backslashreplace')
b'\xe4\xbd\xa0\xe5\xa5\xbd\\uda6a'
>>> '\u4f60\u597d\uda6a'.encode('utf-8', 'ignore')
b'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> '\u4f60\u597d\uda6a'.encode('utf-8', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\uda6a' in position 2: surrogates not allowed
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xa9\xaa'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6: invalid continuation byte
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xa9\xaa'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6: invalid continuation byte
>>> b'\xe4\xbd\xa0\xe5\xa5\xbd\xed\xa9\xaa'.decode('utf-8', 'surrogateescape')
'\u4f60\u597d\udced\udca9\udcaa'
>>>
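The Python 3 session above can be condensed into a short script. This is a sketch for Python 3 only, and the names `outcomes` and `escaped` are illustrative. It also shows why `surrogateescape` does not help here: that handler only encodes code points in the range U+DC80 to U+DCFF, each of which stands for one raw byte, and U+DA6A falls outside that range.

```python
text = "\u4f60\u597d\uda6a"  # same string as in the Python 3 session above

outcomes = {}
for handler in ("strict", "surrogateescape", "backslashreplace", "ignore"):
    try:
        outcomes[handler] = text.encode("utf-8", handler)
    except UnicodeEncodeError as exc:
        # Record the reason instead of the bytes when encoding fails.
        outcomes[handler] = exc.reason

for handler, outcome in outcomes.items():
    print(handler, "->", outcome)

# surrogateescape maps U+DCxx back to the single raw byte 0xxx.
escaped = "\udced\udca9\udcaa".encode("utf-8", "surrogateescape")
print(escaped)
```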


tylerjharden · 2017-07-07T17:31:55Z

@honzakral This is now passing tests and uses `surrogatepass` instead of `backslashreplace`, which replicates the Python 2 behavior in Python 3 when the error is encountered. The failing test is a fluke, completely unrelated to my changes.
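As a quick check of that claim, the following Python 3 sketch encodes the same string with `surrogatepass` and compares it against the bytes Python 2.7 produced in the REPL output earlier in this thread. The variable names are illustrative.

```python
text = "\u4f60\u597d\uda6a"

# surrogatepass writes the lone surrogate as the 3-byte sequence ED A9 AA,
# the same bytes Python 2.7 emitted for u'\u4f60\u597d\uda6a'.encode('utf-8').
encoded = text.encode("utf-8", "surrogatepass")
print(encoded)

# The same handler is needed to decode those bytes back in Python 3.
decoded = encoded.decode("utf-8", "surrogatepass")
print(decoded == text)
```

Note that the resulting bytes are not strictly valid UTF-8, so a strict decoder on the receiving side will still reject them.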

tylerjharden

Have made changes as requested

tylerjharden · 2017-07-07T17:55:03Z

Sorry for the noise, but this is now passing and ready to be properly reviewed.

cc @honzakral @fxdgear

bll-z

looks good to me



Fixes non UTF-8 surrogateescapes

61d39d8

Surrogate escapes in Unicode (non UTF-8 encoding) will be properly escaped with backslashes when encountered, versus breaking the transport layer.

honzakral suggested changes Jun 30, 2017


tylerjharden added 2 commits June 30, 2017 11:26

Removes erroneous bytes decode and reraises

0470858

Fixes to re-raise exceptions with different reasons. Removes erroneous bytes decode where bytes are desired.

Adds test for surrogate escapes in body

c6fa87b

Tests that a surrogate escape sequence is properly escaped with backslashes to produce valid UTF-8.

tylerjharden added 2 commits June 30, 2017 12:01

Use proper byte sequence for surrogate

1d8c0e9

Use if/else versus pass

4fa7023

Proper Unicode surrogate escape

b90231f

Use a Unicode Surrogate that properly escapes in both Python2 and Python3

tylerjharden added 2 commits July 7, 2017 11:49

Passing test once surrogatepass is used

cf0672d

Updating test to pass once surrogatepass is used

Use surrogatepass instead of backslashreplace

6259cb3

This replicates behavior between Python 2 and Python 3

Fixes whitespace

05c2b0a

tylerjharden commented Jul 7, 2017


bll-z approved these changes Jul 7, 2017


Simplifies with no exception block

41fb145

Since the `surrogatepass` handler only takes effect when surrogates are actually encountered, there is no need to let the error be raised and then catch it. Also uses single quotes for consistency.
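The reasoning behind this simplification can be sketched as follows (Python 3). Both function names are hypothetical and this is not the library's actual code; the sketch only shows that passing the handler up front gives the same bytes as catching the error and retrying.

```python
def encode_with_fallback(body):
    # Earlier shape: try strict encoding first, retry on failure.
    try:
        return body.encode('utf-8')
    except UnicodeEncodeError:
        return body.encode('utf-8', 'surrogatepass')


def encode_directly(body):
    # Simplified shape: the handler is only consulted when a surrogate appears.
    return body.encode('utf-8', 'surrogatepass')


samples = ['plain ascii', '\u4f60\u597d', '\u4f60\u597d\uda6a']
results = [(encode_with_fallback(s), encode_directly(s)) for s in samples]
print(all(a == b for a, b in results))
```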

jamesmosier approved these changes Jul 11, 2017


honzakral approved these changes Jul 11, 2017


honzakral merged commit d6fb953 into elastic:master Jul 11, 2017

x0day mentioned this pull request Apr 1, 2019

please fix non UTF-8 surrogateescapes elastic/elasticsearch-py-async#62

Closed

vEpiphyte mentioned this pull request Apr 20, 2020

Missing surrogatepass in urllib3 handler #1212

Merged
