-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Fixes non UTF-8 surrogateescapes #612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Surrogate escapes in Unicode (non UTF-8 encoding) will be properly escaped with backslashes when encountered, versus breaking the transport layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't had a chance to run it but this looks like it would swallow all encode errors other than surrogates and also return unicode where bytes are called for.
A change in a sensitive spot like this really requires tests.
Fixes to re-raise exceptions with different reasons Removes erroneous bytes decode where bytes are desired
Tests that a surrogate escape sequence is properly escaped with backslashes to produce valid UTF-8.
@honzakral It passes existing tests, and I have added a unit test surrounding the case it is designed to fix. Works properly, and the same fix is currently used in a production application before passing into elasticsearch-py. I also removed the erroneous decode and ensured irrelevant errors were reraised. |
Noticing there are some further differences in the Py2/Py3 tests, as this is a non-issue in Python2. Will investigate further. |
Use a Unicode Surrogate that properly escapes in both Python2 and Python3
@honzakral I would really appreciate any time or effort you can put into looking into this issue, there is a fundamental flaw in the difference between how Python 2 and Python 3 handle Unicode Surrogates. I have included Python REPL output from Python 2.7.13 and Python 3.6.1 to outline this issue. I am somewhat unclear how I can write a passing testin both versions of the language simultaneously, it is either one or the other with this issue: Python 2
Python 3
|
Updating test to pass once surrogatepass is used
This replicates behavior between Python 2 and Python 3
@honzakral This is now passing tests, and is utilizing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have made changes as requested
Sorry for the noise, but this is now passing and ready to be properly reviewed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me
Since `surrogatepass` will only ever explicitly occur when there are surrogate bytes encountered, there is no need to let the error throw and catch it, also uses single-quotes for consistency.
Fixes non UTF-8 surrogateescapes Surrogate escapes in Unicode (non UTF-8 encoding) will be properly escaped with backslashes when encountered, versus breaking the transport layer. Fixes elastic#611
Surrogate escapes in Unicode (non UTF-8 encoding) will be properly escaped with backslashes when encountered, versus breaking the transport layer.
This addresses my issue here: #611