-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for unexpected decoding errors. #59
base: master
Are you sure you want to change the base?
Conversation
Now no longer raises a UnicodeDecodeError when running into invalid UTF-8 characters, but tries iso-8859-1 first instead.
List of issues this should fix: |
Could you have a look at #97 and, if you're still interested in working on this issue, verify that your solution works for all usecases in all supported Python versions? Thanks! |
Just ran all the tests, both in my fork and the current latest official version. The tests all pass in Python 2, and some don't in Python 3 - but this PR doesn't break anything; those were broken even in the official version. What this PR does fix, however, is |
@@ -91,4 +91,10 @@ def whois_request(domain, server, port=43): | |||
if len(data) == 0: | |||
break | |||
buff += data | |||
return buff.decode("utf-8") | |||
encodings = ("utf-8", "iso-8859-1") | |||
for encoding in encodings: # This should probably not be a permanent solution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for that comment IMO. This is a perfectly fine solution. The RFC does not specify an encoding (rather comments on it as an issue), so the original assumption that it's valid UTF-8 data is where the problem lies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I also think the comment can be removed
Also: as a final catch-all instead of an error, one might normalise the data before conversion. |
Any progress on this decision as of yet? |
@@ -91,4 +91,10 @@ def whois_request(domain, server, port=43): | |||
if len(data) == 0: | |||
break | |||
buff += data | |||
return buff.decode("utf-8") | |||
encodings = ("utf-8", "iso-8859-1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe should add WIN-1252 (aka LATIN-1 aka 'cp1252'
in Python), very similar to iso-8859-1 with some differences. Supposedly, common as well.
I wanted to try and run this, maybe rebased onto latest master, and see if I had same issues with tests failing. Blocked because the python3.3 testenv isn't working, to begin with ... Agree that it doesn't seem to be from the contents of this PR though, appears like it could have been an existing problem ... |
OK. I got it working with pyenv...
This works on bb57c22 ( @obskyr this is your commit) ... as well as on master. In both cases all pass:
So IMHO this would be ready for merge?? |
This could be merged until a better solution is available. Just used it for a few days and seems fine. |
Now pythonwhois no longer raises a
UnicodeDecodeError
whenever a response uses a non-UTF-8 encoding. It instead goes through UTF-8 and ISO-8859-1, and is easily extensible to other encodings Python supports. If it goes through the entire list of encodings without managing to decode the page, it raises aValueError
with the server in it for easy error spotting.This should fix issue #57, but I'll leave that determination to someone else. As noted in that issue's comments, there could be other ways to do this.