Encoding issues #97
When you say, "Not all WHOIS servers return Unicode or even ASCII," do you mean, "Not all WHOIS servers return [UTF-8] or even ASCII"? Not trying to nitpick, just trying to make sure I'm not missing something in that statement.
What about something that tries the most common encodings first, then falls back to trying other encodings, and mixed encodings? Like PR #59, but instead of trying just two encodings straight up, use a library with more complex behavior for dealing with weird cases: complex the way a web browser is, which actually produces "simple" results in practice. Such browser-like behavior can be lossy on the edge cases (mojibake-ish), but since there are so many edge cases possible with WHOIS, maybe this is a way to catch the rest we haven't found or simulated yet. ("The internet is edge cases," says a colleague, and that's especially true in areas like this, since the WHOIS RFC doesn't specify an encoding, IIRC.) Suitable libraries, if this approach is pursued:
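The browser-like, lossy-but-total strategy described above could be sketched without any third-party library at all. This is only a minimal illustration (the helper name is hypothetical, not python-whois API): try strict common encodings first, then fall back to ISO-8859-1, which accepts every byte sequence, so the worst case is mojibake rather than a crash.

```python
def decode_whois_response(raw: bytes) -> str:
    """Hypothetical helper sketching a lossy-but-total decode strategy.

    Try the strict, common encodings first; fall back to ISO-8859-1,
    which maps every possible byte, so the worst case is mojibake
    rather than a UnicodeDecodeError.
    """
    for encoding in ("utf-8", "iso-8859-1"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 never raises, so this line is unreachable in practice.
    return raw.decode("utf-8", errors="replace")
```

A dedicated detection library would make smarter guesses for non-Latin responses, but even this two-step fallback never raises.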
FWIW, I know this would introduce a dependency, and pywhois doesn't really have deps right now. I think it's worth it, though. #59-ish solutions will solve some cases, but if we try to solve things like What do you think ... should I give one a try? Apologies, just saw #64. ^ This is a very, very worthwhile reason to introduce a dependency. The problem is very general, and these libraries offer battle-hardened solutions. It would be a waste to do that work again. I really think using a library should be considered!
How about this: optional dependencies. Actually, come to think of it, this is exactly how BeautifulSoup's I've just realized I will submit a PR doing something like described above, which may also combine/obviate #59 and #64, but without introducing mandatory deps.
OK, I've got code changes and all tests have passed with no external libs, with
Yikes. I've run into some pretty serious dragons ... if you make everything Unicode, you run into problems with the data loaded at the top of There is a library out there that supports this,
Hey, just checking in here. Any thoughts?
Any progress on this? #64 seems like a good approach, but it's closed...
I think my comment above still calls for some action before we can make progress. I still think we need So, I still think that's the right idea. But in my testing I found that this only solves part of the problem. Once we can detect character encodings and decode more data, that data gets further in the @joepie91, I know you're hesitant to introduce these dependencies, but I don't think these issues will be solved without pulling out "the big guns." In my experience the WHOIS landscape is wild and messy. How do you feel about these three dependencies if we carefully keep all of them optional?
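The optional-dependency arrangement being proposed could be sketched roughly like this, assuming chardet as the detector (the `guess_decode` name is illustrative, not python-whois API): use the library if it is installed, and degrade to a naive decode if it isn't, so the package gains no mandatory deps.

```python
# Optional-dependency pattern: the detector is used only if available.
try:
    import chardet  # optional third-party encoding detector
except ImportError:
    chardet = None


def guess_decode(raw: bytes) -> str:
    """Illustrative helper: detect-and-decode when chardet is present,
    otherwise fall back to UTF-8 with replacement characters."""
    if chardet is not None:
        guess = chardet.detect(raw)
        encoding = guess.get("encoding") or "utf-8"
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            pass  # bad guess; fall through to the safe default
    return raw.decode("utf-8", errors="replace")
```

BeautifulSoup takes a similar stance with its optional parsers: the package works out of the box, and installing an extra improves results.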
@hangtwenty I agree with you that we need "big guns" for these issues. I've been looking for a good WHOIS library and had almost given up (and started trying to implement one myself), but I was happy when I found this one. It's the best among the ones I researched, especially because of the WHOIS server "hopping". But this Unicode issue is really big for us because we're in Brazil. I don't have the same reservations about dependencies as @joepie91.
Gotta +1 @hangtwenty and @tuler here regarding optional encoding detection. (I have encountered various use cases with those ancient local-encoding-only WHOIS servers.) Introducing optional dependencies seems like a way to keep everyone happy, no? But regarding adding a non-stdlib csv dependency: maybe this can be avoided, since it seems to be used only for static data, and it can be assumed(?) that it's all printable ASCII, so there's no need for fancy decoding, just
@hummus D'oh, that's a good point. I think you're right. That seems like good news, actually, hehe.
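@hummus's suggestion amounts to something like the following for the bundled static data. The sample rows here are illustrative, not the real bundled file: since the data ships with the library and is assumed to be printable ASCII, a strict stdlib decode suffices and no detection library is needed on that path.

```python
import csv
import io

# Illustrative sample of the kind of static mapping data the library
# bundles (assumed contents, not the real file).
raw = b"tld,server\ncom,whois.verisign-grs.com\n"

# The bundled data is assumed printable ASCII, so decode strictly;
# a UnicodeDecodeError here would mean that assumption is wrong and
# should be revisited.
text = raw.decode("ascii")
rows = list(csv.reader(io.StringIO(text)))
```

The encoding-detection machinery then only has to run on network responses, where the uncertainty actually lives.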
This is the canonical issue for the outstanding encoding problems in python-whois. It supersedes all the currently open issues; refer to those individual issues for more background.
Any help on resolving these would be much appreciated. Please leave a note on this thread if you plan on working on the issue, and explain what your suggested approach is. Also feel free to comment on the pending pull requests.
If you plan on submitting a new PR: please test your solution against all the use cases below. Refactoring code to make it work well is perfectly okay and even desirable, even if it leads to large diffs. I'd like to have the absolute minimum amount of encoding-related code in the library, to prevent these issues in the future.
Past work on this issue:
Known-problematic domains and IPs:
Note: IP WHOIS is not officially supported, but the listed IPs can serve as useful test cases regardless.
Also make sure to test the `.co.th` domains that are already in the python-whois test cases.

Use cases to keep in mind:
- `pwhois --raw domain.com > test/data/domain.com` (writes to stdout correctly)
- `pwhois --json domain.com` (Crashes on utf-8 encoding for JSON #88)
- `pwhois domain.com` (Unicode parsing problems #28, Crashes on utf-8 encoding for JSON #88)
- `get_whois` (parsed) (Unicode parsing problems #28, Decoding issues for bidtheatre.com #51, Error while parsing some domains #57)
- `net.get_whois_raw` (raw)
- `parse.parse_raw_whois` (parsed)
- `./test.py update`
- `./test.py run`
- `./test.py run`

Issues for which the use case was unclear: #92
Caveats / Requirements

- The `socket` module will return a different kind of string depending on the Python version. Currently supported versions are `2.6.x`, `2.7.x`, `3.3.x`, `3.4.x`.
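The socket caveat above means callers may receive `bytes` (Python 3) or a byte-string `str` (Python 2) from the network layer. A small normalizing shim, sketched here as an assumed helper rather than actual library code, keeps the rest of the codebase dealing only in text:

```python
def to_text(data):
    """Normalize socket output to text.

    On Python 3, socket.recv() returns bytes; on Python 2 it returned a
    byte-string str. Decode only when we actually hold bytes, using a
    lossy fallback so malformed responses cannot crash the caller.
    """
    if isinstance(data, bytes):
        return data.decode("utf-8", errors="replace")
    return data
```

Funneling all socket output through one such chokepoint would also give the encoding-detection logic a single place to live, in line with the goal of keeping encoding-related code to a minimum.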