Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding issues #97

Open
joepie91 opened this issue Sep 6, 2015 · 11 comments
Open

Encoding issues #97

joepie91 opened this issue Sep 6, 2015 · 11 comments

Comments

@joepie91
Copy link
Owner

joepie91 commented Sep 6, 2015

This is the canonical issue for the outstanding encoding issues in python-whois. It supersedes all the currently open issues; refer to those individual issues for more background.

Any help on resolving these would be much appreciated. Please leave a note on this thread if you plan on working on the issue, and explain what your suggested approach is. Also feel free to comment on the pending pull requests.

If you plan on submitting a new PR: Please test your solution for all the usecases below. Refactoring code to make it work well is perfectly okay and even desirable, even if it leads to large diffs. I'd like to have the absolute minimum amount of encoding-related code in the library code, to prevent these issues in the future.

Past work on this issue:

Known-problematic domains and IPs:

Note: IP WHOIS is not officially supported, but the listed IPs can serve as useful testcases regardless.

  • vatanim.com.tr
  • english.hk
  • bidtheatre.com
  • sesise.com
  • gcbcapp.com
  • rikke.fi
  • eldinamo.cl
  • ziddu.com
  • network.hu
  • biobiochile.cl
  • primicias.cl
  • planalto.gov.br (reported not to occur when reading the data from a file)
  • 179.175.242.131
  • 80.148.135.28
  • 177.169.129.223
  • 80.148.135.28
  • 181.1.55.141
  • 91.6.97.224
  • 179.246.105.144
  • 94.92.58.77

Also make sure to test the .co.th domains that are already in the python-whois test cases.

Usecases to keep in mind

Issues for which the usecase was unclear: #92

Caveats / Requirements

  • Must work on both Python 2 and Python 3 equivalently. The different string types across versions are a potential issue, and the socket module will return a different kind of string depending on the version. Currently supported versions are 2.6.x, 2.7.x, 3.3.x, 3.4.x.
  • Testing network-related issues (eg. encoding from source WHOIS server) can be challenging, because of often very strict rate limits. For the same reason, this can't be an automated test.
  • Not all WHOIS servers return Unicode or even ASCII. Some older ccTLD servers will return data in their local encoding.
@floer32
Copy link

floer32 commented Apr 6, 2016

When you say, "Not all WHOIS servers return Unicode or even ASCII," do you mean, "Not all WHOIS servers return [UTF-8] or even ASCII"? Not trying to nit pick, just trying to make sure I'm not missing something in that statement

@floer32
Copy link

floer32 commented Apr 6, 2016

What about something that tries most-common-encodings, then falls back to trying other encodings, and mixed encodings? Like PR #59 but instead of trying just 2, straight-up, use a library that has more complex behavior for dealing with weird cases — complex like a web browser, which actually results in "simple" results in a way. Such "browser" like behavior can be lossy on the edge cases (mojibake-ish) but since there are so many edge cases possible with WHOIS, maybe this is a way to catch the rest we haven't found or simulated yet. ("the internet is edge cases," says a colleague; and especially so in areas like this, since the WHOIS RFC doesn't specify encoding IIRC)

Suitable libraries, if approach is pursued:

FWIW, ftfy points out that it is very different from chardet.

I know this would be introducing a dependency, and pywhois doesn't really have deps right now. I think this is worth it though. #59-ish solutions will solve some cases, but if we try to solve things like ftfy can, without just using ftfy (or something like it) ... we will slowly wind up re-inventing something similar, I think.

What do you think ... should I give one a try?


Apologies, just saw #64. ^ cchardet is a bit more contentious because it's a C dependency and might have OS portability issues. might. Probably OK, but in any case, I believe ftfy is pure Python...

This is a very, very worthwhile reason to introduce a dependency. The problem is very general, and these libraries offer battle-hardened solutions. It will be a waste to do that work again. I really think using a library should be considered!

@floer32
Copy link

floer32 commented Apr 6, 2016

How about this: optional dependencies. try to import the library, fall back to a simple implementation like the one in #59. README can just explain about the optional dependencies. Then it could even be a "package extra" for even smoother optional-installation-of-enhanced-unicode-edge-case-support. Some people will choose it, maybe some won't. It would just be a difference of pip install python-whois (for no extras) vs. pip install python-whois[extras] (get the extra dep, like chardet or ftfy). more info

Actually, come to think of it, this is exactly how BeautifulSoup's UnicodeDammit module works: "Unicode, Dammit’s guesses will get a lot more accurate if you install the chardet or cchardet Python libraries." I could see #64 being modified to be just like this.

I've just realized ftfy works with a smaller range of versions than python-whois. That could still be OK ... again if it's an "extra" and some users will just not-have it and we won't-use it. But maybe this is a reason to prefer chardet or cchardet ... wider support.

I will submit a PR doing somethin like described above, which may also combine/obviate #59 and #64 but w/o introducing mandatory deps

@floer32
Copy link

floer32 commented Apr 15, 2016

OK, I've got code changes and all tests have passed w/ no external libs, with chardet, and with cchardet ... now I just need to go back tomorrow and prove it all and show my work, then will submit PR showing all that

@floer32
Copy link

floer32 commented Apr 16, 2016

Yikes. I've run into some pretty serious dragons ... if you make everything unicode, you run into problems with the data loaded at the top of parse.py. csv doesn't support unicode. There are blobs of code that might address this without a dependency, but I have not gotten them working right ... kind of a mess.

There is a library out there that supports this, python-unicodecsv. It is compatible with "2.6, 2.7, 3.3, 3.4, 3.5, and pypy 2.4.0." After struggling with csv-and-unicode for a bit now, that seems like the smartest path ... will you consider a PR that adds unicodecsv as a mandatory dependency, or is that not OK no matter what?

@floer32
Copy link

floer32 commented Oct 3, 2016

Hey just checking in here. Any thoughts?

@tuler
Copy link

tuler commented Jan 11, 2017

Any progress on this? #64 seems like a good approach, but it's closed...

@floer32
Copy link

floer32 commented Jan 11, 2017

I think my comment above still takes some action, for us to make progress.

I still think we need chardet, cchardet, and/or python-unicodecsv to deal with this. So I agree with you @tuler on the basic approach from #64, but check out my commits, where the user is given the option to choose between chardet and cchardet. I think we should should follow the example set by BeautifulSoup/UnicodeDammit, using chardet or cchardet to do the heavy lifting ... and at the same time we allow things to work even if neither lib is installed. It's just "progressive enhancement" based on what is installed.

So, I still think that's the right idea. But in my testing I found that this only solves part of the problem. Once we can detect character encodings and decode more data, that data gets further in the python-whois codebase, and reaches code that uses the csv module. As explained above, we probably can't finish better Unicode/encodings support without moving beyond this module's limitations. python-unicodecsv could help with this.

@joepie91 , I know you're hesitant to introduce these dependencies, but I don't think these issues will be solved without pulling out "the big guns." In my experience the WHOIS landscape is wild and messy. How do you feel about these 3 dependencies if we carefully manage to keep all of them optional?

@tuler
Copy link

tuler commented Jan 11, 2017

@hangtwenty I agree with you that we need "big guns" for these issues. I've been looking for a good whois library, and was almost giving up (and trying to implement it myself), but I was happy when I found this one. It's the best among the ones I researched, especially because of the whois server "hopping".

But this unicode issue is really big for us because we're in Brazil.

I don't have the same issues with dependencies as @joepie91.

@hummus
Copy link

hummus commented Jan 12, 2017

gotta +1 @hangtwenty and @tuler here regarding optional encoding detection. (I have encountered various use cases with those ancient local-encoding-only whois servs)

Introducing optional dependencies seems like a way to keep everyone happy, no?

But regarding adding a non-stdlib csv dependency, maybe this can be avoided since it seems to only be used for static-data and it can be assumed(?) they're all printable ascii so no need to fancy decode just unicode() or .decode("ascii") if you want unicode everywhere?

@floer32
Copy link

floer32 commented Jan 12, 2017

@hummus doh, that's a good point. I think you're right. That seems like good news actually, hehe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants