Maybe a better way to fix decoding issues. Fix #51. #64

Closed
huxuan wants to merge 3 commits

huxuan commented Nov 10, 2014

Use cchardet to detect the encoding before decoding.
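
As a rough illustration of the idea (not the code in this pull request), detecting the encoding before decoding could look like the sketch below. `decode_response` and its `detect` and `fallback` parameters are hypothetical names introduced here; the only library call assumed is `cchardet.detect`, which takes bytes and returns a dict with `encoding` and `confidence` keys.

```python
try:
    import cchardet  # third-party package; assumed to be installed
except ImportError:
    cchardet = None


def decode_response(raw, detect=None, fallback="utf-8"):
    """Decode raw response bytes using a detected encoding (hypothetical helper)."""
    if detect is None and cchardet is not None:
        detect = cchardet.detect
    encoding = None
    if detect is not None:
        # The detector returns a dict; "encoding" may be None if detection fails.
        encoding = detect(raw).get("encoding")
    try:
        return raw.decode(encoding or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode leniently instead of failing.
        return raw.decode(fallback, errors="replace")
```

The detector is injectable so that the helper can be exercised without cchardet installed; with the package present, calling `decode_response(raw)` would use `cchardet.detect` directly.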

Output from test.py:

[  1/250] 0007games.com passed in normalized mode.
[  2/250] 0007games.com passed in default mode.
[  3/250] 0031fashion.com passed in normalized mode.
[  4/250] 0031fashion.com passed in default mode.
[  5/250] 123vitamine.com passed in normalized mode.
[  6/250] 123vitamine.com passed in default mode.
[  7/250] 2x4.ru passed in normalized mode.
[  8/250] 2x4.ru passed in default mode.
[  9/250] 365calendars.com passed in normalized mode.
[ 10/250] 365calendars.com passed in default mode.
[ 11/250] 9v.lt passed in normalized mode.
[ 12/250] 9v.lt passed in default mode.
[ 13/250] about.museum passed in normalized mode.
[ 14/250] about.museum passed in default mode.
[ 15/250] abouttubes.com passed in normalized mode.
[ 16/250] abouttubes.com passed in default mode.
[ 17/250] actu.org.au passed in normalized mode.
[ 18/250] actu.org.au passed in default mode.
[ 19/250] alibaba.jp passed in normalized mode.
[ 20/250] alibaba.jp passed in default mode.
[ 21/250] alliancefrançaise.nu passed in normalized mode.
[ 22/250] alliancefrançaise.nu passed in default mode.
[ 23/250] anink.com passed in normalized mode.
[ 24/250] anink.com passed in default mode.
[ 25/250] anonne.ws passed in normalized mode.
[ 26/250] anonne.ws passed in default mode.
[ 27/250] anonnews.org passed in normalized mode.
[ 28/250] anonnews.org passed in default mode.
[ 29/250] aol.com passed in normalized mode.
[ 30/250] aol.com passed in default mode.
[ 31/250] aridns.net.au passed in normalized mode.
[ 32/250] aridns.net.au passed in default mode.
[ 33/250] arkeysolutions.com passed in normalized mode.
[ 34/250] arkeysolutions.com passed in default mode.
[ 35/250] asiahotel.co.th passed in normalized mode.
[ 36/250] asiahotel.co.th passed in default mode.
[ 37/250] atheme.org passed in normalized mode.
[ 38/250] atheme.org passed in default mode.
[ 39/250] australia.gov.au passed in normalized mode.
[ 40/250] australia.gov.au passed in default mode.
[ 41/250] b.ro passed in normalized mode.
[ 42/250] b.ro passed in default mode.
[ 43/250] baligems.co.uk passed in normalized mode.
[ 44/250] baligems.co.uk passed in default mode.
[ 45/250] bäckerei.de passed in normalized mode.
[ 46/250] bäckerei.de passed in default mode.
[ 47/250] bidtheatre.com passed in normalized mode.
[ 48/250] bidtheatre.com passed in default mode.
[ 49/250] blackburn.ac.uk passed in normalized mode.
[ 50/250] blackburn.ac.uk passed in default mode.
[ 51/250] bristol.ac.uk passed in normalized mode.
[ 52/250] bristol.ac.uk passed in default mode.
[ 53/250] bts.co.th passed in normalized mode.
[ 54/250] bts.co.th passed in default mode.
[ 55/250] byme.at passed in normalized mode.
[ 56/250] byme.at passed in default mode.
[ 57/250] communigal.net passed in normalized mode.
[ 58/250] communigal.net passed in default mode.
[ 59/250] cryto.net passed in normalized mode.
[ 60/250] cryto.net passed in default mode.
[ 61/250] daemonrage.net passed in normalized mode.
[ 62/250] daemonrage.net passed in default mode.
[ 63/250] davicom.com.tw passed in normalized mode.
[ 64/250] davicom.com.tw passed in default mode.
[ 65/250] defunctkernel.me passed in normalized mode.
[ 66/250] defunctkernel.me passed in default mode.
[ 67/250] direct.gov.uk passed in normalized mode.
[ 68/250] direct.gov.uk passed in default mode.
[ 69/250] dns4pro.com passed in normalized mode.
[ 70/250] dns4pro.com passed in default mode.
[ 71/250] donuts.co passed in normalized mode.
[ 72/250] donuts.co passed in default mode.
[ 73/250] drpciv.biz passed in normalized mode.
[ 74/250] drpciv.biz passed in default mode.
[ 75/250] edis.at passed in normalized mode.
[ 76/250] edis.at passed in default mode.
[ 77/250] engine.com passed in normalized mode.
[ 78/250] engine.com passed in default mode.
[ 79/250] evalsed.info passed in normalized mode.
[ 80/250] evalsed.info passed in default mode.
[ 81/250] example.com passed in normalized mode.
[ 82/250] example.com passed in default mode.
[ 83/250] expopack.com.mx passed in normalized mode.
[ 84/250] expopack.com.mx passed in default mode.
[ 85/250] f63.net passed in normalized mode.
[ 86/250] f63.net passed in default mode.
[ 87/250] formule1fo.com passed in normalized mode.
[ 88/250] formule1fo.com passed in default mode.
[ 89/250] foxiepa.ws passed in normalized mode.
[ 90/250] foxiepa.ws passed in default mode.
[ 91/250] geko.dk passed in normalized mode.
[ 92/250] geko.dk passed in default mode.
[ 93/250] get.moe passed in normalized mode.
[ 94/250] get.moe passed in default mode.
[ 95/250] globallatedeals.com passed in normalized mode.
[ 96/250] globallatedeals.com passed in default mode.
[ 97/250] globaltravelgroup.com passed in normalized mode.
[ 98/250] globaltravelgroup.com passed in default mode.
[ 99/250] google.cn passed in normalized mode.
[100/250] google.cn passed in default mode.
[101/250] google.co.jp passed in normalized mode.
[102/250] google.co.jp passed in default mode.
[103/250] google.co.th passed in normalized mode.
[104/250] google.co.th passed in default mode.
[105/250] google.co.uk passed in normalized mode.
[106/250] google.co.uk passed in default mode.
[107/250] google.com passed in normalized mode.
[108/250] google.com passed in default mode.
[109/250] google.com.tw passed in normalized mode.
[110/250] google.com.tw passed in default mode.
[111/250] google.it passed in normalized mode.
[112/250] google.it passed in default mode.
[113/250] hl3.eu passed in normalized mode.
[114/250] hl3.eu passed in default mode.
[115/250] hopjb.eu passed in normalized mode.
[116/250] hopjb.eu passed in default mode.
[117/250] huskeh.net passed in normalized mode.
[118/250] huskeh.net passed in default mode.
[119/250] hyves.nl passed in normalized mode.
[120/250] hyves.nl passed in default mode.
[121/250] imperial.ac.uk passed in normalized mode.
[122/250] imperial.ac.uk passed in default mode.
[123/250] ireland.ie passed in normalized mode.
[124/250] ireland.ie passed in default mode.
[125/250] ismtgoxdeadyet.com passed in normalized mode.
[126/250] ismtgoxdeadyet.com passed in default mode.
[127/250] jizzbo.com passed in normalized mode.
[128/250] jizzbo.com passed in default mode.
[129/250] keybase.io passed in normalized mode.
[130/250] keybase.io passed in default mode.
[131/250] linux.conf.au passed in normalized mode.
[132/250] linux.conf.au passed in default mode.
[133/250] lowendbox.com passed in normalized mode.
[134/250] lowendbox.com passed in default mode.
[135/250] lowendshare.com passed in normalized mode.
[136/250] lowendshare.com passed in default mode.
[137/250] luka-netconsult.com passed in normalized mode.
[138/250] luka-netconsult.com passed in default mode.
[139/250] microsoft.com passed in normalized mode.
[140/250] microsoft.com passed in default mode.
[141/250] mu.oz.au passed in normalized mode.
[142/250] mu.oz.au passed in default mode.
[143/250] nepasituation.com passed in normalized mode.
[144/250] nepasituation.com passed in default mode.
[145/250] nic.buzz passed in normalized mode.
[146/250] nic.buzz passed in default mode.
[147/250] nic.ir passed in normalized mode.
[148/250] nic.ir passed in default mode.
[149/250] nic.ps passed in normalized mode.
[150/250] nic.ps passed in default mode.
[151/250] nic.pw passed in normalized mode.
[152/250] nic.pw passed in default mode.
[153/250] nic.ru passed in normalized mode.
[154/250] nic.ru passed in default mode.
[155/250] nominet.org.uk passed in normalized mode.
[156/250] nominet.org.uk passed in default mode.
[157/250] nsa.gov passed in normalized mode.
[158/250] nsa.gov passed in default mode.
[159/250] nttpc.co.jp passed in normalized mode.
[160/250] nttpc.co.jp passed in default mode.
[161/250] nytimes.com passed in normalized mode.
[162/250] nytimes.com passed in default mode.
[163/250] oli.id.au passed in normalized mode.
[164/250] oli.id.au passed in default mode.
[165/250] ovh.fr passed in normalized mode.
[166/250] ovh.fr passed in default mode.
[167/250] pcmups.com.tw passed in normalized mode.
[168/250] pcmups.com.tw passed in default mode.
[169/250] pixelmania.asia passed in normalized mode.
[170/250] pixelmania.asia passed in default mode.
[171/250] porn.com.tw passed in normalized mode.
[172/250] porn.com.tw passed in default mode.
[173/250] prq.se passed in normalized mode.
[174/250] prq.se passed in default mode.
[175/250] quadranet.com passed in normalized mode.
[176/250] quadranet.com passed in default mode.
[177/250] realtek.com.tw passed in normalized mode.
[178/250] realtek.com.tw passed in default mode.
[179/250] redd.it passed in normalized mode.
[180/250] redd.it passed in default mode.
[181/250] ricoh.co.th passed in normalized mode.
[182/250] ricoh.co.th passed in default mode.
[183/250] rs.co.th passed in normalized mode.
[184/250] rs.co.th passed in default mode.
[185/250] servequake.com passed in normalized mode.
[186/250] servequake.com passed in default mode.
[187/250] siamparagon.co.th passed in normalized mode.
[188/250] siamparagon.co.th passed in default mode.
[189/250] simpardaz.com passed in normalized mode.
[190/250] simpardaz.com passed in default mode.
[191/250] sina.com.cn passed in normalized mode.
[192/250] sina.com.cn passed in default mode.
[193/250] singularity.fr passed in normalized mode.
[194/250] singularity.fr passed in default mode.
[195/250] starbucks.co.th passed in normalized mode.
[196/250] starbucks.co.th passed in default mode.
[197/250] swisscom.ch passed in normalized mode.
[198/250] swisscom.ch passed in default mode.
[199/250] sydney.edu.au passed in normalized mode.
[200/250] sydney.edu.au passed in default mode.
[201/250] tattitude.co.uk passed in normalized mode.
[202/250] tattitude.co.uk passed in default mode.
[203/250] test.de passed in normalized mode.
[204/250] test.de passed in default mode.
[205/250] textfiles.com passed in normalized mode.
[206/250] textfiles.com passed in default mode.
[207/250] theregister.com passed in normalized mode.
[208/250] theregister.com passed in default mode.
[209/250] tip.it passed in normalized mode.
[210/250] tip.it passed in default mode.
[211/250] toyota.co.th passed in normalized mode.
[212/250] toyota.co.th passed in default mode.
[213/250] twitter.com passed in normalized mode.
[214/250] twitter.com passed in default mode.
[215/250] ufpa.br passed in normalized mode.
[216/250] ufpa.br passed in default mode.
[217/250] unwire.hk passed in normalized mode.
[218/250] unwire.hk passed in default mode.
[219/250] urlte.am passed in normalized mode.
[220/250] urlte.am passed in default mode.
[221/250] via.com.tw passed in normalized mode.
[222/250] via.com.tw passed in default mode.
[223/250] vulnweb.com passed in normalized mode.
[224/250] vulnweb.com passed in default mode.
[225/250] wa.us passed in normalized mode.
[226/250] wa.us passed in default mode.
[227/250] warwick.ac.uk passed in normalized mode.
[228/250] warwick.ac.uk passed in default mode.
[229/250] whirlpool.net.au passed in normalized mode.
[230/250] whirlpool.net.au passed in default mode.
[231/250] whois.com passed in normalized mode.
[232/250] whois.com passed in default mode.
[233/250] whois.us passed in normalized mode.
[234/250] whois.us passed in default mode.
[235/250] whoiser.ir passed in normalized mode.
[236/250] whoiser.ir passed in default mode.
[237/250] winamp.com passed in normalized mode.
[238/250] winamp.com passed in default mode.
[239/250] wosoccer.com passed in normalized mode.
[240/250] wosoccer.com passed in default mode.
[241/250] x.it passed in normalized mode.
[242/250] x.it passed in default mode.
[243/250] xboxmoments.com passed in normalized mode.
[244/250] xboxmoments.com passed in default mode.
[245/250] yahoo.com.tw passed in normalized mode.
[246/250] yahoo.com.tw passed in default mode.
[247/250] yahoo.it passed in normalized mode.
[248/250] yahoo.it passed in default mode.
[249/250] zem.org.uk passed in normalized mode.
[250/250] zem.org.uk passed in default mode.
Timing in default mode: 15ms avg, 0ms min, 52ms max
Timing in normalized mode: 16ms avg, 0ms min, 60ms max
All tests passed!

joepie91 mentioned this pull request Sep 6, 2015

joepie91 (Owner) commented Sep 6, 2015

I'm slightly wary about including an external dependency, given Python's somewhat erratic package management system. Would there be any way to replicate this functionality without using cchardet as a dependency?

Additionally, could you have a look at #97, and see whether your solution works for all use cases on all supported Python versions? I've created a canonical thread there with all the requirements and a list of issues. Thanks!
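
For reference, one dependency-free direction is to try a short list of candidate encodings in order, using only the standard library. This is a minimal sketch, not a replacement for statistical detection: `decode_with_fallbacks` and the candidate list are assumptions made for illustration, and it has not been checked against the requirements collected in #97.

```python
def decode_with_fallbacks(raw, candidates=("utf-8", "iso-8859-1")):
    """Return the first successful decoding of raw bytes (hypothetical helper)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    # No candidate worked: keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")
```

The order matters: UTF-8 goes first because it rejects most non-UTF-8 input, while ISO-8859-1 accepts any byte sequence and therefore acts as a catch-all that can silently mis-decode other legacy encodings.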

huxuan (Author) commented Sep 7, 2015

Thanks for your attention and comments. I will be back once I have made some progress.

huxuan closed this Jun 24, 2016

tuler commented Jan 11, 2017

Any progress on this? Why is this closed?
