Problem with encoding when using pyquery #285

Closed
artbataev opened this issue Oct 25, 2017 · 0 comments

artbataev commented Oct 25, 2017

I'm trying to parse an Azerbaijani forum:

from grab import Grab

url = "http://www.disput.az/index.php?app=forums&module=forums&controller=topic&id=1051606"
g = Grab()
g.go(url)
# Select the paragraphs inside each forum comment
messages = g.doc.pyquery('[data-role="commentContent"]>p')
for message in messages:
    print(message.text_content())

This code prints many garbled characters (the encoding is broken).
Setting Grab's charset/document_charset options has no effect.

I've found this fix:

for message in messages:
    # Re-encode the garbled text as ISO-8859-1, then decode the bytes as UTF-8
    print(str(message.text_content()).encode("iso-8859-1").decode("utf-8"))

It works, but it is a rather strange workaround.
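
The fact that this workaround helps is consistent with UTF-8 bytes being decoded as ISO-8859-1 somewhere along the way, though that is only an inference from the symptoms. Below is a minimal sketch of that effect using only the standard library. The sample string and the `fix_mojibake` helper are hypothetical, made up for illustration; they are not forum content and not part of Grab or pyquery.

```python
# Hypothetical sample text; NOT taken from the forum page.
original = "Azərbaycan dili"

# Simulate the suspected bug: UTF-8 bytes decoded as ISO-8859-1.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("iso-8859-1")

# The workaround reverses the mistake: recover the raw bytes, then decode as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")


def fix_mojibake(text):
    """Hypothetical helper: undo a Latin-1 mis-decoding of UTF-8 text.

    Returns the text unchanged if it does not round-trip, so strings that
    were already decoded correctly are left alone.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(garbled != original)
print(repaired == original)
print(fix_mojibake(original) == original)
```

The try/except matters because applying the round trip blindly to text that is already correct can raise an exception.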

If I try to do the same thing with requests, everything is OK:

import requests

r = requests.get(url)
print(r.text)

This prints clean Azerbaijani text (HTML).

Is there a proper solution to this problem?

Info:

Ubuntu 16.04
Grab 0.6.38 (current)
Python 3.5.2

lorien added the bug label Apr 14, 2018
lorien closed this as completed in ee9b33a Apr 16, 2018
lorien added a commit that referenced this issue Apr 16, 2018: "Fix #285: pyquery extension parses html incorrectly"
lorien changed the title from "Problem with encoding" to "Problem with encoding when using pyquery" Apr 16, 2018