Problem with encoding when using pyquery #285

artbataev · 2017-10-25T15:43:37Z

I'm trying to parse an Azerbaijani forum:

from grab import Grab

url = "http://www.disput.az/index.php?app=forums&module=forums&controller=topic&id=1051606"
g = Grab()
g.go(url)
# Select the paragraphs inside each comment body
messages = g.doc.pyquery('[data-role="commentContent"]>p')
for message in messages:
    print(message.text_content())

This code prints garbled characters (the encoding is broken).
Setting Grab's charset/document_charset options has no effect.

I've found this fix:

for message in messages:
    print(str(message.text_content()).encode("iso-8859-1").decode())

But it's rather strange.
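
The fact that this workaround helps suggests (this is an assumption, not something confirmed by inspecting Grab) that the page's UTF-8 bytes are being decoded as ISO-8859-1 somewhere in the pyquery path. A minimal sketch of that round trip, using an illustrative sample string rather than real forum output, and a hypothetical helper named `fix_mojibake`:

```python
# Illustrative sample text; not taken from the forum page.
original = "Azərbaycan dili"

# Simulate the suspected bug: UTF-8 bytes misread as ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")

# The workaround above: recover the raw bytes, then decode them as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")


def fix_mojibake(text):
    """Hypothetical helper: undo a UTF-8-read-as-ISO-8859-1 mix-up.

    Returns the text unchanged if the round trip is not possible,
    which is the case for text that was decoded correctly.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repaired == original)
print(fix_mojibake(garbled) == original)
```

ISO-8859-1 maps every byte to exactly one code point, so encoding the garbled string back to ISO-8859-1 restores the original bytes without loss. That is why the trick works, but it only masks the symptom; the wrong decoding step still happens upstream.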

If I try to do the same thing with requests, everything is OK:

import requests

r = requests.get(url)
print(r.text)

This prints clean Azerbaijani text (HTML).

Is there a proper solution to this problem?

Info:

Ubuntu 16.04
Grab 0.6.38 (current)
Python 3.5.2

Fix #285: pyquery extension parses html incorrectly

lorien added the bug label Apr 14, 2018

lorien closed this as completed in ee9b33a Apr 16, 2018

lorien added a commit that referenced this issue Apr 16, 2018

Merge pull request #308 from lorien/issue_285_pyquery

8a68282

Fix #285: pyquery extension parses html incorrectly

lorien changed the title ~~Problem with encoding~~ Problem with encoding when using pyquery Apr 16, 2018
