
content() can't handle Chinese text with encoding "GB2312" correctly #209

Closed
bluaze opened this issue Mar 14, 2015 · 5 comments

Comments


bluaze commented Mar 14, 2015

This problem arises from the function parse_text, which presumes that every encoding returned by iconvlist() is either upper-case or also listed in upper-case.
Unfortunately, iconvlist() returns "gb2312" but not "GB2312", so content() in turn does not handle the encoding "gb2312" correctly.
GB2312 is the most widely used encoding on Chinese websites.
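
To illustrate the kind of lookup that would avoid this, here is a minimal sketch of a case-insensitive match against the list of available encodings. The helper name find_encoding and the short `known` vector are made up for this example; this is not httr's actual code, only one way such a check could be written.

```r
# Hypothetical helper (not part of httr): look up an encoding name in the
# list of available encodings without regard to case, and return the
# spelling that the list actually uses.
find_encoding <- function(encoding, available = iconvlist()) {
  hit <- match(toupper(encoding), toupper(available))
  if (is.na(hit)) {
    return(NA_character_)
  }
  available[[hit]]
}

# A made-up list standing in for iconvlist(), which varies by platform.
known <- c("UTF-8", "gb2312", "ISO-8859-1")

find_encoding("GB2312", known)            # matches the lower-case entry
find_encoding("no-such-encoding", known)  # no match, so NA
```

Returning the spelling found in the list, rather than the one the caller typed, means the name can be handed to iconv() exactly as the platform reports it.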

Author

bluaze commented Mar 15, 2015

Additionally, in the parse_auto function, the encoding parameter is not passed on to the parser function in the last line. That may be fine for JPEG, PNG, etc., but for HTML and XML, the parser functions in the XML package that are eventually called will not handle the text correctly, even when the documents declare their encoding themselves.

Of course, there is a workaround: content(xxx, "text") %>% htmlParse(encoding="yyy"), presuming the bug I mentioned above has been fixed, but it's not consistent.

Author

bluaze commented Jun 27, 2015

Sorry for my second comment; I didn't make it clear enough, and it may contain mistakes.

What I meant is that the XML package (perhaps ultimately libxml2) mishandles non-English characters (at least Chinese ones). Even when the text has already been converted to UTF-8, we may get a wrong result if we don't pass the 'encoding' parameter to the 'htmlParse' or 'xmlParse' function explicitly.

Since you already convert all text to UTF-8 before htmlParse or xmlParse, all that is needed is to add the 'encoding="UTF-8"' parameter to each htmlParse or xmlParse call made by the parser functions.

So, no longer autoparsing text formats into text first would not help with this problem. Actually, autoparsing text formats into text first is a beautiful design, at least from my point of view.
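
For reference, the "convert to UTF-8 first" step can be sketched in base R with iconv(). This is only an illustration of the idea, not httr's actual implementation; the variable names are made up, and it assumes the platform's iconv supports the name "GB2312".

```r
# The two characters of the Chinese word for "Chinese text", as raw GB2312 bytes.
gb_bytes <- as.raw(c(0xd6, 0xd0, 0xce, 0xc4))
gb_text <- rawToChar(gb_bytes)

# Convert the byte string from GB2312 to UTF-8.
utf8_text <- iconv(gb_text, from = "GB2312", to = "UTF-8")

# Inspect the resulting bytes: each character now takes three bytes.
charToRaw(utf8_text)
```

Once the text is UTF-8, the remaining question in this thread is only whether the parser is told so explicitly.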

Member

hadley commented Jul 4, 2015

I don't think that's correct from my reading of the libxml2 documentation. If it still doesn't work correctly, can you please provide a minimal test case?

Author

bluaze commented Jul 4, 2015

library(httr)
library(XML)
library(dplyr)

Here is an example:

a <- GET("http://www.mof.gov.cn/zhengwuxinxi/caizhengxinwen/201506/t20150626_1261410.html")

The HTML source:

content(a,"text",encoding="GB2312")

With the encoding parameter:

content(a,"text",encoding="GB2312") %>% htmlParse(encoding="UTF-8") %>% xmlRoot() %>% xmlValue()

Without the encoding parameter, we get nothing:

content(a,"text",encoding="GB2312") %>% htmlParse() %>% xmlRoot() %>% xmlValue()

Another example:

GET("http://finance.sina.com.cn/china/20150704/122922591331.shtml") %>% content("text",encoding="gb2312") %>% htmlParse(encoding="utf-8") %>% xmlRoot %>% xmlValue()
GET("http://finance.sina.com.cn/china/20150704/122922591331.shtml") %>% content("text",encoding="gb2312") %>% htmlParse() %>% xmlRoot %>% xmlValue()

Sometimes httr doesn't work without the encoding parameter:

GET("http://world.huanqiu.com/article/2015-07/6849047.html?from=bdwz") %>% content("text")
GET("http://world.huanqiu.com/article/2015-07/6849047.html?from=bdwz") %>% content("text",encoding="UTF-8")

And sometimes the encoding has to be specified in both content() and htmlParse():

GET("http://www.pbc.gov.cn") %>% content("text")
GET("http://www.pbc.gov.cn") %>% content("text",encoding="UTF-8")
GET("http://www.pbc.gov.cn") %>% content("text",encoding="UTF-8") %>% htmlParse() %>% xmlRoot() %>% xmlValue()
GET("http://www.pbc.gov.cn") %>% content("text",encoding="UTF-8") %>% htmlParse(encoding="utf-8") %>% xmlRoot() %>% xmlValue()

hadley reopened this Jul 4, 2015
hadley closed this as completed in b529686 Dec 17, 2015
Member

hadley commented Dec 17, 2015

I'm pretty sure the switch to xml2 should fix all these encoding problems.
