content() can't handle Chinese text with encoding "GB2312" correctly #209
Comments
Additionally, in the parse_auto function, the encoding parameter is not passed on to the parser function in the last line. That may be fine for JPEG, PNG, etc., but for HTML and XML, the parser functions in the XML package that are eventually called will not handle the text correctly, even when the documents declare their encoding themselves. Of course, there is a workaround: content(xxx, "text") %>% htmlParse(encoding="yyy"), presuming the bug I mentioned above has been fixed, but it's not consistent.
Sorry about my second comment; I didn't make it clear enough, and it may contain mistakes. What I meant is this: because the XML package (and maybe ultimately libxml2) mishandles non-English characters (at least Chinese), we may get a wrong result even when the text has already been converted to UTF-8, unless we pass the 'encoding' parameter to 'htmlParse' or 'xmlParse' explicitly. Since you already convert all text to UTF-8 before calling htmlParse or xmlParse, all that is needed is to add the 'encoding="UTF-8"' parameter to each htmlParse or xmlParse call made by the parser functions. So, stopping the autoparsing of text formats into text first would not help with this problem. Actually, autoparsing text formats into text first is a beautiful design, at least from my point of view.
I don't think that's correct from my reading of the libxml2 documentation. If it still doesn't work correctly, can you please provide a minimal test case?
library(httr)
library(XML)       # provides htmlParse(), xmlRoot(), xmlValue()
library(magrittr)  # provides %>%

Here is an example:
a <- GET("http://www.mof.gov.cn/zhengwuxinxi/caizhengxinwen/201506/t20150626_1261410.html")

The HTML code:
content(a, "text", encoding="GB2312")

With the encoding parameter:
content(a, "text", encoding="GB2312") %>% htmlParse(encoding="UTF-8") %>% xmlRoot() %>% xmlValue()

Without the encoding parameter, we get nothing:
content(a, "text", encoding="GB2312") %>% htmlParse() %>% xmlRoot() %>% xmlValue()

Another example:
GET("http://finance.sina.com.cn/china/20150704/122922591331.shtml") %>% content("text", encoding="gb2312") %>% htmlParse(encoding="utf-8") %>% xmlRoot() %>% xmlValue()

Sometimes httr doesn't work without the encoding parameter:
GET("http://world.huanqiu.com/article/2015-07/6849047.html?from=bdwz") %>% content("text")

And sometimes the encoding has to be specified in both GET and htmlParse:
GET("http://www.pbc.gov.cn") %>% content("text")
I'm pretty sure the switch to xml2 should fix all these encoding problems. |
This problem arises from the function parse_text, which presumes that every encoding returned by iconvlist() is upper-case or is repeated in upper-case.
Unfortunately, iconvlist() returns "gb2312" but not "GB2312", so content() in turn does not handle the encoding "gb2312" correctly.
GB2312 is the most widely used encoding on Chinese websites.
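
The case mismatch described above could be avoided by comparing encoding names without regard to case. The following is only a minimal sketch of that idea, not httr's actual code: `find_encoding` is a hypothetical helper, and the `available` argument is an assumption added so the lookup can be tried without depending on what the local iconvlist() returns.

```r
# Hypothetical helper (not part of httr): look up an encoding name in the
# list of available encodings, ignoring case, and return the spelling that
# the list itself uses.
find_encoding <- function(encoding, available = iconvlist()) {
  # Compare upper-cased copies so "GB2312" matches "gb2312" and vice versa.
  idx <- match(toupper(encoding), toupper(available))
  if (is.na(idx)) {
    return(NA_character_)  # no such encoding under any capitalisation
  }
  available[[idx]]
}

# A list that only contains the lower-case spelling still matches.
find_encoding("GB2312", available = c("UTF-8", "gb2312"))
# [1] "gb2312"
```

The returned spelling can then be handed to iconv(), so the caller does not need to know which capitalisation the platform lists.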