content() can't handle Chinese text with encoding "GB2312" correctly #209

bluaze · 2015-03-14T16:32:57Z

This problem arises in the function parse_text, which presumes that every encoding returned by iconvlist() is either upper-case or also listed in upper-case.
Unfortunately, iconvlist() returns "gb2312" but not "GB2312", so content() in turn fails to handle this encoding correctly.
GB2312 is the most widely used encoding on Chinese websites.
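To illustrate the kind of check that would avoid this, here is a minimal sketch of a case-insensitive lookup against iconvlist(). The helper name find_encoding and its `known` argument are hypothetical and are not taken from httr's source; the sketch only shows the comparison idea.

```r
# Hypothetical helper, not httr's actual code: look up an encoding name
# in the list of known encodings while ignoring case.
find_encoding <- function(encoding, known = iconvlist()) {
  # Compare upper-cased copies so "GB2312" matches "gb2312"
  idx <- match(toupper(encoding), toupper(known))
  # Return the spelling that iconv reports, or NA if there is no match
  if (is.na(idx)) NA_character_ else known[idx]
}

# Example with an explicit list, mimicking a system that only lists "gb2312"
find_encoding("GB2312", known = c("UTF-8", "gb2312"))
```

Returning the spelling reported by iconvlist() means the caller can pass that exact name on to iconv().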

bluaze · 2015-03-15T03:34:51Z

Additionally, in the parse_auto function, the encoding parameter is not passed on to the parser function in the last line. That might be fine for JPEG, PNG, etc., but for HTML and XML, the parser functions in the XML package that are eventually called will not handle the text correctly, even when the documents declare their encoding themselves.

Of course, there is a workaround: content(xxx, "text") %>% htmlParse(encoding="yyy"), presuming the bug I mentioned above has been fixed, but it's not consistent.

bluaze · 2015-06-27T13:53:22Z

Sorry about my second comment; I didn't make it clear enough, and it may contain mistakes.

What I meant is that, because the XML package (or perhaps ultimately libxml2) mishandles non-English characters (at least Chinese ones), we may get a wrong result even when the text has already been converted to UTF-8, unless we pass the 'encoding' parameter to 'htmlParse' or 'xmlParse' explicitly.

Since you already convert all text to UTF-8 before calling htmlParse or xmlParse, all that is needed is to add the 'encoding="UTF-8"' parameter to each htmlParse or xmlParse call made by the parser functions.
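A rough sketch of what is described here, using a tiny hand-built GB2312 page instead of a real response. The variable names are made up for illustration, the conversion uses base iconv() rather than httr's internals, and the parsing step only runs if the XML package is installed.

```r
# Bytes of the two Chinese characters U+4E2D U+6587 encoded as GB2312
gb_bytes <- as.raw(c(0xD6, 0xD0, 0xCE, 0xC4))

# A minimal HTML page whose title is stored in GB2312
page_raw <- c(
  charToRaw("<html><head><title>"),
  gb_bytes,
  charToRaw("</title></head><body></body></html>")
)

# Step 1: convert the raw GB2312 bytes to a UTF-8 string
html_utf8 <- iconv(list(page_raw), from = "GB2312", to = "UTF-8")

# Step 2: tell the parser explicitly that the text is UTF-8
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(html_utf8, asText = TRUE, encoding = "UTF-8")
  print(XML::xmlValue(XML::xmlRoot(doc)))
}
```

Because the text is always UTF-8 after step 1, the encoding given to the parser can be a fixed "UTF-8" regardless of the encoding the page was served in.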

So, no longer autoparsing text formats into text first would not help with this problem. Actually, autoparsing text formats into text first is a beautiful design, at least from my point of view.

hadley · 2015-07-04T08:52:55Z

I don't think that's correct from my reading of the libxml2 documentation. If it still doesn't work correctly, can you please provide a minimal test case?

bluaze · 2015-07-04T09:56:25Z

library(httr)
library(XML)
library(dplyr)

# Here is an example

a <- GET("http://www.mof.gov.cn/zhengwuxinxi/caizhengxinwen/201506/t20150626_1261410.html")

# The HTML code

content(a, "text", encoding = "GB2312")

# With the encoding parameter

content(a, "text", encoding = "GB2312") %>% htmlParse(encoding = "UTF-8") %>% xmlRoot() %>% xmlValue()

# Without the encoding parameter, we get nothing

content(a, "text", encoding = "GB2312") %>% htmlParse() %>% xmlRoot() %>% xmlValue()

# Another example

GET("http://finance.sina.com.cn/china/20150704/122922591331.shtml") %>% content("text", encoding = "gb2312") %>% htmlParse(encoding = "utf-8") %>% xmlRoot() %>% xmlValue()
GET("http://finance.sina.com.cn/china/20150704/122922591331.shtml") %>% content("text", encoding = "gb2312") %>% htmlParse() %>% xmlRoot() %>% xmlValue()

# Sometimes httr doesn't work without the encoding parameter

GET("http://world.huanqiu.com/article/2015-07/6849047.html?from=bdwz") %>% content("text")
GET("http://world.huanqiu.com/article/2015-07/6849047.html?from=bdwz") %>% content("text", encoding = "UTF-8")

# And sometimes the encoding should be specified in both content() and htmlParse()

GET("http://www.pbc.gov.cn") %>% content("text")
GET("http://www.pbc.gov.cn") %>% content("text", encoding = "UTF-8")
GET("http://www.pbc.gov.cn") %>% content("text", encoding = "UTF-8") %>% htmlParse() %>% xmlRoot() %>% xmlValue()
GET("http://www.pbc.gov.cn") %>% content("text", encoding = "UTF-8") %>% htmlParse(encoding = "utf-8") %>% xmlRoot() %>% xmlValue()

hadley · 2015-12-17T17:02:15Z

I'm pretty sure the switch to xml2 should fix all these encoding problems.

hadley added a commit that referenced this issue May 4, 2015

Make encoding check case insensitive. #209

77a3f7b

hadley closed this as completed in 9f8d21e May 4, 2015

hadley reopened this Jul 4, 2015

hadley closed this as completed in b529686 Dec 17, 2015
