broken unicode uses <U+884C> which needs to be escaped #21

flying-sheep · 2015-12-04T15:29:22Z

@takluyver said in IRkernel/IRkernel#224 (comment):

I ran into another unicode issue while testing this. If R thinks it can't display a character, it escapes it like this: <U+884C> (vs Python style \u884c). These sequences are being included raw in the HTML that repr produces, so the browser tries to interpret them as HTML tags and doesn't show anything. repr should probably be escaping strings for the HTML representation.

please tell me what makes R output this.

probably a good idea to html-encode all character arrays before repr_htmling them, but still…

takluyver · 2015-12-04T16:04:09Z

I came across it on Windows - e.g. by trying print("行政法") in a notebook. I would assume that R tries to determine what encoding the system uses, and if that encoding can't handle the code point in question, it escapes it to the <U+884C> format.

Ideally, our output should bypass that unicode escaping and just send the real unicode code points. But either way, strings in the HTML output need to be HTML escaped so that you can use <, > and & in strings and have them display correctly.
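For illustration, the HTML escaping described above could look roughly like this. The helper name `html_escape` is made up for this sketch and is not claimed to be what repr actually uses; it only shows why escaping keeps `<U+884C>` from being read as a tag.

```r
# Hypothetical helper (not repr's actual implementation): escape the three
# characters that are special in HTML text. "&" must be handled first,
# otherwise the "&" in "&lt;" and "&gt;" would be escaped a second time.
html_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

html_escape("<U+884C>")
html_escape("a < b & c > d")
```

With this, the browser receives `&lt;U+884C&gt;` and shows the literal text instead of dropping an unknown tag.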

jankatins · 2016-04-07T14:22:58Z

See also #28

jankatins · 2016-04-08T16:27:03Z

Ok, this works in RStudio:

> print("行政法")
[1] "行政法"

but not in the notebook:

> print("行政法")
[1] "<U+884C><U+653F><U+6CD5>"

flying-sheep · 2016-04-09T14:54:23Z

works for me. also we send encoding now. hmm. are you sure this happens with newest everything?

jankatins · 2016-04-10T10:35:23Z

Still happening with the newest everything...

@flying-sheep You are on a non-windows system?

jankatins · 2016-04-10T10:42:00Z

Found this blog post mentioning the problem, but haven't looked deep enough to understand what's going on... https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/

flying-sheep · 2016-04-10T15:12:12Z

You are on a non-windows system?

Yeh

jankatins · 2016-04-10T17:08:00Z

Yikes:

x = "A行政法ß"
nchar(x)
x

26
"A<U+884C><U+653F><U+6CD5>ß"

My interpretation is that the string is already wrong when it comes in?
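The count of 26 is consistent with each CJK character having already been replaced by its 8-character `<U+XXXX>` form before `nchar()` runs. A quick check of the arithmetic, using only ASCII so it behaves the same on any platform:

```r
# One escaped code point is 8 characters long: "<", "U", "+", 4 hex digits, ">".
escaped <- "<U+884C>"
per_escape <- nchar(escaped)

# "A" (1) + three escaped characters (3 * 8) + the final "ß" (1)
total <- 1 + 3 * per_escape + 1
total  # 26
```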

jankatins · 2016-04-10T18:57:59Z

Even clearer:

"法\u8FDB"

Produces this:

"<U+6CD5>进"

jankatins · 2016-04-10T20:29:20Z

My current guess is that this is happening in evaluate -> see last element...

Input cell:

x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)

Output in the notebook:

8
1
"<U+6CD5>"
"进"
[1] "<U+6CD5>"
[1] "<U+8FDB>"

Using IRkernel/IRkernel#293, this is what ends up in the log file:

2016-04-10 22:26:10 DEBUG: main loop: after poll
2016-04-10 22:26:10 DEBUG: main loop: shell
2016-04-10 22:26:10 DEBUG: Sending msg status
2016-04-10 22:26:10 DEBUG: Sending msg execute_input
2016-04-10 22:26:10 DEBUG: Executing code: x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] 8"
 $ text/html    : chr "8"
 $ text/markdown: chr "8"
 $ text/latex   : chr "8"
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] 1"
 $ text/html    : chr "1"
 $ text/markdown: chr "1"
 $ text/latex   : chr "1"
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] \"<U+6CD5>\""
 $ text/html    : chr "\"&lt;U+6CD5&gt;\""
 $ text/markdown: chr "\"&lt;U+6CD5&gt;\""
 $ text/latex   : chr "\"<U+6CD5>\""
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] \"<U+8FDB>\""
 $ text/html    : chr "\"<U+8FDB>\"""| __truncated__
 $ text/markdown: chr "\"<U+8FDB>\"""| __truncated__
 $ text/latex   : chr "\"<U+8FDB>\"""| __truncated__
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Stream output: [1] "<U+6CD5>"

2016-04-10 22:26:10 DEBUG: Sending msg stream
2016-04-10 22:26:10 DEBUG: Stream output: [1] "<U+8FDB>"

2016-04-10 22:26:10 DEBUG: Sending msg stream
2016-04-10 22:26:10 DEBUG: Sending msg status
2016-04-10 22:26:10 DEBUG: Sending msg execute_reply
2016-04-10 22:26:10 DEBUG: main loop: beginning

Guess:

8 # it's fine when it comes from zmq (see log), but it's already screwed up when it gets executed
1 # evaluate parses the unicode escape to a single value -> everything is fine
"<U+6CD5>" # ditto
"进" # printing in the context of the kernel of a returned value is ok
[1] "<U+6CD5>" # no change...
[1] "<U+8FDB>" # but printing in evaluate will screw up the unicode again

So it looks like evaluate needs some encoding, both in and out?

jankatins · 2016-04-10T20:42:17Z

https://stat.ethz.ch/R-manual/R-devel/library/base/html/source.html -> Encoding section

This is what I get on my Windows R:

> localeToCharset()
[1] "ISO8859-1"

And this is what I get on my NAS (linux based, hadleyverse docker image):

> localeToCharset()
[1] "UTF-8"     "ISO8859-1"
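A small sketch for checking the same thing programmatically. `l10n_info()` and `utils::localeToCharset()` are base R; the variable names and the wording of the messages are just for this example.

```r
# Ask R what it believes about the current locale's encoding.
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

# Same call as in the comment above; may return more than one candidate (or NA).
charsets <- utils::localeToCharset()

if (is_utf8) {
  message("Native encoding is UTF-8: ", paste(charsets, collapse = ", "))
} else {
  message("Native encoding is not UTF-8 (", paste(charsets, collapse = ", "),
          "); characters outside it may be printed as <U+XXXX>")
}
```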

jankatins · 2016-04-10T20:51:54Z

And here is an example of the evaluate problem (both executed in an RStudio window...):

library(evaluate)

code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"

l = list()
txt <- function(o, type) {
  t <- paste(o, collapse = '\n')
  l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity, 
                         text = function(o) txt(o, "text"), 
                         graphics = identity,
                         message = identity, 
                         warning = identity, 
                         error = identity, 
                         value = identity)

x <- evaluate(code, output_handler = oh)
l

Windows:

> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '<U+6CD5>', y = '\u8FDB', print(nchar(x)), print(nchar(y)), 
    print(x), print(y))
> l
[[1]]
[1] "[1] 8\n" #> bad in

[[2]]
[1] "[1] 1\n" # ok if escaped...

[[3]]
[1] "[1] \"<U+6CD5>\"\n" # -> Just the bad in

[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> but here it's bad out...

Linux (NAS):

> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '法', y = '\u8FDB', print(nchar(x)), print(nchar(y)), 
    print(x), print(y))
> l
[[1]]
[1] "[1] 1\n"

[[2]]
[1] "[1] 1\n"

[[3]]
[1] "[1] \"法\"\n"

[[4]]
[1] "[1] \"进\"\n"

jankatins · 2016-04-10T21:13:02Z

And even further down for the input problem:

Windows:

> parse(text='"法 \\u8FDB"')
expression("<U+6CD5> \u8FDB")

Linux:

> parse(text='"法 \\u8FDB"')
expression("法 \u8FDB")

jankatins · 2016-04-10T22:43:23Z

If someone wants to have fun: c sources of parse: https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/main/source.c#L193

I tried to set my locale, but everything I tried was rejected by Sys.setlocale(...).

flying-sheep · 2016-04-11T08:21:35Z

thanks for digging into this. i think you were almost there. parse has an encoding argument.

i filed r-lib/evaluate#66.

depending on how it is resolved (automatic/manually) we might need to extract and specify the encoding when calling a fixed/enhanced version of evaluate or not.

jankatins · 2016-04-11T09:01:24Z

I tried that argument and it didn't make any difference :-(
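For reference, a minimal way to try the `encoding` argument looks like the sketch below. According to `?parse`, the argument only marks string constants as Latin-1 or UTF-8 and does not re-encode the input, which may be why it made no difference here. The result is platform-dependent, so no expected output is given.

```r
# Build the test program from an escape so this snippet itself stays ASCII.
code <- enc2utf8("x <- '\u6cd5'")

# `encoding` marks string constants as UTF-8; it does not re-encode the input.
expr <- parse(text = code, encoding = "UTF-8", keep.source = FALSE)

# expr[[1]] is the call `x <- '...'`; its third element is the string constant.
value <- expr[[1]][[3]]

# 1 if the character survived parsing, 8 if it was turned into "<U+6CD5>".
n <- nchar(value)
n
```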

jankatins · 2016-04-11T11:04:03Z

I updated r-lib/evaluate#66 with code examples which demonstrate what goes wrong here...

jankatins · 2016-04-21T15:05:17Z

Current status here: it's an upstream bug and we have some workarounds (warn on unicode input and don't send the ellipsis char on such systems). So not a blocker for the next release IMO -> restore the milestone if you have a different opinion...

takluyver · 2016-04-21T16:19:37Z

But HTML output is now being escaped, right? So you can at least see <U+884C>?

jankatins · 2016-04-21T17:29:40Z

But HTML output is now being escaped, right? So you can at least see <U+884C>?

Yes and no: yes because html is escaped and no, because of #43 I see three dots (=3 chars).

But "OUT" is not the problem: you always see something, it's just escaped in the funny <U+xxxx> and therefore not C&P-able... "IN" is the bigger problem, but that was taken care of in IRkernel/IRkernel#296
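If the escaped output has to be copied anyway, the `<U+xxxx>` tokens can be turned back into real characters after the fact. This is only a sketch of a possible workaround, not something IRkernel or repr does; `unescape_u` is a hypothetical helper name.

```r
# Hypothetical helper: replace every "<U+XXXX>" token in a string with the
# code point it stands for. Assumes 4 to 8 hex digits per token.
unescape_u <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,8}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    vapply(tokens, function(tok) {
      # Strip the leading "<U+" and the trailing ">" to get the hex digits.
      hex <- substr(tok, 4, nchar(tok) - 1)
      intToUtf8(strtoi(hex, base = 16L))
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

restored <- unescape_u("A<U+884C><U+653F><U+6CD5>")
utf8ToInt(restored)  # code points of the restored string
```

Whether the restored string then prints correctly still depends on the session's locale, so this only helps when the result is written out as UTF-8 rather than printed.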

flying-sheep · 2022-06-27T08:08:09Z

Since version 4.2, R supports UTF-8 on Windows. Is there anything one needs to do there, or will it just work?

takluyver mentioned this issue Feb 5, 2016

Release 0.7 IRkernel/IRkernel#257

Closed

jankatins added the bug label Apr 7, 2016

jankatins added this to the 0.5 milestone Apr 7, 2016

jankatins mentioned this issue Apr 7, 2016

abbreviated output for data.frame is wrong #28

Closed

jankatins mentioned this issue Apr 10, 2016

Log2file IRkernel/IRkernel#293

Merged

jankatins removed this from the 0.5 milestone Apr 21, 2016

jankatins added upstream-bug and removed bug labels May 9, 2016

flying-sheep mentioned this issue Jun 27, 2022

unicode char which cannot be displayed #152

Closed
