broken unicode uses <U+884C> which needs to be escaped #21
Comments
I came across it on Windows - e.g. by trying Ideally, our output should bypass that unicode escaping and just send the real unicode code points. But either way, strings in the HTML output need to be HTML escaped so that you can use |
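A minimal sketch of the HTML escaping mentioned above, in base R. `html_escape` is a hypothetical helper name and not something the kernel or repr is known to export; `htmltools::htmlEscape()` is an existing alternative if that package is available.

```r
# Hypothetical helper: escape the characters that are special in HTML.
# "&" must be replaced first so that later replacements are not double-escaped.
html_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub("\"", "&quot;", x, fixed = TRUE)
  x
}

escaped <- html_escape("<U+884C>")
escaped  # "&lt;U+884C&gt;"
```

With this, an escaped code point shows up as literal text in the page instead of being swallowed as an unknown tag.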
See also #28 |
OK, this works in RStudio:
> print("行政法")
[1] "行政法"
but not in the notebook:
> print("行政法")
[1] "<U+884C><U+653F><U+6CD5>" |
Still happening with the newest everything... @flying-sheep You are on a non-windows system? |
Found this blog post mentioning the problem, but haven't looked deep enough to understand what's going on... https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/ |
Yeh |
Yikes:
x = "A行政法ß"
nchar(x)
x
My interpretation is that the string is already wrong when it comes in? |
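One way to test that interpretation is to compare character counts: a string that arrived intact has one character per code point, while a string that was already replaced by its `<U+XXXX>` escape on the way in is eight ASCII characters long. A minimal sketch in base R; the variable names are purely illustrative.

```r
# An intact string: one code point, three bytes in UTF-8.
intact <- "\u6cd5"
nchar(intact, type = "chars")   # 1
nchar(intact, type = "bytes")   # 3

# The same character after it has been turned into a literal escape.
mangled <- "<U+6CD5>"
nchar(mangled, type = "chars")  # 8
```

If `nchar()` on a single-character input reports 8, the damage happened before evaluation, not during printing.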
Even clearer: "法\u8FDB" produces this:
|
My current guess is that this is happening in evaluate -> see last element...
Input cell:
x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)
Output in the notebook:
Using IRkernel/IRkernel#293, this is what ends up in the file log:
Guess:
So it looks like evaluate needs some encoding, both in and out? |
https://stat.ethz.ch/R-manual/R-devel/library/base/html/source.html -> Encoding section
This is what I get on my Windows R:
And this is what I get on my NAS (Linux-based, hadleyverse Docker image):
|
And here is an example of the evaluate problem (both executed in an RStudio window...):
library(evaluate)
code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"
l = list()
txt <- function(o, type) {
t <- paste(o, collapse = '\n')
l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity,
text = function(o) txt(o, "text"),
graphics = identity,
message = identity,
warning = identity,
error = identity,
value = identity)
x <- evaluate(code, output_handler = oh)
l
Windows:
> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '<U+6CD5>', y = '\u8FDB', print(nchar(x)), print(nchar(y)),
print(x), print(y))
> l
[[1]]
[1] "[1] 8\n" #> bad in
[[2]]
[1] "[1] 1\n" # ok if escaped...
[[3]]
[1] "[1] \"<U+6CD5>\"\n" # -> Just the bad in
[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> but here it's bad out...
Linux (NAS):
> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '法', y = '\u8FDB', print(nchar(x)), print(nchar(y)),
print(x), print(y))
> l
[[1]]
[1] "[1] 1\n"
[[2]]
[1] "[1] 1\n"
[[3]]
[1] "[1] \"法\"\n"
[[4]]
[1] "[1] \"进\"\n" |
And even further down for the input problem:
Windows:
> parse(text='"法 \\u8FDB"')
expression("<U+6CD5> \u8FDB")
Linux:
> parse(text='"法 \\u8FDB"')
expression("法 \u8FDB") |
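Since the escaped form survives parsing on both systems, one possible input-side workaround is to rewrite non-ASCII characters as `\uXXXX` escapes before the text reaches `parse()`. The sketch below is only an illustration under that assumption: `escape_non_ascii` is a hypothetical helper, it takes a single string, and it escapes every non-ASCII character, so it is only safe when those characters occur inside string literals. It does nothing about the output side.

```r
# Hypothetical helper: replace each non-ASCII code point with an R escape.
escape_non_ascii <- function(s) {
  cps <- utf8ToInt(enc2utf8(s))                 # code points of the input
  ascii <- vapply(cps, intToUtf8, character(1)) # each code point as a string
  pieces <- ifelse(
    cps < 128L, ascii,
    ifelse(cps > 0xFFFF,
           sprintf("\\U%08X", cps),             # outside the BMP: \UXXXXXXXX
           sprintf("\\u%04X", cps))             # inside the BMP: \uXXXX
  )
  paste(pieces, collapse = "")
}

escaped <- escape_non_ascii("'\u6cd5'")
escaped                      # the ASCII-only text '\u6CD5'
eval(parse(text = escaped))  # the original one-character string
```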
If someone wants to have fun, the C sources of parse: https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/main/source.c#L193
I tried to set my locale, but everything I tried was rejected by |
thanks for digging into this. i think you were almost there. i filed r-lib/evaluate#66. depending on how it is resolved (automatic/manually) we might need to extract and specify the encoding when calling a fixed/enhanced version of evaluate or not. |
I tried that argument and it didn't make any difference :-( |
I updated r-lib/evaluate#66 with code examples which demonstrate what goes wrong here... |
Current status here: it's an upstream bug and we have some workarounds (warn if unicode input and don't send the ellipsis char on such systems). So not a blocker for the next release IMO -> restore the milestone if you have a different opinion... |
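A rough sketch of what the "warn if unicode input" workaround could look like. This is not the kernel's actual code; the function names and the warning text are made up for illustration.

```r
# Hypothetical check: TRUE if any element of `code` contains a code point
# above the ASCII range.
has_non_ascii <- function(code) {
  any(vapply(code, function(s) {
    any(utf8ToInt(enc2utf8(s)) > 127L, na.rm = TRUE)
  }, logical(1)), na.rm = TRUE)
}

# Hypothetical wrapper: warn, then hand the code back unchanged.
warn_if_non_ascii <- function(code) {
  if (has_non_ascii(code)) {
    warning("Input contains non-ASCII characters; they may be mangled on this system.",
            call. = FALSE)
  }
  invisible(code)
}
```

A real implementation would presumably only warn on systems that show the problem, not everywhere.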
But HTML output is now being escaped, right? So you can at least see |
Yes and no: yes because html is escaped and no, because of #43 I see three dots (=3 chars). But "OUT" is not the problem: you always see something, it's just escaped in the funny |
Since version 4.2, R supports UTF-8 on Windows. Is there anything one needs to do there, or will it just work? |
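Whether it just works is not settled in this thread, but a session can at least report whether it is running in a UTF-8 locale. A minimal base R sketch; `is_native_utf8` is a hypothetical name.

```r
# TRUE when the current session's locale is UTF-8.
is_native_utf8 <- function() {
  isTRUE(l10n_info()[["UTF-8"]])
}

# TRUE when the running R is at least 4.2.0.
r_is_4_2_or_later <- getRversion() >= "4.2.0"

is_native_utf8()
r_is_4_2_or_later
```

On Windows, a UTF-8 session also depends on the Windows version, so both checks can disagree.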
@takluyver said in IRkernel/IRkernel#224 (comment):
please tell me what makes R output this.
probably a good idea to html-encode all character arrays before repr_html-ing them, but still…