reprex output does not have the right encoding on Windows #82

dpprdan · 2017-05-12T14:28:19Z

reprex's output does not have the right encoding on Windows 10 (i.e. it should be declared as UTF-8).

This is the source I am passing to reprex()

# from help(Encoding)
x <- "fa\xE7ile"
Encoding(x)
x
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
c(x, xx)

This is how reprex renders it:

# from help(Encoding)
x <- "fa\xE7ile"
Encoding(x)
#> [1] "latin1"
x
#> [1] "faÃ§ile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
#> [1] "latin1" "UTF-8"
c(x, xx)
#> [1] "faÃ§ile" "faÃ§ile"

But this is what I actually see on my console

> # from help(Encoding)
> x <- "fa\xE7ile"
> Encoding(x)
[1] "latin1"
> x
[1] "façile"
> xx <- iconv(x, "latin1", "UTF-8")
> Encoding(c(x, xx))
[1] "latin1" "UTF-8" 
> c(x, xx)
[1] "façile" "façile"

So, once I do this after passing the source to reprex()

library(magrittr)  # provides %>%

# Declare the text as UTF-8 without changing its bytes
to_utf8 <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

clipr::read_clip() %>% to_utf8() %>% clipr::write_clip()

I get

# from help(Encoding)
x <- "fa\xE7ile"
Encoding(x)
#> [1] "latin1"
x
#> [1] "façile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
#> [1] "latin1" "UTF-8"
c(x, xx)
#> [1] "façile" "façile"

Session info

devtools::session_info("reprex")
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.0 (2017-04-21)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2017-05-12
#> Packages -----------------------------------------------------------------
#>  package   * version    date       source                         
#>  backports   1.0.5      2017-01-18 CRAN (R 3.3.2)                 
#>  base64enc   0.1-3      2015-07-28 CRAN (R 3.3.0)                 
#>  bitops      1.0-6      2013-08-17 CRAN (R 3.3.0)                 
#>  callr       1.0.0      2016-06-18 CRAN (R 3.4.0)                 
#>  caTools     1.17.1     2014-09-10 CRAN (R 3.3.0)                 
#>  clipr       0.3.2      2017-01-09 CRAN (R 3.3.2)                 
#>  digest      0.6.12     2017-01-27 CRAN (R 3.3.2)                 
#>  evaluate    0.10       2016-10-11 CRAN (R 3.3.1)                 
#>  graphics  * 3.4.0      2017-04-21 local                          
#>  grDevices * 3.4.0      2017-04-21 local                          
#>  highr       0.6        2016-05-09 CRAN (R 3.3.0)                 
#>  htmltools   0.3.6      2017-04-28 CRAN (R 3.4.0)                 
#>  jsonlite    1.4        2017-04-08 CRAN (R 3.3.3)                 
#>  knitr       1.15.1     2016-11-22 CRAN (R 3.3.2)                 
#>  magrittr    1.5        2014-11-22 CRAN (R 3.3.0)                 
#>  markdown    0.8        2017-04-20 CRAN (R 3.3.3)                 
#>  methods   * 3.4.0      2017-04-21 local                          
#>  mime        0.5        2016-07-07 CRAN (R 3.3.1)                 
#>  Rcpp        0.12.10    2017-03-19 CRAN (R 3.3.3)                 
#>  reprex      0.1.1.9000 2017-05-10 Github (jennybc/reprex@9bad6f7)
#>  rmarkdown   1.5        2017-04-26 CRAN (R 3.3.3)                 
#>  rprojroot   1.2        2017-01-16 CRAN (R 3.3.2)                 
#>  stats     * 3.4.0      2017-04-21 local                          
#>  stringi     1.1.5      2017-04-07 CRAN (R 3.3.3)                 
#>  stringr     1.2.0      2017-02-18 CRAN (R 3.3.3)                 
#>  tools       3.4.0      2017-04-21 local                          
#>  utils     * 3.4.0      2017-04-21 local                          
#>  whisker     0.3-2      2013-04-28 CRAN (R 3.3.0)                 
#>  yaml        2.1.14     2016-11-12 CRAN (R 3.3.2)


hadley · 2017-05-25T23:00:52Z

This should be fixed by #76

dpprdan · 2017-05-29T09:58:09Z

Looks good to me

x <- "fa\xE7ile"
Encoding(x)
#> [1] "latin1"
x
#> [1] "façile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
#> [1] "latin1" "UTF-8"
c(x, xx)
#> [1] "façile" "façile"

dpprdan · 2017-05-30T08:44:49Z

Err, not quite.

I am assuming that the output from reprex() and reprex_selection() (and reprex_addin by extension) should all be the same.

Well, reprex::reprex("Brüssel") gives me

"Brüssel"
#> [1] "Brüssel"

But with "Brüssel" selected in the RStudio source pane and calling reprex::reprex_selection() I get:

"BrÃ¼ssel"
#> [1] "BrÃ¼ssel"

Again with reprex::reprex("fa\xE7ile")

"façile"
#> [1] "façile"

"fa\xE7ile" with reprex::reprex_selection()

"fa\xE7ile"
#> [1] "façile"

Actually, with "fa\xE7ile" I think reprex_selection() is correct and reprex() is not, because "fa\xE7ile" on the command line gives me

> "fa\xE7ile"
[1] "façile"

yutannihilation · 2017-05-30T13:10:28Z

Thanks, and sorry that my PR did not fully fix this... I get slightly different results, which are similarly problematic.

Due to my locale, Japanese_Japan.932, I cannot render this correctly:

# this result is copy and paste from console
"fa\xe7ile"
#> [1] "fa輅le"

So I use this one instead:

# this result is copy and paste from console
"fa\u00E7ile"
#> [1] "façile"

Then, I got the following results both for reprex() and reprex_selection():

"fa\u00E7ile"
#> [1] "facile"

Note that, when I type "ç" on console, it will be quietly substituted with "c":

# this result is copy and paste from console
"ç"
#> [1] "c"

So, the reason the result above is not "façile" but "facile" seems to be that it was parsed incorrectly, as R usually does :(

hadley · 2017-05-30T15:49:23Z

This seems most likely to be an encoding issue with the RStudio API. Hopefully @kevinushey will have some ideas

dpprdan · 2017-05-30T15:52:35Z

@yutannihilation: Is the following from reprex("fa\u00E7ile") or reprex_selection() or did you just edit it here on github?

"fa\u00E7ile"
#> [1] "façile"

I am just asking because the ç is not a c here.

This

"fa\xe7ile"
#> [1] "fa輅le"

might actually be correct IMHO, assuming that \xe7 is the right encoding for 輅 in Japanese_Japan.932.

What does reprex_selection give you with "Brüssel"?

yutannihilation · 2017-05-30T16:06:31Z

Is the following from reprex("fa\u00E7ile") or reprex_selection() or did you just edit it here on github?

Ah, sorry for confusing you! I copy and paste from my console and edit it here.

assuming that \xe7 is the right encoding for 輅 in Japanese_Japan.932.

No, 輅 is \xe7\x69. You can see i is absorbed into this character.

charToRaw("輅")
#> [1] e7 69
charToRaw("i")
#> [1] 69
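
The absorption of i into the preceding byte can be reproduced without switching locale. The following is a minimal sketch, assuming the local iconv() build recognizes the encoding name "CP932" (that name is an assumption about the platform, not something stated in this thread). It decodes the same six bytes once as a single-byte encoding and once as CP932, then compares the character counts.

```r
# The six bytes of "fa\xe7ile" after the \xe7 escape has been resolved.
bytes <- as.raw(c(0x66, 0x61, 0xe7, 0x69, 0x6c, 0x65))
raw_string <- rawToChar(bytes)

# Single-byte reading: 0xe7 is one character on its own.
as_latin1 <- iconv(raw_string, from = "ISO-8859-1", to = "UTF-8")

# Double-byte reading: 0xe7 is a lead byte, so it pairs with 0x69 ("i").
# iconv() returns NA if this build does not support "CP932".
as_cp932 <- iconv(raw_string, from = "CP932", to = "UTF-8")

nchar(as_latin1)
nchar(as_cp932)
```

If the conversion is supported, the CP932 result should have one character fewer than the ISO-8859-1 result, because two bytes collapse into a single character.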

I got the following for "Brüssel":

"Brﾃｼssel"
#> [1] "Brﾃｼssel"

yutannihilation · 2017-05-30T16:14:08Z

This seems most likely to be an encoding issue with the RStudio API.

I guess we can blame the difference between reprex::reprex() and reprex::reprex_selection() on RStudio API, but I'm afraid it is too tough to preserve escaped characters as is...

yutannihilation · 2017-05-30T16:29:42Z

Good news, RStudio API works fine for me. (but not for @dpprdan?)

Here is the result when I copied/selected the string "fa\xe7ile":

# these results are copied and pasted from console

readLines("clipboard")
#> [1] "\"fa\\xe7ile\""
#> Warning message:
#> In readLines("clipboard") : incomplete final line found on 'clipboard'

rstudioapi::getSourceEditorContext()
#> Document Context: 
#> - id:        '332A20F5'
#> - path:      ''
#> - contents:  <1 rows>
#> Document Selection:
#> - [1, 1] -- [1, 12]: '"fa\\xe7ile"'

kevinushey · 2017-05-30T16:53:52Z

I recall that the rstudioapi package had (has?) a bug wherein we fail to mark the encoding of UTF-8 text, so that text retrieved using the rstudioapi package would not render correctly on Windows (since it would then assume that UTF-8 text was actually encoded in the system encoding).

If that's the case, manually fixing up the encoded text with e.g. Encoding(x) <- "UTF-8" should be a workaround in the interim.
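
The effect described here can be sketched in plain R, without RStudio. This is only an illustration under an assumption: the "misread" step below stands in for whatever consumer treats unmarked UTF-8 bytes as a single-byte Western encoding, and the variable names are made up for the example.

```r
# "Brüssel" written with a \u escape, so R marks it as UTF-8 in any locale.
original <- "Br\u00fcssel"
utf8_bytes <- charToRaw(enc2utf8(original))

# Rebuild the string from its bytes; the UTF-8 mark is lost in the process.
unmarked <- rawToChar(utf8_bytes)

# Simulate a reader that assumes a single-byte Western encoding.
misread <- unmarked
Encoding(misread) <- "latin1"
mojibake <- enc2utf8(misread)   # each UTF-8 byte becomes its own character

# The workaround: declare the bytes as UTF-8; no bytes are changed.
repaired <- unmarked
Encoding(repaired) <- "UTF-8"

nchar(original)                 # 7
nchar(mojibake)                 # 8
identical(repaired, original)   # TRUE
```

The two bytes of ü (c3 bc) are read as two separate characters in the misread version, which matches the "BrÃ¼ssel" form reported earlier in the thread.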

hadley · 2017-05-30T18:59:06Z

@kevinushey so it's safe to assume the API always returns UTF-8 text?

kevinushey · 2017-05-30T22:15:03Z

That's right -- the rstudioapi will always return text using UTF-8 encoding.

yutannihilation · 2017-05-31T04:04:23Z

@dpprdan I've misunderstood your comment, sorry. Let me clarify.

Here are some possible ways to do "reprex"-fu:

1. copy the text "fa\xe7ile" and do reprex::reprex()
2. directly run reprex::reprex("fa\xe7ile")
3. select the text "fa\xe7ile" and do reprex::reprex_selection()

Methods 1 and 3 work fine because they pass the text "fa\xe7ile" as a character string, whereas method 2 fails because it passes an expression. Before R processes the expression, it irreversibly substitutes the escaped characters with actual characters, for example \xe7 with ç in your locale. Once substituted, R cannot infer whether the original was \xe7 or ç. (I'm not fully sure my usage of terms like "expression" is correct. Let me know if I'm wrong...)

So your choice can be method 1 or 3.

Or, alternatively, the input argument might be useful since it can take a character string, like below.

# input needs line breaks to distinguish texts from filenames
reprex::reprex(input = sprintf("%s\n", "\"fa\\xe7ile\""))
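
The loss of the escape sequence can be seen with base R alone. This is a small sketch rather than a description of what reprex does internally; it uses the \u00e7 escape instead of \xe7 so that the result does not depend on the locale, and the variable names are invented for the example.

```r
# The characters a user would type or copy: quote, f, a, \, u, 0, 0, e, 7, ...
source_text <- "\"fa\\u00e7ile\""

# Parsing resolves the escape; the result is the string value itself.
parsed_value <- parse(text = source_text)[[1]]

nchar(source_text)                            # 13
nchar(parsed_value)                           # 6
grepl("\\u00e7", source_text, fixed = TRUE)   # TRUE
grepl("\\", parsed_value, fixed = TRUE)       # FALSE
```

Once only the parsed value is available, nothing in it records whether the source contained an escape or the literal character, which is why passing the code as text preserves more than passing it as an expression.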

dpprdan · 2017-05-31T10:01:57Z

To sum up: "fa\xe7ile" and "fa\u00E7ile" work fine for me with @yutannihilation's methods 1 and 3. The only problem that remains for me is with "Brüssel" with reprex_selection(). And it seems to me that indeed rstudioapi::getSourceEditorContext() is to blame here.

readLines("clipboard")
# [1] "\"Brüssel\""

rstudioapi::getSourceEditorContext()
# Document Context: 
# - id:        '8EA39DCA'
# - path:      ''
# - contents:  <1 rows>
# Document Selection:
# - [1, 1] -- [1, 10]: '"BrÃ¼ssel"'

I guess this is what @kevinushey was referring to? So with

ctx <- rstudioapi::getSourceEditorContext()
Encoding(ctx$selection[[1]]$text) <- "UTF-8"
ctx$selection
# Document Selection:
# - [1, 1] -- [1, 10]: '"Brüssel"'

this could be fixed in reprex until the underlying problem is fixed in rstudioapi?

yutannihilation · 2017-05-31T10:46:32Z

I've got the same result for "Brüssel".

this could be fixed in reprex until the underlying problem is fixed in rstudioapi?

Sounds fair to me. Marking a UTF-8 string as UTF-8 is safe whether or not it is already marked as UTF-8.
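
That safety claim can be checked with a short sketch. The helper mark_utf8() is hypothetical and is not part of reprex or rstudioapi; the check also assumes the bytes really are UTF-8, since marking does not convert anything.

```r
# Hypothetical helper (not part of reprex): declare text as UTF-8.
mark_utf8 <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

already_marked <- "Br\u00fcssel"                       # marked UTF-8 via \u
unmarked <- rawToChar(charToRaw(already_marked))       # same bytes, no mark
ascii_only <- "Brussel"

Encoding(unmarked)                                     # "unknown"
Encoding(mark_utf8(unmarked))                          # "UTF-8"
identical(mark_utf8(already_marked), already_marked)   # TRUE
Encoding(mark_utf8(ascii_only))                        # "unknown"
```

Marking an already marked string changes nothing, and pure ASCII strings never carry an encoding mark in R, so applying the helper unconditionally does no harm as long as the input is UTF-8.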

Note that we can safely assume the string passed from RStudio is always UTF-8 since originally it is passed as JSON, where the character encoding is supposed to be UTF-8: https://github.com/rstudio/rstudio/blob/600d2adf687cec0034bd63ff739bbc0f6acba348/src/cpp/session/modules/SessionWorkbench.cpp#L84-L100

So I guess this should be fixed upstream, in RStudio itself. Until that day comes, let's set the "UTF-8" encoding in the reprex package.

yutannihilation · 2017-05-31T13:51:01Z

Thanks for fixing!

Just for future reference, people in an MBCS locale like me may still fail to render some characters such as "Brüssel" correctly, probably due to an issue with sink():

"Brussel"
#> [1] "Brussel"

But this is not up to the reprex package. So I'm fine with the fix :)

dpprdan · 2017-07-18T14:20:51Z

I hate to say it, but I found something else (which I believe belongs here as well).

Source

x <- c("€", "–", "¼", "⅛", "℅", "‰", "Malmö")
Encoding(x)
print(x)

x[2] is an en-dash (U+2013), btw.

Console output

> x <- c("€", "–", "¼", "⅛", "℅", "‰", "Malmö")
> Encoding(x)
[1] "latin1" "latin1" "latin1" "UTF-8"  "UTF-8"  "latin1" "latin1"
> print(x)
[1] "€"     "–"     "¼"     "⅛"     "℅"     "‰"     "Malmö"

reprex::reprex()

x <- c("€", "–", "¼", "?", "?", "‰", "Malmö")
Encoding(x)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
print(x)
#> [1] "<U+0080>"     "<U+0096>"     "¼"     "?"     "?"     "<U+0089>"     "Malmö"

reprex::reprex_selection()

x <- c("€", "–", "¼", "<U+215B>", "<U+2105>", "‰", "Malmö")
Encoding(x)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
print(x)
#> [1] "<U+0080>"        "<U+0096>"        "¼"        "<U+215B>" "<U+2105>" "<U+0089>"       
#> [7] "Malmö"

Version reprex@3960cc7
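
One possible reading of the <U+0080>, <U+0096> and <U+0089> output above is that the CP1252 bytes for the euro sign, en dash and per mille sign were decoded as true ISO-8859-1, where those byte values are control characters. The sketch below only illustrates that byte-level difference; it does not establish where in the reprex or knitr pipeline the misreading happens, and it assumes the local iconv() accepts the names "CP1252" and "ISO-8859-1". The function code_points() is made up for the example.

```r
# Bytes that CP1252 assigns to the euro sign, en dash and per mille sign.
bytes <- as.raw(c(0x80, 0x96, 0x89))

# Decode each byte under a given encoding and return its Unicode code point.
code_points <- function(bytes, from) {
  vapply(
    as.list(bytes),
    function(b) utf8ToInt(iconv(rawToChar(b), from = from, to = "UTF-8")),
    integer(1)
  )
}

sprintf("U+%04X", code_points(bytes, "CP1252"))      # "U+20AC" "U+2013" "U+2030"
sprintf("U+%04X", code_points(bytes, "ISO-8859-1"))  # "U+0080" "U+0096" "U+0089"
```

The second line reproduces exactly the code points printed in the reprex output, which is consistent with the strings being treated as ISO-8859-1 rather than CP1252 somewhere along the way.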

jennybc · 2017-07-18T15:34:21Z

This is at least somewhat related to yihui/knitr#1415, which is about knitr's reporting of encoding.

dpprdan · 2017-07-18T16:19:43Z

Not really sure. knitr's reporting is only wrong for the UTF-8 chars. But apart from that knitr also seems to do all kinds of other weird things here.

This Rmd source (the #> output is what's shown in RStudio's Source pane):

    ```{r}
    x <- c("€", "–", "¼", "⅛", "℅", "‰", "ö")
    Encoding(x)
    ```

    #> [1] "latin1" "latin1" "latin1" "UTF-8"  "UTF-8"  "latin1" "latin1"

    ```{r}
    print(x)
    ```

    #> [1] "€" "–" "¼" "⅛" "℅" "‰" "ö"

results in this markdown (via RStudio > knit (w/ keep_md)):

    ```r
    x <- c("€", "–", "¼", "⅛", "℅", "‰", "ö") 
    Encoding(x)
    ```

    ```
    ## [1] "latin1"  "latin1"  "latin1"  "unknown" "unknown" "latin1"  "latin1"
    ```

    ```r
    print(x)
    ```

    ```
    ## [1] "�"        "�"        "¼"        "<U+215B>" "<U+2105>" "�"       
    ## [7] "ö"
    ```

knitr handles the c("€", "–", "¼", "⅛", "℅", "‰", "ö") part better than reprex(), but the print(x) output is even worse.

dpprdan closed this as completed May 29, 2017

dpprdan reopened this May 30, 2017

yutannihilation mentioned this issue May 31, 2017

Add workaround for Windows #87

Closed

hadley closed this as completed in 3960cc7 May 31, 2017

batpigandme mentioned this issue Jul 14, 2018

Encoding issue: the results with non-ASCII symbols were not reproduced #197

Closed
