reprex output does not have the right encoding on Windows #82

Closed
dpprdan opened this issue May 12, 2017 · 19 comments

@dpprdan

dpprdan commented May 12, 2017

reprex's output does not have the right encoding on Windows 10 (i.e. it should be declared as UTF-8).

This is the source I am passing to reprex()

# from help(Encoding)
x <- "fa\xE7ile"
Encoding(x)
x
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
c(x, xx)

This is how reprex renders it:

# from help(Encoding)
x <- "fa\xE7ile"
Encoding(x)
#> [1] "latin1"
x
#> [1] "façile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
#> [1] "latin1" "UTF-8"
c(x, xx)
#> [1] "façile" "façile"

But this is what I actually see on my console

> # from help(Encoding)
> x <- "fa\xE7ile"
> Encoding(x)
[1] "latin1"
> x
[1] "façile"
> xx <- iconv(x, "latin1", "UTF-8")
> Encoding(c(x, xx))
[1] "latin1" "UTF-8" 
> c(x, xx)
[1] "façile" "façile"

So, once I do this after passing the source to reprex()

library(magrittr)  # provides %>%

# Declare the clipboard text as UTF-8 without changing its bytes
to_utf8 <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

clipr::read_clip() %>% to_utf8() %>% clipr::write_clip()

I get

# from help(Encoding)
x <- "fa\xE7ile"
Encoding(x)
#> [1] "latin1"
x
#> [1] "façile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
#> [1] "latin1" "UTF-8"
c(x, xx)
#> [1] "façile" "façile"
Session info
devtools::session_info("reprex")
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.0 (2017-04-21)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2017-05-12
#> Packages -----------------------------------------------------------------
#>  package   * version    date       source                         
#>  backports   1.0.5      2017-01-18 CRAN (R 3.3.2)                 
#>  base64enc   0.1-3      2015-07-28 CRAN (R 3.3.0)                 
#>  bitops      1.0-6      2013-08-17 CRAN (R 3.3.0)                 
#>  callr       1.0.0      2016-06-18 CRAN (R 3.4.0)                 
#>  caTools     1.17.1     2014-09-10 CRAN (R 3.3.0)                 
#>  clipr       0.3.2      2017-01-09 CRAN (R 3.3.2)                 
#>  digest      0.6.12     2017-01-27 CRAN (R 3.3.2)                 
#>  evaluate    0.10       2016-10-11 CRAN (R 3.3.1)                 
#>  graphics  * 3.4.0      2017-04-21 local                          
#>  grDevices * 3.4.0      2017-04-21 local                          
#>  highr       0.6        2016-05-09 CRAN (R 3.3.0)                 
#>  htmltools   0.3.6      2017-04-28 CRAN (R 3.4.0)                 
#>  jsonlite    1.4        2017-04-08 CRAN (R 3.3.3)                 
#>  knitr       1.15.1     2016-11-22 CRAN (R 3.3.2)                 
#>  magrittr    1.5        2014-11-22 CRAN (R 3.3.0)                 
#>  markdown    0.8        2017-04-20 CRAN (R 3.3.3)                 
#>  methods   * 3.4.0      2017-04-21 local                          
#>  mime        0.5        2016-07-07 CRAN (R 3.3.1)                 
#>  Rcpp        0.12.10    2017-03-19 CRAN (R 3.3.3)                 
#>  reprex      0.1.1.9000 2017-05-10 Github (jennybc/reprex@9bad6f7)
#>  rmarkdown   1.5        2017-04-26 CRAN (R 3.3.3)                 
#>  rprojroot   1.2        2017-01-16 CRAN (R 3.3.2)                 
#>  stats     * 3.4.0      2017-04-21 local                          
#>  stringi     1.1.5      2017-04-07 CRAN (R 3.3.3)                 
#>  stringr     1.2.0      2017-02-18 CRAN (R 3.3.3)                 
#>  tools       3.4.0      2017-04-21 local                          
#>  utils     * 3.4.0      2017-04-21 local                          
#>  whisker     0.3-2      2013-04-28 CRAN (R 3.3.0)                 
#>  yaml        2.1.14     2016-11-12 CRAN (R 3.3.2)
@hadley
Member

hadley commented May 25, 2017

This should be fixed by #76

@dpprdan
Author

dpprdan commented May 29, 2017

Looks good to me

x <- "fa\xE7ile"
Encoding(x)
#> [1] "latin1"
x
#> [1] "façile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
#> [1] "latin1" "UTF-8"
c(x, xx)
#> [1] "façile" "façile"

@dpprdan dpprdan closed this as completed May 29, 2017
@dpprdan
Author

dpprdan commented May 30, 2017

Err, not quite.

I am assuming that the output from reprex() and reprex_selection() (and reprex_addin by extension) should all be the same.

Well, reprex::reprex("Brüssel") gives me

"Brüssel"
#> [1] "Brüssel"

But with "Brüssel" selected in the RStudio source pane and calling reprex::reprex_selection() I get:

"Brüssel"
#> [1] "Brüssel"

Again with reprex::reprex("fa\xE7ile")

"façile"
#> [1] "façile"

"fa\xE7ile" with reprex::reprex_selection()

"fa\xE7ile"
#> [1] "façile"

Actually, with "fa\xE7ile" I think reprex_selection() is correct and reprex() is not, because "fa\xE7ile" on the command line gives me

> "fa\xE7ile"
[1] "façile"

@dpprdan dpprdan reopened this May 30, 2017
@yutannihilation
Member

yutannihilation commented May 30, 2017

Thanks, and sorry for my imperfect PR to fix this... I get slightly different results, which are similarly problematic.

Due to my locale, Japanese_Japan.932, I cannot render this correctly:

# this result is copied and pasted from the console
"fa\xe7ile"
#> [1] "fa輅le"

So I use this one instead:

# this result is copied and pasted from the console
"fa\u00E7ile"
#> [1] "façile"

Then, I got the following results both for reprex() and reprex_selection():

"fa\u00E7ile"
#> [1] "facile"

Note that when I type "ç" in the console, it is quietly substituted with "c":

# this result is copied and pasted from the console
"ç"
#> [1] "c"

So the reason the result above is "facile" rather than "façile" seems to be that it was parsed incorrectly, as R usually does :(

@hadley
Member

hadley commented May 30, 2017

This seems most likely to be an encoding issue with the RStudio API. Hopefully @kevinushey will have some ideas

@dpprdan
Author

dpprdan commented May 30, 2017

@yutannihilation: Is the following from reprex("fa\u00E7ile") or reprex_selection() or did you just edit it here on github?

"fa\u00E7ile"
#> [1] "façile"

I am just asking because the ç is not a c here.

This

"fa\xe7ile"
#> [1] "fa輅le"

might actually be correct IMHO, assuming that \xe7 is the right encoding for 輅 in Japanese_Japan.932.

What does reprex_selection give you with "Brüssel"?

@yutannihilation
Member

Is the following from reprex("fa\u00E7ile") or reprex_selection() or did you just edit it here on github?

Ah, sorry for confusing you! I copied and pasted it from my console and edited it here.

assuming that \xe7 is the right encoding for 輅 in Japanese_Japan.932.

No, 輅 is \xe7\x69. You can see that the i is absorbed into this character.

charToRaw("輅")
#> [1] e7 69
charToRaw("i")
#> [1] 69

I got the following for "Brüssel":

"Brテシssel"
#> [1] "Brテシssel"

@yutannihilation
Member

This seems most likely to be an encoding issue with the RStudio API.

I guess we can blame the difference between reprex::reprex() and reprex::reprex_selection() on the RStudio API, but I'm afraid it is too hard to preserve escaped characters as they are...

@yutannihilation
Member

yutannihilation commented May 30, 2017

Good news, RStudio API works fine for me. (but not for @dpprdan?)

Here is the result when I copied/selected the string "fa\xe7ile":

# these results are copied and pasted from console

readLines("clipboard")
#> [1] "\"fa\\xe7ile\""
#> Warning message:
#> In readLines("clipboard") : incomplete final line found on 'clipboard'

rstudioapi::getSourceEditorContext()
#> Document Context: 
#> - id:        '332A20F5'
#> - path:      ''
#> - contents:  <1 rows>
#> Document Selection:
#> - [1, 1] -- [1, 12]: '"fa\\xe7ile"'

@kevinushey

I recall that the rstudioapi package had (has?) a bug wherein we fail to mark the encoding of UTF-8 text, so that text retrieved using the rstudioapi package would not render correctly on Windows (since it would then assume that UTF-8 text was actually encoded in the system encoding).

If that's the case, manually fixing up the encoded text with e.g. Encoding(x) <- "UTF-8" should be a workaround in the interim.
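The effect of a missing encoding mark can be reproduced in base R without RStudio. This is only a sketch under stated assumptions: the string is built from raw UTF-8 bytes so that it starts out unmarked, and the misreading is simulated with iconv() from ISO-8859-1 rather than taken from an actual rstudioapi call.

```r
# UTF-8 bytes for "Brüssel", built from raw values so the example does not
# depend on the encoding of this file or on the session locale.
utf8_bytes <- as.raw(c(0x42, 0x72, 0xc3, 0xbc, 0x73, 0x73, 0x65, 0x6c))
x <- rawToChar(utf8_bytes)
Encoding(x)                      # "unknown": the bytes carry no declaration

# Simulate a reader that assumes a single-byte encoding: each of the two
# bytes of "ü" is converted into a character of its own.
misread <- iconv(x, from = "ISO-8859-1", to = "UTF-8")
nchar(misread, type = "chars")   # 8 characters instead of 7

# Marking the string only changes the declaration, never the bytes.
Encoding(x) <- "UTF-8"
nchar(x, type = "chars")         # 7
identical(charToRaw(x), utf8_bytes)
```

The last line shows why the workaround is cheap: no conversion takes place, the existing bytes are simply declared to be what they already are.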

@hadley
Member

hadley commented May 30, 2017

@kevinushey so it's safe to assume the API always returns UTF-8 text?

@kevinushey

That's right -- the rstudioapi will always return text using UTF-8 encoding.

@yutannihilation
Member

yutannihilation commented May 31, 2017

@dpprdan I've misunderstood your comment, sorry. Let me clarify.

Here are some possible ways to do "reprex"-fu:

  1. copy the text "fa\xe7ile" and do reprex::reprex()
  2. directly run reprex::reprex("fa\xe7ile")
  3. select the text "fa\xe7ile" and do reprex::reprex_selection()

Methods 1 and 3 work fine because they pass the text "fa\xe7ile" as a character string, whereas method 2 fails because it passes an expression. Before R evaluates the expression, it irreversibly substitutes the escaped characters with actual characters, for example \xe7 with ç in your locale. Once substituted, R cannot tell whether the original was \xe7 or ç. (I'm not fully sure my usage of terms like "expression" is correct. Let me know if I'm wrong...)

So your choice can be method 1 or 3.

Alternatively, the input argument might be useful, since it accepts character strings, as below.

# input needs line breaks to distinguish texts from filenames
reprex::reprex(input = sprintf("%s\n", "\"fa\\xe7ile\""))
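The loss of the escape sequence described in this comment can be observed in base R. This is a minimal sketch, not reprex's own code: it uses the \u00E7 form of the escape (which also appears earlier in this thread) instead of \xe7 so that the result does not depend on the locale, and has_escape() is a hypothetical helper introduced only for the illustration.

```r
# The source text exactly as typed in an editor: it contains a backslash escape.
src <- "\"fa\\u00E7ile\""

# Parsing and evaluating the text resolves the escape into an actual character.
value <- eval(parse(text = src))

# Hypothetical helper: does the string still contain a literal backslash?
has_escape <- function(s) grepl("\\", s, fixed = TRUE)

has_escape(src)     # TRUE: the text still carries the escape
has_escape(value)   # FALSE: only the character is left
charToRaw(value)    # 66 61 c3 a7 69 6c 65
```

Once only `value` is available, nothing records whether the author typed the escape or the character itself, which is why passing the code as text (methods 1 and 3, or the input argument) preserves it and passing an expression does not.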

@dpprdan
Author

dpprdan commented May 31, 2017

To sum up: "fa\xe7ile" and "fa\u00E7ile" work fine for me with @yutannihilation's methods 1 and 3. The only problem that remains for me is with "Brüssel" with reprex_selection(). And it seems to me that indeed rstudioapi::getSourceEditorContext() is to blame here.

readLines("clipboard")
# [1] "\"Brüssel\""

rstudioapi::getSourceEditorContext()
# Document Context: 
# - id:        '8EA39DCA'
# - path:      ''
# - contents:  <1 rows>
# Document Selection:
# - [1, 1] -- [1, 10]: '"Brüssel"'

I guess this is what @kevinushey was referring to? So with

ctx <- rstudioapi::getSourceEditorContext()
Encoding(ctx$selection[[1]]$text) <- "UTF-8"
ctx$selection
# Document Selection:
# - [1, 1] -- [1, 10]: '"Brüssel"'

this could be fixed in reprex until the underlying problem is fixed in rstudioapi?

@yutannihilation
Member

yutannihilation commented May 31, 2017

I've got the same result for "Brüssel".

this could be fixed in reprex until the underlying problem is fixed in rstudioapi?

Sounds fair to me. Marking a UTF-8 string as UTF-8 is safe whether or not it is already marked as UTF-8.

Note that we can safely assume the string passed from RStudio is always UTF-8 since originally it is passed as JSON, where the character encoding is supposed to be UTF-8: https://github.com/rstudio/rstudio/blob/600d2adf687cec0034bd63ff739bbc0f6acba348/src/cpp/session/modules/SessionWorkbench.cpp#L84-L100

So I guess this should be fixed upstream, in RStudio itself. Until that day comes, let's set the "UTF-8" encoding in the reprex package.
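The claim that marking is safe to repeat can be checked with the to_utf8() helper from the first comment. This is a small sketch in base R; the input is built from raw bytes to stand in for text arriving unmarked, which is an assumption about what the API returns rather than an actual rstudioapi call.

```r
# Helper from the first comment in this thread.
to_utf8 <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

# Stand-in for an unmarked selection: the UTF-8 bytes of "Brüssel" in quotes.
selection <- rawToChar(as.raw(c(0x22, 0x42, 0x72, 0xc3, 0xbc, 0x73,
                                0x73, 0x65, 0x6c, 0x22)))

once  <- to_utf8(selection)
twice <- to_utf8(once)

identical(once, twice)        # TRUE: marking again changes nothing
Encoding(twice)               # "UTF-8"
Encoding(to_utf8("reprex"))   # "unknown": pure ASCII strings are never marked
```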

@yutannihilation
Member

Thanks for fixing!

Just for future reference, people in an MBCS locale like me may still fail to render some characters such as "Brüssel" correctly, probably due to an issue with sink():

"Brussel"
#> [1] "Brussel"

But this is outside the reprex package's control, so I'm fine with the fix :)

@dpprdan
Author

dpprdan commented Jul 18, 2017

I hate to say it, but I found something else (which I believe belongs here as well).

Source

x <- c("€", "–", "¼", "⅛", "℅", "‰", "Malmö")
Encoding(x)
print(x)

x[2] is an en-dash (U+2013), btw.

Console output

> x <- c("€", "–", "¼", "⅛", "℅", "‰", "Malmö")
> Encoding(x)
[1] "latin1" "latin1" "latin1" "UTF-8"  "UTF-8"  "latin1" "latin1"
> print(x)
[1] "€"     "–"     "¼"     "⅛"     "℅"     "‰"     "Malmö"

reprex::reprex()

x <- c("", "", "¼", "?", "?", "", "Malmö")
Encoding(x)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
print(x)
#> [1] "<U+0080>"     "<U+0096>"     "¼"     "?"     "?"     "<U+0089>"     "Malmö"

reprex::reprex_selection()

x <- c("", "", "¼", "<U+215B>", "<U+2105>", "", "Malmö")
Encoding(x)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
print(x)
#> [1] "<U+0080>"        "<U+0096>"        "¼"        "<U+215B>" "<U+2105>" "<U+0089>"       
#> [7] "Malmö"

Version reprex@3960cc7
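One possible reading of the <U+0080>, <U+0096> and <U+0089> placeholders above (an assumption, not something confirmed in this thread): in Windows-1252 the euro sign, en-dash and per-mille sign are the single bytes 0x80, 0x96 and 0x89, and those same bytes interpreted as ISO-8859-1 are the control characters U+0080, U+0096 and U+0089. The sketch below decodes the three bytes both ways with base R's iconv(); decode() is a hypothetical helper.

```r
bytes <- as.raw(c(0x80, 0x96, 0x89))

# Hypothetical helper: decode one byte from the given encoding and return
# the Unicode code point of the resulting character.
decode <- function(b, from) {
  utf8ToInt(iconv(rawToChar(b), from = from, to = "UTF-8"))
}

as_cp1252 <- vapply(seq_along(bytes),
                    function(i) decode(bytes[i], "CP1252"), integer(1))
as_latin1 <- vapply(seq_along(bytes),
                    function(i) decode(bytes[i], "ISO-8859-1"), integer(1))

sprintf("U+%04X", as_cp1252)   # "U+20AC" "U+2013" "U+2030"
sprintf("U+%04X", as_latin1)   # "U+0080" "U+0096" "U+0089"
```

If this reading is right, the bytes survive the round trip and only the encoding used to interpret them differs between the console and the rendered output.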

@jennybc
Member

jennybc commented Jul 18, 2017

This is at least somewhat related to yihui/knitr#1415, which is about knitr's reporting of encoding.

@dpprdan
Author

dpprdan commented Jul 18, 2017

Not really sure. knitr's reporting is only wrong for the UTF-8 chars. But apart from that knitr also seems to do all kinds of other weird things here.

This Rmd source (the #> output is what's shown in RStudio's Source pane):

    ```{r}
    x <- c("€", "–", "¼", "⅛", "℅", "‰", "ö")
    Encoding(x)
    ```

    #> [1] "latin1" "latin1" "latin1" "UTF-8"  "UTF-8"  "latin1" "latin1"

    ```{r}
    print(x)
    ```

    #> [1] "€" "–" "¼" "⅛" "℅" "‰" "ö"

results in this markdown (via RStudio > knit (w/ keep_md)):

    ```r
    x <- c("€", "–", "¼", "⅛", "℅", "‰", "ö") 
    Encoding(x)
    ```

    ```
    ## [1] "latin1"  "latin1"  "latin1"  "unknown" "unknown" "latin1"  "latin1"
    ```

    ```r
    print(x)
    ```

    ```
    ## [1] "�"        "�"        "¼"        "<U+215B>" "<U+2105>" "�"       
    ## [7] "ö"
    ```

knitr handles the c("€", "–", "¼", "⅛", "℅", "‰", "ö") part better than reprex(), but the print(x) output is even worse.
