Have read_csv and friends understand international characters #892

Closed
jonovik opened this issue Sep 21, 2018 · 7 comments

jonovik commented Sep 21, 2018

Modern R handles UTF-8 characters well, including the Norwegian æøå. However, read_csv returns a tibble that displays these characters as escaped bytes (e.g. "\xe6\xf8\xe5"). This is ugly, e.g. when knitting international rmarkdown documents.

read.table handles international input without problems, and the result can be converted to a tibble without the ugly escaping. I wish read_csv and other readr functions would behave similarly.

Small reproducible example:

readr::read_csv(file="x\næøå")
## # A tibble: 1 x 1
##   x             
##   <chr>         
## 1 "\xe6\xf8\xe5"

tibble::as_tibble(read.table(textConnection("x\næøå"), header=TRUE, stringsAsFactors = FALSE))
## # A tibble: 1 x 1
##   x    
##   <chr>
## 1 æøå

devtools::session_info()
Session info ----
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, mingw32             
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  English_United States.1252  
 tz       Europe/Berlin               
 date     2018-09-21                  

Packages -------
 package    * version date       source                      
 assertthat   0.2.0   2017-04-11 CRAN (R 3.5.1)              
 backports    1.1.2   2017-12-13 CRAN (R 3.5.0)              
 base       * 3.5.1   2018-07-02 local                       
 cli          1.0.0   2017-11-05 CRAN (R 3.5.1)              
 compiler     3.5.1   2018-07-02 local                       
 crayon       1.3.4   2017-09-16 CRAN (R 3.5.1)              
 datasets   * 3.5.1   2018-07-02 local                       
 devtools     1.13.6  2018-06-27 CRAN (R 3.5.1)              
 digest       0.6.17  2018-09-12 CRAN (R 3.5.1)              
 evaluate     0.11    2018-07-17 CRAN (R 3.5.1)              
 fansi        0.3.0   2018-08-13 CRAN (R 3.5.1)              
 graphics   * 3.5.1   2018-07-02 local                       
 grDevices  * 3.5.1   2018-07-02 local                       
 hms          0.4.2   2018-03-10 CRAN (R 3.5.1)              
 htmltools    0.3.6   2017-04-28 CRAN (R 3.5.1)              
 knitr        1.20.15 2018-08-22 Github (yihui/knitr@4864ac9)
 magrittr     1.5     2014-11-22 CRAN (R 3.5.1)              
 memoise      1.1.0   2017-04-21 CRAN (R 3.5.1)              
 methods    * 3.5.1   2018-07-02 local                       
 pillar       1.3.0   2018-07-14 CRAN (R 3.5.1)              
 pkgconfig    2.0.2   2018-08-16 CRAN (R 3.5.1)              
 R6           2.2.2   2017-06-17 CRAN (R 3.5.1)              
 Rcpp         0.12.18 2018-07-23 CRAN (R 3.5.1)              
 readr        1.1.1   2017-05-16 CRAN (R 3.5.1)              
 rlang        0.2.2   2018-08-16 CRAN (R 3.5.1)              
 rmarkdown    1.10    2018-06-11 CRAN (R 3.5.1)              
 rprojroot    1.3-2   2018-01-03 CRAN (R 3.5.1)              
 rstudioapi   0.7     2017-09-07 CRAN (R 3.5.1)              
 stats      * 3.5.1   2018-07-02 local                       
 stringi      1.2.4   2018-07-20 CRAN (R 3.5.1)              
 stringr      1.3.1   2018-05-10 CRAN (R 3.5.1)              
 tibble       1.4.2   2018-01-22 CRAN (R 3.5.1)              
 tools        3.5.1   2018-07-02 local                       
 utf8         1.1.4   2018-05-24 CRAN (R 3.5.1)              
 utils      * 3.5.1   2018-07-02 local                       
 withr        2.1.2   2018-03-15 CRAN (R 3.5.1)              
 xfun         0.3     2018-07-06 CRAN (R 3.5.1)              
 yaml         2.2.0   2018-07-25 CRAN (R 3.5.1)

yutannihilation (Member) commented

I believe this is already fixed by #730 (not released yet, though).

Currently, you need to set the locale explicitly to match your default encoding. This should work in your environment:

readr::read_csv(file="x\næøå", locale = readr::locale(encoding = "latin1"))

jonovik commented Sep 21, 2018

Your example works on my system too with encoding = "latin1".
However, the problem remains if I use encoding = "UTF-8".
(UTF-8 is also the default on my system.)

> readr::read_csv(file="x\næøå", locale = readr::locale(encoding = "latin1"))
# A tibble: 1 x 1
  x    
  <chr>
1 æøå  
> readr::read_csv(file="x\næøå", locale = readr::locale(encoding = "UTF-8"))
# A tibble: 1 x 1
  x             
  <chr>         
1 "\xe6\xf8\xe5"

I was hoping to steer my students towards readr as an alternative to base R with fewer surprises and less hassle, but internationalization seems to be tricky (cf. also #884).

jonovik commented Sep 21, 2018

Hmm, I may be confused about my own locale... Setting encoding="UTF-8" throws even read.table off:

> read.table(textConnection("x\næøå"), header=TRUE, stringsAsFactors = FALSE, encoding = "latin-1")
    x
1 æøå
> read.table(textConnection("x\næøå"), header=TRUE, stringsAsFactors = FALSE, encoding = "UTF-8")
             x
1 <e6><f8><e5>

On second reading of my session_info() above, I see collate English_United States.1252. However my locale() does seem to be UTF-8:

> locale()
<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday (Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May), June (Jun), July (Jul), August (Aug), September (Sep), October (Oct), November (Nov), December (Dec)
AM/PM:  AM/PM

Specifying UTF-8 when writing and reading also seems to work:

> write.table(data_frame(x="æøå"), file="test.txt", fileEncoding = "UTF-8")
> read.table("test.txt")
       x
1 æøå
> read.table("test.txt", encoding = "UTF-8")
    x
1 æøå

In any case, this illustrates exactly the kind of stuff I have always had trouble understanding, and was hoping that UTF-8 would free me from.

Any suggestions on how I can set my locale or other settings/environment variables to have this "just work" (i.e. never having to specify encoding in my own code) would be highly appreciated.

jonovik commented Sep 21, 2018

> [yutannihilation] I believe this is already fixed by #730 (not released yet, though).

I forgot to say that I appreciate this! Maybe my confusion will be irrelevant soon 8-)

yutannihilation commented Sep 21, 2018

Your default locale is English_United States.1252 (code page 1252, which R treats as latin1), whereas readr's default locale is UTF-8 (locale() shows readr's default locale, not your system's).
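
For anyone who wants to check which encoding their own R session uses natively, here is a minimal base-R sketch. It only inspects the session and does not involve readr; the variable names are illustrative.

```r
# Inspect the R session's native encoding (base R only).
info  <- l10n_info()                # list with MBCS, UTF-8, Latin-1 (plus codepage on Windows)
ctype <- Sys.getlocale("LC_CTYPE")  # e.g. "English_United States.1252" on the reporter's machine

# TRUE only when the session itself runs in a UTF-8 locale.
native_is_utf8 <- isTRUE(info[["UTF-8"]])

cat("LC_CTYPE:", ctype, "\n")
cat("Native encoding is UTF-8:", native_is_utf8, "\n")
```

If the second line prints FALSE, strings typed on the console are not UTF-8, even though readr's locale() reports UTF-8.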

That's the tragedy: everything you type in the R console is encoded in latin1, and readr tries to read it as UTF-8 unless told otherwise. So the text is read correctly in either of two cases:

  1. You provide UTF-8 text (e.g. "test.txt" written in UTF-8) and don't specify a locale.
  2. You provide latin1 text (e.g. "x\næøå" typed on the console) and specify a locale that tells readr to use latin1.
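
The two cases can be illustrated at the byte level with base R alone. This is only a sketch of the encoding mismatch, not of readr's internals; the bytes are written as raw values so that the result does not depend on the encoding of the script or console.

```r
# Bytes of "æøå" in each encoding, built from raw values.
latin1_bytes <- as.raw(c(0xe6, 0xf8, 0xe5))
utf8_bytes   <- as.raw(c(0xc3, 0xa6, 0xc3, 0xb8, 0xc3, 0xa5))

# Interpreted as UTF-8, the latin1 bytes are not a valid sequence,
# which is why they end up displayed as "\xe6\xf8\xe5".
latin1_as_utf8_ok <- validUTF8(rawToChar(latin1_bytes))  # FALSE
utf8_ok           <- validUTF8(rawToChar(utf8_bytes))    # TRUE (case 1)

# Case 2: declare the real source encoding and convert to UTF-8.
recovered <- iconv(rawToChar(latin1_bytes), from = "latin1", to = "UTF-8")
same_text <- identical(charToRaw(recovered), utf8_bytes) # TRUE
```

Passing locale = readr::locale(encoding = "latin1") plays the same role as the from = "latin1" argument here: it tells the reader what the input bytes mean.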

> Maybe my confusion will be irrelevant soon 8-)

I hope so, but unfortunately the release seems to have been stuck for months...

jimhester (Collaborator) commented

As far as I can tell this is fixed in the devel version of readr.

lock bot commented May 12, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

lock bot locked and limited conversation to collaborators May 12, 2019