read_csv() cannot handle file paths with characters outside of the default locale (Windows) #884

Closed
yutannihilation opened this issue Aug 24, 2018 · 7 comments


@yutannihilation
Member

(Originally reported here: #834 (comment))

On Windows, read_csv() seems unable to handle a file path that contains characters outside of the default locale.

For example, in my locale, CP932 (Shift_JIS), 长 is not representable.

readr::read_csv("萼片长.csv")
#> Error in guess_header_(datasource, tokenizer, locale) : 
#>   Cannot read file C:/Users/hiroaki-yutani/Documents/repo/R/ggplot2/萼片<U+957F>.csv: (...snip...)

For another example, in CP1252 (latin1), none of 萼, 片, and 长 are representable:

# change locale to latin1
rlang::mut_latin1_locale()

readr::read_csv("萼片长.csv")
#> Error in guess_header_(datasource, tokenizer, locale) : 
#>   Cannot read file C:/Users/hiroaki-yutani/Documents/repo/R/ggplot2/<U+843C><U+7247><U+957F>.csv: (...snip...)
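
Whether a given path is affected depends on whether its characters exist in the active code page. The following is a minimal sketch for checking that in advance, using only base R. The helper name `is_representable` is hypothetical (it is not part of readr or base R), and the round trip through `iconv()` is an assumption about how best to detect unrepresentable characters, not something taken from readr's code.

```r
# Hypothetical helper: TRUE if every character of `x` survives a round trip
# through the target encoding (`to = ""` means the current native encoding).
is_representable <- function(x, to = "") {
  x <- enc2utf8(x)
  # iconv() returns NA (or, on some platforms, a substituted string) when a
  # character has no equivalent in the target encoding.
  encoded <- iconv(x, from = "UTF-8", to = to)
  decoded <- iconv(encoded, from = to, to = "UTF-8")
  !is.na(decoded) & decoded == x
}

# File names are written with \u escapes so the script itself stays ASCII.
norwegian <- "\u00e6\u00f8\u00e5.txt"   # æøå.txt
chinese   <- "\u843c\u7247\u957f.csv"   # 萼片长.csv

is_representable(norwegian, to = "latin1")
is_representable(chinese, to = "latin1")
```

Calling `is_representable(path)` with the default `to = ""` checks against whatever locale the R session is currently running in, which is the case that matters for the errors above.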

In defense of readr, base R also fails to handle such paths in some cases, for example with saveRDS():

saveRDS(iris, file = "萼片长.rds")
#> Error in gzfile(file, mode) : cannot open the connection
#> In addition: Warning message:
#> In gzfile(file, mode) :
#>   cannot open compressed file '<U+843C><U+7247><U+957F>.rds', probable reason 'Invalid argument'

But I have no idea how Boost could handle this, since it requires the file path in the native locale...

@yutannihilation
Member Author

Hmmmm, it seems impossible...?

boostorg/interprocess#28

@jonovik

jonovik commented Sep 20, 2018

Another reproducible example is below. read.table() can handle filenames with Norwegian characters, but read_tsv() cannot.

write.table(head(iris, 2), file="æøå.txt")
read.table("æøå.txt")
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
readr::read_tsv("æøå.txt")
## Error in guess_header_(datasource, tokenizer, locale) : 
## Cannot read file C:/.../æøå.txt: The system cannot find the file specified.

devtools::session_info()
Session info --------------------------------------------------------------------------------------------------
setting  value                       
version  R version 3.5.1 (2018-07-02)
system   x86_64, mingw32             
ui       RStudio (1.1.456)           
language (EN)                        
collate  English_United States.1252  
tz       Europe/Berlin               
date     2018-09-19                  

Packages ------------------------------------------------------------------------------------------------------
package    * version date       source        
base       * 3.5.1   2018-07-02 local         
compiler     3.5.1   2018-07-02 local         
crayon       1.3.4   2017-09-16 CRAN (R 3.5.1)
datasets   * 3.5.1   2018-07-02 local         
devtools     1.13.6  2018-06-27 CRAN (R 3.5.1)
digest       0.6.17  2018-09-12 CRAN (R 3.5.1)
graphics   * 3.5.1   2018-07-02 local         
grDevices  * 3.5.1   2018-07-02 local         
hms          0.4.2   2018-03-10 CRAN (R 3.5.1)
memoise      1.1.0   2017-04-21 CRAN (R 3.5.1)
methods    * 3.5.1   2018-07-02 local         
pillar       1.3.0   2018-07-14 CRAN (R 3.5.1)
pkgconfig    2.0.2   2018-08-16 CRAN (R 3.5.1)
R6           2.2.2   2017-06-17 CRAN (R 3.5.1)
Rcpp         0.12.18 2018-07-23 CRAN (R 3.5.1)
readr        1.1.1   2017-05-16 CRAN (R 3.5.1)
rlang        0.2.2   2018-08-16 CRAN (R 3.5.1)
rstudioapi   0.7     2017-09-07 CRAN (R 3.5.1)
stats      * 3.5.1   2018-07-02 local         
tibble       1.4.2   2018-01-22 CRAN (R 3.5.1)
tools        3.5.1   2018-07-02 local         
utils      * 3.5.1   2018-07-02 local         
withr        2.1.2   2018-03-15 CRAN (R 3.5.1)
yaml         2.2.0   2018-07-25 CRAN (R 3.5.1)

@jonovik

jonovik commented Sep 20, 2018

I don't know if it is related, but when using Rterm in cmd I cannot even type æøå (I get `o+). Norwegian characters work fine at the cmd command line; only Rterm has problems. Maybe I could fix that with some setting, and maybe that would fix read_tsv() too...?

In Rgui, the example runs the same as inside RStudio.

@yutannihilation
Member Author

After a month of thinking, I'm coming to the conclusion that this is not directly possible as long as we rely on boostorg/interprocess. Maybe creating a link could be a workaround?

tmp <- tempfile(fileext = ".csv")
file.link("萼片长.csv", tmp)
readr::read_csv(tmp)

If this looks good, I can send a PR...
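
The link idea above can be sketched as a small wrapper. This is only an illustration under stated assumptions: `read_via_link` is a hypothetical name, not a readr function, and the fallback to `file.copy()` when `file.link()` fails (for example across volumes) is an addition that the comment above does not propose. It also assumes the temporary directory path is itself representable in the current locale, and that `reader` reads the file eagerly, since the temporary link is removed when the function returns.

```r
# Hypothetical helper (not part of readr): read a file through a temporary
# link created by tempfile(), falling back to a copy if linking fails.
read_via_link <- function(path, reader = utils::read.csv, ...) {
  # Keep the original extension so readers that inspect it still work.
  ext <- tools::file_ext(path)
  tmp <- tempfile(fileext = if (nzchar(ext)) paste0(".", ext) else "")
  on.exit(unlink(tmp), add = TRUE)

  # file.link() returns FALSE (with a warning) when a hard link cannot be
  # created; in that case fall back to a plain copy.
  linked <- suppressWarnings(file.link(path, tmp))
  if (!isTRUE(linked) && !file.copy(path, tmp)) {
    stop("Could not link or copy '", path, "' to a temporary file")
  }

  # The reader must finish reading before this function returns, because
  # the temporary file is deleted on exit.
  reader(tmp, ...)
}
```

With readr installed, a call would look like `read_via_link("萼片长.csv", reader = readr::read_csv)`; the default uses `utils::read.csv()` so the sketch runs with base R alone.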

@jimhester
Collaborator

I think this is just a limitation of base R, you need to be able to represent the file paths in the current locale.

@yutannihilation
Member Author

@jimhester

this is just a limitation of base R, you need to be able to represent the file paths in the current locale.

FYI, this is wrong. R can handle paths that are not representable in the current locale (that's why path.expand() now returns a UTF-8 string). In the code below, read.csv() and write.csv() work fine.

write.csv(iris, "萼片长.csv", row.names = FALSE)
read.csv("萼片长.csv")

I'm OK with closing this as wontfix, but this is a limitation of Boost, not of base R.

(Sorry if it was a bit confusing that I explained that saveRDS() won't work, but it's just saveRDS()'s problem, not a limitation of base R...)
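
To see which base R functions cope with such a path on a given machine, a small probe like the one below may help. It is a sketch only: `can_roundtrip` is a hypothetical helper, and the results depend on the platform, the R version, and the active locale, so no particular outcome is claimed here.

```r
# Hypothetical probe: does a writer/reader pair round-trip a file at `path`?
# Returns TRUE on success and FALSE if either step throws an error.
can_roundtrip <- function(path, writer, reader) {
  on.exit(unlink(path), add = TRUE)
  tryCatch(
    {
      suppressWarnings(writer(path))
      suppressWarnings(reader(path))
      TRUE
    },
    error = function(e) FALSE
  )
}

# 萼片长, written with \u escapes so the script itself stays ASCII.
name <- "\u843c\u7247\u957f"

results <- c(
  csv = can_roundtrip(
    file.path(tempdir(), paste0(name, ".csv")),
    function(p) write.csv(iris, p, row.names = FALSE),
    function(p) read.csv(p)
  ),
  rds = can_roundtrip(
    file.path(tempdir(), paste0(name, ".rds")),
    function(p) saveRDS(iris, p),
    function(p) readRDS(p)
  )
)
print(results)
```

Running this under the locale in question shows at a glance whether the failure is specific to one function, as described for saveRDS() above, or affects every file operation.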

@lock

lock bot commented May 12, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators May 12, 2019