read_csv can't recognize chinese file path on R 3.5.0 #834

Closed
nan1949 opened this issue Apr 26, 2018 · 11 comments · Fixed by #838
Labels
bug an unexpected problem or unintended behavior

Comments

@nan1949

nan1949 commented Apr 26, 2018

R.version.string
[1] "R version 3.4.3 (2017-11-30)"
repaid <- read_csv("D:/work51/催收专员绩效m1/还款4月0425.csv",
col_types = list(advance_clear_amount = col_double()))

R.version.string
[1] "R version 3.5.0 (2018-04-23)"
repaid <- read_csv("D:/work51/催收专员绩效m1/还款4月0425.csv",
col_types = list(advance_clear_amount = col_double()))
Error in guess_header_(datasource, tokenizer, locale) : Cannot read file D:/work51/鍌敹涓撳憳缁╂晥m1/杩樻4鏈?425.csv: 系统找不到指定的路径。
(The system message at the end means "The system cannot find the path specified.")

@yutannihilation
Member

This is due to a change in the behavior of normalizePath() (more precisely, of path.expand(), which is used inside normalizePath()).

R 3.4.4 on Windows 10:

Encoding(normalizePath("~/鬼"))
#> [1] "unknown"

R 3.5.0 on Windows 10:

Encoding(normalizePath("~/鬼"))
#> [1] "UTF-8"

The release note says:

path.expand() on Windows now accepts paths specified as UTF-8-encoded character strings even if not representable in the current locale. (PR#17120)

so this seems intentional and unlikely to be reverted.

Rather, I feel it's not robust for readr to naively assume that the path is already encoded in the native locale; instead, the path should be explicitly converted to the native encoding before being passed to boost::interprocess::file_mapping().
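
As a rough R-level illustration of that idea, the sketch below converts a path to the native encoding before it would be handed to lower-level code. `to_native_path()` is a hypothetical helper, not part of readr, and the fix referenced in this thread was made on the C++ side instead. A character that the native encoding cannot represent is not rescued by this conversion.

```r
# Hypothetical helper (not part of readr): expand "~" and convert the path
# to the native encoding, taking any declared encoding mark into account.
to_native_path <- function(path) {
  stopifnot(is.character(path), length(path) == 1L, !is.na(path))
  enc2native(path.expand(path))
}

p <- to_native_path("data/example.csv")
Encoding(p)  # pure-ASCII strings are never marked, so this is "unknown"
```

For a path that contains non-ASCII characters, the result depends on the session's locale, which is why no output is shown for that case.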

@GegznaV
Contributor

GegznaV commented Jun 4, 2018

I'm not sure if I have the same problem as in this issue: the readr functions fail if non-ASCII letters are present in a file path.

I create a file:

library(readr)
Sys.setlocale(locale = "Lithuanian")
dir.create("C:/data/medž/", recursive = TRUE)
write.table(x = iris, file =  "C:/data/medž/data.txt")

And the code fails to read it:

read_file("C:/data/medž/data.txt")

Fails with the message:

Error in read_file_(ds, locale) : Cannot read file C:/data/medž/data.txt: The system cannot find the path specified.

3. stop(structure(list(message = "Cannot read file C:/data/medž/data.txt: The system cannot find the path specified.", call = read_file_(ds, locale), cppstack = structure(list( file = "", line = -1L, stack = "C++ stack not available on this system"), class = "Rcpp_stack_trace")), class = c("Rcpp::exception", "C++Error", "error", "condition")))
2. read_file_(ds, locale)
1. read_file("C:/data/medž/data.txt")

@yutannihilation
Member

Yes, the same problem.

jimhester pushed a commit that referenced this issue Jun 5, 2018
…ss::file_mapping() (#838)

* use Rf_translateChar()

* declare path as CharacterVector

* add a NEWS item

Fixes #834 
Fixes #837
@vnijs

vnijs commented Aug 23, 2018

I see this issue is closed, but after installing the latest version of readr from GitHub the problem still occurs, at least on my system (Windows 10, R 3.5.1).

Question: Will the fix you are working on in readr also work for .rds (and perhaps .rda?) files?

> readr::read_rds("Z:/GitHub/radiant.data/萼片长/diamonds_萼片长.rds")
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file 'Z:/GitHub/radiant.data/<U+843C><U+7247><U+957F>/diamonds_<U+843C><U+7247><U+957F>.rds', probable reason 'Invalid argument'
> readr::read_csv("Z:/GitHub/radiant.data/萼片长/test_萼片长.csv")
Error in guess_header_(datasource, tokenizer, locale) : 
  Cannot read file Z:/GitHub/radiant.data/<U+843C><U+7247><U+957F>/test_<U+843C><U+7247><U+957F>.csv: The filename, directory name, or volume label syntax is incorrect.
> devtools::session_info()
Session info -----------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, mingw32             
 ui       RStudio (1.2.907)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/Los_Angeles         
 date     2018-08-22                  

Packages ---------------------------------------------------------------------------
 package    * version    date       source                          
 base       * 3.5.1      2018-07-02 local                           
 compiler     3.5.1      2018-07-02 local                           
 crayon       1.3.4      2017-09-16 CRAN (R 3.5.0)                  
 datasets   * 3.5.1      2018-07-02 local                           
 devtools     1.13.6     2018-06-27 CRAN (R 3.5.0)                  
 digest       0.6.15     2018-01-28 CRAN (R 3.5.0)                  
 graphics   * 3.5.1      2018-07-02 local                           
 grDevices  * 3.5.1      2018-07-02 local                           
 hms          0.4.2.9001 2018-08-23 Github (tidyverse/hms@979286f)  
 memoise      1.1.0      2017-04-21 CRAN (R 3.5.0)                  
 methods    * 3.5.1      2018-07-02 local                           
 pillar       1.3.0      2018-07-14 CRAN (R 3.5.1)                  
 pkgconfig    2.0.2      2018-08-16 CRAN (R 3.5.1)                  
 R6           2.2.2      2017-06-17 CRAN (R 3.5.0)                  
 Rcpp         0.12.18    2018-07-23 CRAN (R 3.5.1)                  
 readr      * 1.2.0      2018-08-23 Github (tidyverse/readr@4b2e93a)
 rlang        0.2.2      2018-08-16 CRAN (R 3.5.1)                  
 rstudioapi   0.7.0-9001 2018-05-25 local                           
 stats      * 3.5.1      2018-07-02 local                           
 tibble       1.4.2      2018-01-22 CRAN (R 3.5.0)                  
 tools        3.5.1      2018-07-02 local                           
 utils      * 3.5.1      2018-07-02 local                           
 withr        2.1.2      2018-03-15 CRAN (R 3.5.0) 

@yutannihilation
Member

yutannihilation commented Aug 23, 2018

Will the fix you are working on in readr also work for .rds (and perhaps .rda?) files?

No, read_rds() is just an alias for readRDS().

readr/R/rds.R

Lines 18 to 20 in 3715a2d

read_rds <- function(path) {
  readRDS(path)
}

But, I think the second one should work.

> readr::read_csv("Z:/GitHub/radiant.data/萼片长/test_萼片长.csv")
Error in guess_header_(datasource, tokenizer, locale) : 
  Cannot read file Z:/GitHub/radiant.data/<U+843C><U+7247><U+957F>/test_<U+843C><U+7247><U+957F>.csv: The filename, directory name, or volume label syntax is incorrect.

Could you show the result of this code? Is this already garbled?

normalizePath("Z:/GitHub/radiant.data/萼片长/diamonds_萼片长.rds")

@vnijs

vnijs commented Aug 23, 2018

Thanks for the response @yutannihilation. Is there a work-around you could suggest to load .rda or .rds files with Chinese, Russian, etc. characters in R 3.5.1 on Windows?

RE the read_csv issue, as you can see below normalizePath("Z:/GitHub/radiant.data/萼片长/diamonds_萼片长.rda") is not garbled.

> readr::read_csv("Z:/GitHub/radiant.data/萼片长/test_萼片长.csv")
Error in guess_header_(datasource, tokenizer, locale) : 
  Cannot read file Z:/GitHub/radiant.data/<U+843C><U+7247><U+957F>/test_<U+843C><U+7247><U+957F>.csv: The filename, directory name, or volume label syntax is incorrect.
> normalizePath("Z:/GitHub/radiant.data/萼片长/diamonds_萼片长.rda", winslash = "/")
[1] "Z:/GitHub/radiant.data/萼片长/diamonds_萼片长.rda"
> fs::file_exists("Z:/GitHub/radiant.data/萼片长/diamonds_萼片长.rda")
Z:/GitHub/radiant.data/萼片长/diamonds_萼片长.rda 
                                             TRUE 

I also tried using RStudio's import data interface, but that gives similar errors.


@yutannihilation
Member

Hmm, curious... Considering that garbling happens even with readRDS(), I guess the problem lies in base functions. I have no idea about the exact cause yet, though.

I suspect this won't happen on Windows with a CJK locale, which is what I use, so it may be difficult for me to investigate... Anyway, I'll have a look tomorrow.

@yutannihilation
Member

Good news or bad news, this also happens on my Windows machine, but the error is slightly different:

saveRDS(iris, file = "萼片长.rds")
#> Error in gzfile(file, mode) : cannot open the connection
#> In addition: Warning message:
#> In gzfile(file, mode) :
#>   cannot open compressed file '萼片<U+957F>.rds', probable reason 'Invalid argument'

The reason 萼 and 片 were not garbled in my case is that they can be represented in my locale, whereas 长 cannot. This means, I guess, that R might be unable to handle a file path that contains any character that cannot be represented in the current locale.
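
One way to probe that guess is to test whether a path survives conversion to the native encoding. This is only a sketch: `is_representable()` is a hypothetical helper, and the behavior of iconv() for unconvertible input can vary with the platform's iconv implementation.

```r
# Hypothetical helper: TRUE if `path` can be converted to the encoding of
# the current locale, FALSE otherwise.
is_representable <- function(path) {
  # to = "" means "the encoding of the current locale"; with the default
  # sub = NA, iconv() returns NA for input it cannot convert.
  !is.na(iconv(enc2utf8(path), from = "UTF-8", to = ""))
}

is_representable("data/example.csv")            # ASCII-only path
# "\u843c\u7247\u957f" is the file name stem used in this thread;
# the result for it depends on the session's locale.
is_representable("\u843c\u7247\u957f.rds")
```

A FALSE result would flag a path that is likely to hit the problem described here before any file is opened.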

Anyway, I'll file a new issue. Sorry for continuing the discussion on a closed issue.

@yutannihilation
Member

Ah, for a workaround, I guess you can rename or copy the file:

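# Move the file to a temporary path, then read it from there.
# Note: file.rename() moves the original; use file.copy() to keep it in place.
tmp_rds <- tempfile(fileext = ".rds")
file.rename("Z:/GitHub/radiant.data/萼片长/test_萼片长.rds", tmp_rds)
readr::read_rds(tmp_rds)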
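
The copy variant of this workaround could be wrapped up roughly as follows. `read_rds_via_copy()` is a hypothetical helper, and it uses readRDS() directly because read_rds() is just an alias for it. Whether file.copy() can open the problematic path on a given setup is an assumption; the thread only shows fs::file_exists() and normalizePath() coping with it.

```r
# Hypothetical helper: read an .rds file through a temporary copy,
# leaving the original file untouched.
read_rds_via_copy <- function(path) {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp), add = TRUE)  # remove the copy when the function exits
  if (!file.copy(path, tmp)) {
    stop("could not copy '", path, "' to a temporary file")
  }
  readRDS(tmp)
}

# Round trip on a throwaway file
src <- tempfile(fileext = ".rds")
saveRDS(iris, src)
identical(read_rds_via_copy(src), iris)
```

This only helps if the temporary directory itself has a path the current locale can represent.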

@vnijs

vnijs commented Aug 24, 2018

Thanks @yutannihilation!

@lock

lock bot commented Feb 20, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Feb 20, 2019