fwrite encoding problem with file names #3078

Closed
dpprdan opened this issue Sep 27, 2018 · 15 comments · Fixed by #3141

@dpprdan
Contributor

dpprdan commented Sep 27, 2018

fwrite() cannot handle umlauts (and presumably any non-ASCII characters) in file names and paths on Windows (here with LC_COLLATE=German_Germany.1252, but in my experience this is also a problem in other non-UTF-8 locales).

When the umlaut is in the file name, fwrite writes the file, but with a faulty file name.

library(data.table)
setwd(tempdir())
DF = data.frame(A=1:3, B=c("foo","A,Name","baz"))
fwrite(DF, "töst.csv")
list.files(pattern = "\\.csv")
#> [1] "töst.csv"

When the umlaut is in the path, fwrite cannot write the file at all.

dir.create("ä")
data.table::fwrite(DF, "ä/test.csv")
#> Error in data.table::fwrite(DF, "ä/test.csv"): No such file or directory: 'ä/test.csv'. Unable to create new file for writing (it does not exist already). Do you have permission to write here, is there space on the disk and does the path exist?

I looked at and debugged the R code, and it seems to me that up until line 67 the file argument is encoded as “UTF-8” (as it should be, IMO) and looks fine. So my guess would be that the file path’s encoding goes wrong in the CfwriteR code.

From looking at the characters that should be “ö” or “ä” respectively, the problem seems to be that CfwriteR gets a UTF-8 string but handles it as if it were encoded as latin-1, see this table.
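The suspected mix-up can be reproduced in plain R, independent of data.table. This is only a sketch of the general effect (UTF-8 bytes decoded as latin1), not a trace of what CfwriteR actually does; the variable names are made up for illustration.

```r
# "ö" is written as an escape so the example does not depend on the
# encoding of the script file itself.
x <- "t\u00f6st.csv"

# The UTF-8 representation of "ö" is the two bytes c3 b6.
utf8_bytes <- charToRaw(enc2utf8(x))

# Decode those same bytes as if they were latin1, then convert the result
# to UTF-8 so it can be inspected.
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# The single "ö" has turned into two characters, U+00C3 and U+00B6.
utf8ToInt(misread)[2:3]
# 195 182
```

The misread name is one character longer than the original, which matches the pattern of one umlaut turning into two odd-looking characters.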

If the error were in the R code, I would solve it with an Encoding(file) <- "UTF-8" line, but I do not know how this is done in C.

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language en                          
#>  collate  German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2018-09-27
#> Packages -----------------------------------------------------------------
#>  package    * version date       source                            
#>  base       * 3.5.1   2018-07-02 local                             
#>  compiler     3.5.1   2018-07-02 local                             
#>  data.table * 1.11.6  2018-09-19 CRAN (R 3.5.1)                    
#>  datasets   * 3.5.1   2018-07-02 local                             
#>  devtools     1.13.6  2018-06-27 CRAN (R 3.5.1)                    
#>  digest       0.6.17  2018-09-12 CRAN (R 3.5.1)                    
#>  evaluate     0.11    2018-07-17 CRAN (R 3.5.1)                    
#>  graphics   * 3.5.1   2018-07-02 local                             
#>  grDevices  * 3.5.1   2018-07-02 local                             
#>  htmldeps     0.1.1   2018-07-30 Github (rstudio/htmldeps@c1023e0) 
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.5.1)                    
#>  knitr        1.20    2018-02-20 CRAN (R 3.5.1)                    
#>  magrittr     1.5     2014-11-22 CRAN (R 3.5.1)                    
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.5.1)                    
#>  methods    * 3.5.1   2018-07-02 local                             
#>  Rcpp         0.12.18 2018-07-23 CRAN (R 3.5.1)                    
#>  rmarkdown    1.10.13 2018-09-04 Github (rstudio/rmarkdown@19008bf)
#>  stats      * 3.5.1   2018-07-02 local                             
#>  stringi      1.2.4   2018-07-20 CRAN (R 3.5.1)                    
#>  stringr      1.3.1   2018-05-10 CRAN (R 3.5.1)                    
#>  tools        3.5.1   2018-07-02 local                             
#>  utils      * 3.5.1   2018-07-02 local                             
#>  withr        2.1.2   2018-03-15 CRAN (R 3.5.1)                    
#>  yaml         2.2.0   2018-07-25 CRAN (R 3.5.1)
@MichaelChirico
Member

Nice catch!

@MichaelChirico
Member

@dpprdan is mingw32 a Windows system or Unix-alike?

@dpprdan
Contributor Author

dpprdan commented Sep 27, 2018

@MichaelChirico It is Windows (10). mingw is short for "Minimalist GNU for Windows" (I think).

@kiesner

kiesner commented Nov 9, 2018

Any news on this? We have the same problem and unfortunately have to resort to write.csv2 for now.

@MichaelChirico
Member

@kiesner can't you just write with fwrite and use file.rename?

@dpprdan
Contributor Author

dpprdan commented Nov 9, 2018

@MichaelChirico No, not if @kiesner's problem is with umlauts in the path (and not only the file name), see my second example.

@jangorecki
Member

@dpprdan file.rename is not only for renaming files but directories also

@MichaelChirico
Member

@jangorecki I'm starting work on a hacky solution that just uses file.rename internally, what do you think?

@jangorecki
Member

@MichaelChirico there is generally much more interesting stuff to do than such tedious workarounds for filename encoding :) but if it won't bloat the code and will be just a single chunk in the R fwrite, then I don't see any problem.

@MichaelChirico
Member

Hmm actually I'm not able to reproduce the original issue on macOS... so not sure how to test any solution...

@st-pasha
Contributor

st-pasha commented Nov 9, 2018

You could make a PR with a test, mark it [WIP], and then just wait for the results from AppVeyor. This is rather slow, but at least will get the job done.

@MichaelChirico
Member

@dpprdan does file.rename work on your OS?

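library(data.table)
DT = data.table(A = 1:3)  # placeholder data; any table works
f = "ä/test.csv"
# replace every non-ASCII byte with 'XXX' to get an ASCII-only stand-in path
f_ascii = iconv(f, from = 'UTF-8', to = 'ASCII', sub = 'XXX')

dir.create(dirname(f_ascii), showWarnings = FALSE, recursive = TRUE)
fwrite(DT, f_ascii)
# move the file to the intended path; the target directory must already exist
file.rename(f_ascii, f)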

@dpprdan
Contributor Author

dpprdan commented Nov 11, 2018

@MichaelChirico Your example probably works, but the issue with umlauts in the path name extends to existing directories, possibly in the user’s home directory or on network drives. And TBH I fail to see how temporarily renaming existing paths that could potentially be used by other processes and users is a good idea.

I dug a little deeper and this is what I discovered.

  1. fread cannot handle UTF-8 encoded file names/paths either:
library(data.table)
setwd(tempdir())
DF = data.frame(A=1:3, B=c("foo","A,Name","baz"))
fwrite(DF, "test.csv")
file.rename("test.csv", "töst.csv")
#> [1] TRUE
list.files(pattern = "\\.csv")
#> [1] "töst.csv"
fname <- "töst.csv"
Encoding(fname)
#> [1] "latin1"
fname_utf8 <- enc2utf8(fname)
Encoding(fname_utf8)
#> [1] "UTF-8"
fread(fname)
#>    A      B
#> 1: 1    foo
#> 2: 2 A,Name
#> 3: 3    baz
fread(fname_utf8)
#> Error in fread(fname_utf8): File not found: töst.csv

This is probably not intended. At least, lines 56 ff. of fread.h read: “Name of the file to open (a \0-terminated C string). If the file name contains non-ASCII characters, it should be UTF-8 encoded (however fread will not validate the encoding).” (emphasis mine) The same comment can be found on lines 31 ff. of fwrite.h.
This also has practical significance, because e.g. basename(), dirname(), and path.expand() all convert the path to UTF-8 (at least on R 3.5.1 on Windows 10 and with non-ASCII characters in the path).

Encoding(basename(fname))
#> [1] "UTF-8"
Encoding(dirname("ö/ä"))
#> [1] "UTF-8"
Encoding(path.expand(fname))
#> [1] "UTF-8"

fread(path.expand(fname))
#> Error in fread(path.expand(fname)) : File not found: töst.csv
  2. path.expand converting to UTF-8 is actually the reason why fwrite fails in my original post. Contrary to what I wrote there, the file argument is encoded as latin1 until line 44 of fwrite.R (or rather in the “native encoding”, which is latin1 on Windows with a cp1252 locale). In line 44, path.expand converts it to UTF-8. Now it is no longer in the native encoding, which is why fwrite fails: file or filename is not explicitly marked as UTF-8 and is thus assumed to be in the native encoding in the C part of the function.
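To make the point about encodings concrete, here is a small base-R sketch (the variable names are only illustrative) showing that the same file name has different bytes in latin1 and in UTF-8. R compares the two as equal because it keeps track of the encoding mark, but code that only sees the raw bytes has to guess which encoding they are in.

```r
# The same name, once as UTF-8 and once converted to latin1.
fname_utf8 <- "t\u00f6st.csv"
fname_latin1 <- iconv(fname_utf8, from = "UTF-8", to = "latin1")

Encoding(fname_utf8)
# "UTF-8"
Encoding(fname_latin1)
# "latin1"

# "ö" is two bytes (c3 b6) in UTF-8 but a single byte (f6) in latin1.
charToRaw(fname_utf8)
# 74 c3 b6 73 74 2e 63 73 76
charToRaw(fname_latin1)
# 74 f6 73 74 2e 63 73 76

# At the R level the strings still compare as equal, because R translates
# marked strings before comparing them.
fname_utf8 == fname_latin1
# TRUE
```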

BTW, on macOS the native encoding is UTF-8, which is why you cannot reproduce this, @MichaelChirico, because everything stays in UTF-8 there.

From my point of view there are two options to fix this:

  1. Change the encoding of the file parameter in the R code with enc2native(file), so that file is passed to fwriteR in the native encoding.
  2. It would be safer though, IMHO, and in accordance with the comment mentioned above, to stick to UTF-8 encoded file paths in both fread and fwrite (file would have to be encoded as UTF-8 first with enc2utf8 in fread, however) and mark them as such when passing them to the C code (with something like mkCharCE(filename, CE_UTF8), if my Google-fu serves me right).
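The R side of the two options could look roughly like the sketch below. Both helper names are hypothetical and not part of data.table, and the C-side changes that option 2 would also need are not shown.

```r
# Hypothetical helpers (not data.table functions): normalise a path before
# it is handed to code that only sees raw bytes.

# Option 1: expand the path, then convert it back to the session's native
# encoding, so byte-oriented code that assumes "native" gets what it expects.
path_for_native_c <- function(file) {
  stopifnot(is.character(file), length(file) == 1L, !is.na(file))
  enc2native(path.expand(file))
}

# Option 2 (R side only): always hand over UTF-8. The receiving code would
# then have to treat the bytes as UTF-8 rather than as native.
path_for_utf8_c <- function(file) {
  stopifnot(is.character(file), length(file) == 1L, !is.na(file))
  enc2utf8(path.expand(file))
}

# Plain ASCII paths pass through both helpers unchanged.
path_for_native_c("test.csv")
path_for_utf8_c("dir/test.csv")
```

With option 1 a path that cannot be represented in the native encoding is altered by the conversion, whereas option 2 keeps every path intact but relies on the byte-level code handling UTF-8 correctly.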

@MichaelChirico
Member

MichaelChirico commented Nov 11, 2018 via email

@dpprdan
Contributor Author

dpprdan commented Nov 11, 2018

I only know how to do the enc2native(file) solution, which, IMO, is duct tape as well, just a different brand.

The second option should be a quick fix as well for someone who knows how to do that in C, and it would be more robust, I think.

But I can surely do a PR with the first option if that's what you want.
