fwrite encoding problem with file names #3078
Comments
Nice catch! |
@dpprdan is |
@MichaelChirico It is Windows (10). mingw is short for "Minimalist GNU for Windows" (I think). |
Any news on this? We have the same problem and unfortunately have to resort to write.csv2 for now. |
@kiesner can't you just write with |
@MichaelChirico No, not if @kiesner's problem is with umlauts in the path (and not only the file name), see my second example. |
@dpprdan |
@jangorecki I'm starting work on a hacky solution that just uses |
@MichaelChirico there is generally much more interesting stuff to do than such tedious workarounds for filename encoding :) but if it won't bloat the code, will be just single chunk in R fwrite then I don't see any problem. |
Hmm actually I'm not able to reproduce the original issue on macOS... so not sure how to test any solution... |
You could make a PR with a test, mark it [WIP], and then just wait for the results from AppVeyor. This is rather slow, but at least will get the job done. |
@dpprdan does
|
@MichaelChirico Your example probably works, but the issue with umlauts in the path name extends to existing directories, possibly in the user’s home directory or on network drives. And TBH I fail to see how temporarily renaming existing paths that could potentially be used by other processes and users is a good idea. I dug a little deeper and this is what I discovered.
1. fread cannot handle UTF-8 encoded file names/paths either:
library(data.table)
setwd(tempdir())
DF = data.frame(A=1:3, B=c("foo","A,Name","baz"))
fwrite(DF, "test.csv")
file.rename("test.csv", "töst.csv")
#> [1] TRUE
list.files(pattern = "\\.csv")
#> [1] "töst.csv"
fname <- "töst.csv"
Encoding(fname)
#> [1] "latin1"
fname_utf8 <- enc2utf8(fname)
Encoding(fname_utf8)
#> [1] "UTF-8"
fread(fname)
#> A B
#> 1: 1 foo
#> 2: 2 A,Name
#> 3: 3 baz
fread(fname_utf8)
#> Error in fread(fname_utf8): File not found: töst.csv
This is probably not intended; at least lines 56 ff. of fread.h read: “Name of the file to open (a \0-terminated C string). If the file name contains non-ASCII characters, it should be UTF-8 encoded (however fread will not validate the encoding).” (emphasis mine) The same comment can be found on lines 31 ff. of fwrite.h.
This also has practical significance, because e.g. basename(), dirname(), or path.expand() all convert the path to UTF-8 (at least on R 3.5.1 on Windows 10 and with non-ASCII characters in the path).
Encoding(basename(fname))
#> [1] "UTF-8"
Encoding(dirname("ö/ä"))
#> [1] "UTF-8"
Encoding(path.expand(fname))
#> [1] "UTF-8"
fread(path.expand(fname))
#> Error in fread(path.expand(fname)) : File not found: töst.csv
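2. path.expand converting to UTF-8 is actually the reason why fwrite fails in my original post. Contrary to what I wrote there, the file argument is encoded as latin1 until line 44 of fwrite.R (or rather in the “native encoding”, which is latin1 on Windows with a cp1252 locale). In line 44 path.expand converts it to UTF-8. Now it is no longer in the native encoding, which is why fwrite fails: the file name is not explicitly marked as UTF-8 and is thus assumed to be in the native encoding in the C part of the function.
BTW, on macOS the native encoding is UTF-8, which is why you cannot reproduce this, @MichaelChirico, because everything stays in UTF-8 there.
From my point of view there are two options to fix this:
1. Change the encoding of the file parameter in the R code with enc2native(file), so that file is passed to fwriteR in the native encoding.
2. It would be safer though, IMHO, and in accordance with the comment mentioned above, to stick to UTF-8 encoded file paths in both fread and fwrite (file would have to be encoded as UTF-8 first with enc2utf8 in fread, however) and mark them as such when passing them to the C code (with something like mkCharCE(filename, CE_UTF8), if my google fu serves me right).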
|
yes, my solution was only intended as duct tape. it sounds like you're
pretty close to understanding the issue well enough to maybe file a PR? if
so that'd be great and put the issue to rest more firmly.
On Sun, Nov 11, 2018, 4:36 PM Daniel Possenriede wrote:
@MichaelChirico <https://github.com/MichaelChirico> Your example probably
works, but the issue with umlauts in the path name extends to existing
directories, possibly in the user’s home directory or on network drives.
And TBH I fail to see where temporarily renaming existing paths that could
potentially be used by other processes and users is a good idea.
I dug a little deeper and this is what I discovered.
1. fread cannot handle UTF-8 encoded file names/paths either:
library(data.table)
setwd(tempdir())
DF = data.frame(A=1:3, B=c("foo","A,Name","baz"))
fwrite(DF, "test.csv")
file.rename("test.csv", "töst.csv")
#> [1] TRUE
list.files(pattern = "\\.csv")
#> [1] "töst.csv"
fname <- "töst.csv"
Encoding(fname)
#> [1] "latin1"
fname_utf8 <- enc2utf8(fname)
Encoding(fname_utf8)
#> [1] "UTF-8"
fread(fname)
#> A B
#> 1: 1 foo
#> 2: 2 A,Name
#> 3: 3 baz
fread(fname_utf8)
#> Error in fread(fname_utf8): File not found: töst.csv
This is probably not intended, at least lines 56 ff. of fread.h
<https://github.com/Rdatatable/data.table/blob/d3ccd8139d3d592201e0f085c5f37a4fc5a426e3/src/fread.h#L56>
reads: “Name of the file to open (a \0-terminated C string). If the file
name contains non-ASCII characters, it *should be UTF-8 encoded* (however
fread will not validate the encoding).” (emphasis mine) The same comment
can be found on lines 31 ff. of fwrite.h
<https://github.com/Rdatatable/data.table/blob/d3ccd8139d3d592201e0f085c5f37a4fc5a426e3/src/fwrite.h#L31>
This also has practical significance, because e.g. basename(), dirname,
or path.expand all convert the path to UTF-8 (at least on R 3.5.1 on
Windows 10 and with non-ASCII characters in the path).
Encoding(basename(fname))
#> [1] "UTF-8"
Encoding(dirname("ö/ä"))
#> [1] "UTF-8"
Encoding(path.expand(fname))
#> [1] "UTF-8"
fread(path.expand(fname))
#> Error in fread(path.expand(fname)) : File not found: töst.csv
2. path.expand converting to UTF-8 is actually the reason why fwrite
fails in my original post. Contrary to what I wrote there, the file
argument is encoded as latin1 until line 44 of fwrite.R
<https://github.com/Rdatatable/data.table/blob/df4e56f5d960741c864204a713aed43d7867f17f/R/fwrite.R#L44>
(or rather in the “native encoding” which is latin1 on Windows with a
cp1252 locale). In line 44 path.expand converts it to UTF-8. Now it is
not in the native encoding anymore, which is why fwrite fails, because
file or filename is not explicitly marked as UTF-8 and thus assumed to
be in the native encoding in the C part of the function.
BTW, on macOS the native encoding is UTF-8, which is why you cannot
reproduce this, @MichaelChirico <https://github.com/MichaelChirico>,
because everything stays in UTF-8 there.
From my point of view there are two options to fix this:
1. Change the encoding of the file parameter in the R code with
enc2native(file), so that file is passed to fwriteR in the native
encoding.
2. It would be safer though, IMHO, and in accordance with the comment
mentioned above to stick to UTF-8 encoded file paths in both fread and
fwrite (file would have to be encoded as UTF-8 first with enc2utf8 in
fread, however) and mark them as such when passing it to the C code
(with something like mkCharCE(filename, CE_UTF8))
<https://github.com/wch/r-source/blob/7f0ae7735816eccba5e2e507543f0486c264bc28/src/main/envir.c#L3839>
if my google fu serves me right
<https://cran.r-project.org/doc/manuals/R-exts.html#Character-encoding-issues>
).
|
I only know how to do the The second option should be a quick fix as well, for someone who knows how to do that in C and would be more robust, I think. But I can surely do a PR with the first option if that's what you want. |
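For illustration, the two directions discussed above could be sketched in R roughly as follows. This is a minimal sketch and not data.table's actual fix: as_native_path() and as_utf8_path() are hypothetical helper names introduced here, and only the base R functions enc2native(), enc2utf8(), and Encoding() are real. The encoding reported for the native variant depends on the session's locale.

```r
# Hypothetical helpers, not part of data.table. They only illustrate the two
# directions in which a file path could be normalised before it reaches C code.

# Option 1: convert to the session's native encoding
# (latin1 on Windows with a cp1252 locale, UTF-8 on macOS).
as_native_path <- function(path) {
  stopifnot(is.character(path), length(path) == 1L, !is.na(path))
  enc2native(path)
}

# Option 2: always convert to UTF-8. The C side would then have to treat the
# bytes as UTF-8 explicitly instead of assuming the native encoding.
as_utf8_path <- function(path) {
  stopifnot(is.character(path), length(path) == 1L, !is.na(path))
  enc2utf8(path)
}

p <- "t\u00f6st.csv"                # "töst.csv"; the \u escape marks it as UTF-8
print(Encoding(p))
print(Encoding(as_utf8_path(p)))
print(Encoding(as_native_path(p)))  # locale-dependent result
```

Plain ASCII paths pass through both helpers unchanged, so either option would only affect paths that contain non-ASCII characters.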
fwrite() cannot handle umlauts (and presumably all non-ASCII chars) in file names and paths on Windows (here with LC_COLLATE=German_Germany.1252, but from my experience this will also be a problem in other non-UTF-8 locales).
When the umlaut is in the file name, fwrite writes the file, but with a faulty file name.
When the umlaut is in the path, fwrite cannot write the file at all.
I looked at and debugged the R code and it seems to me that up until line 67 the file argument is encoded as “UTF-8” (as it should be, IMO) and looks fine. So my guess would be that the file path’s encoding goes wrong in the CfwriteR code.
From looking at the characters that should be “ö” or “ä” respectively, the problem seems to be that CfwriteR gets a UTF-8 string but handles it as if it were encoded as latin-1, see this table.
If the error were in the R code, I would solve it with an Encoding(file) <- "UTF-8" line, but I do not know how this is done in C.
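The kind of misreading suspected here can be reproduced in plain R. This is a minimal sketch, assuming the C code simply treats the UTF-8 bytes of the path as latin-1. It uses only base R and does not call fwrite itself; the variable names are illustrative.

```r
# Start from "ö" marked as UTF-8 (written with an escape so the source stays ASCII).
x <- "\u00f6"

# In UTF-8, "ö" is stored as two bytes.
bytes <- charToRaw(x)
print(bytes)

# Reinterpret the very same bytes as latin1, which is what code does when it
# assumes the native encoding on a cp1252 Windows system.
misread <- rawToChar(bytes)
Encoding(misread) <- "latin1"

# Converting the misread string back to UTF-8 shows the garbled result:
# one character per original byte instead of a single "ö".
garbled <- enc2utf8(misread)
print(nchar(garbled))
print(garbled)
```

Each of the two UTF-8 bytes is shown as its own latin-1 character, which matches the pattern of a single umlaut turning into two unrelated characters in the written file name.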