Accented column names: current regression #1726

Open
fabnicol opened this issue Jun 3, 2016 · 17 comments
Labels
bug, encoding (issues related to Encoding), regression

Comments

@fabnicol

fabnicol commented Jun 3, 2016

A regression has crept in some time after March 12 (sha1 c250e9f) and before current master branch code as of June 2nd.

It is related to accented (column) variable names, specifically when the syntax dt[ , accented_variable] is used, i.e. dt[ , Année].
Error message says the Année object is not found.
The bug does not show up when the alternative syntax dt[ , "Année", with = FALSE] is used or with non-accented variable names.

Platform is: Windows10, libraries built using Rtools 3.3.0.1959 from source code, encoding is ISO-8859-1.
Edit: Bug shows up under Windows7 too.

@jangorecki
Member

jangorecki commented Jun 3, 2016

Could you provide a reproducible example? I tested the following code on Ubuntu and could not reproduce the issue, so it may be Windows-specific. Either way, a reproducible example is needed to address any issue.

library(data.table)
dt=data.table(Année=1)
dt[,Année]
#[1] 1

@fabnicol
Author

fabnicol commented Jun 3, 2016

Partial solution: the issue comes from the "encoding" parameter of fread.
Minimal example:

A <- data.table::fread(input="Année;Mois\n2011;1", sep=";", encoding = "Latin-1")
A[, Année]
# Error in eval(expr, envir, enclos) : object 'Année' not found

B <- data.table::fread(input="Année;Mois\n2011;1", sep=";", encoding = "unknown")
B[, Année]
# [1] 2011

However, "Latin-1" does not appear to be an invalid value. Trying "ISO-8859-1" instead is rejected, and the error message lists "Latin-1" as allowed:

data.table::fread(input="Année;Mois\n2011;1", sep=";",   encoding = "ISO-8859-1")        
# Error in data.table::fread(input = "Année;Mois\n2011;1", sep = ";", encoding = "ISO-8859-1") :     
#   Argument 'encoding' must be 'unknown', 'UTF-8' or 'Latin-1'.   

Currently I've been circumventing the issue using "unknown".
This is obviously not ideal, as the encoding parameter looks faulty.

@jangorecki
Member

jangorecki commented Jun 3, 2016

The fread manual is quite clear about the allowed values for that argument, so there is no point in trying ISO-8859-1. Latin-1 should work here: the data is read correctly, but the column cannot be accessed afterwards.
I can reproduce it on Ubuntu using the latin1.txt data and a recent devel version.

library(data.table)
A = data.table::fread("https://github.com/Rdatatable/data.table/files/298049/latin1.txt", sep=";", encoding="Latin-1")
A[, Année]
#Error in eval(expr, envir, enclos) : object 'Année' not found 
Encoding(names(A))
#[1] "latin1"  "unknown"
sessionInfo()
#R version 3.3.0 (2016-05-03)
#Platform: x86_64-pc-linux-gnu (64-bit)
#Running under: Ubuntu 15.10
#
#locale:
# [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
# [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
# [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
# [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
# [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base     
#
#other attached packages:
#[1] data.table_1.9.7
#
#loaded via a namespace (and not attached):
#[1] curl_0.9.7

Issue looks to be related to #1680
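
The Encoding(names(A)) output above shows the first column name marked "latin1". As a minimal base-R sketch (not taken from the data.table sources; the variable names are illustrative), the same text can be stored under two encodings with different bytes, which is the kind of mismatch a byte-wise name lookup could trip over:

```r
# Same text, two encodings (illustrative names, base R only)
name_utf8   <- "Ann\u00e9e"                                      # marked UTF-8
name_latin1 <- iconv(name_utf8, from = "UTF-8", to = "latin1")   # marked latin1

Encoding(name_utf8)
Encoding(name_latin1)

# The underlying bytes differ: the accented e is c3 a9 in UTF-8 but e9 in Latin-1
charToRaw(name_utf8)
charToRaw(name_latin1)

# R's == translates both sides before comparing, so the strings are equal as text
name_utf8 == name_latin1

# Converting according to the mark brings both to the same bytes
identical(charToRaw(enc2utf8(name_latin1)), charToRaw(name_utf8))
```

Any code that compares names by raw bytes, rather than through R's encoding-aware comparison, would treat these two as different.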

@arunsrinivasan
Member

Thanks for catching and reporting this. I have not looked at the code yet, but I think my assumption was wrong: I assumed that mkCharLenCE() marks a string only after checking that its encoding matches the desired encoding. We will need to add that check ourselves.
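
The difference between marking a string with an encoding and checking or converting it can be illustrated at the R level. This is only an analogy using base R's `Encoding<-`, which sets the declared encoding without touching the bytes; it is not the mkCharLenCE() C code discussed here, and the variable names are illustrative:

```r
x <- "Ann\u00e9e"          # UTF-8 bytes, marked UTF-8
y <- x
Encoding(y) <- "latin1"    # changes only the mark: no validation, no conversion

# Re-marking leaves the bytes untouched
identical(charToRaw(x), charToRaw(y))

# But the text is now read differently: the two UTF-8 bytes of the accented e
# are taken as two separate Latin-1 characters, so converting back to UTF-8
# yields a longer, different string
y_as_utf8 <- enc2utf8(y)
length(charToRaw(x))
length(charToRaw(y_as_utf8))
```

A mark applied without a check can therefore silently turn a valid name into a different one.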

@fabnicol
Author

fabnicol commented Jun 4, 2016

@arunsrinivasan
Your diagnosis is correct.
The regression was caused by the fix involving mkCharLenCE() at commit f91bba1 on April 27.
Retesting with the library built from source at the preceding commit d6f7959, my minimal test above gives the correct output for A[, Année].

@arunsrinivasan
Member

@fabnicol why did you close this?

@fabnicol
Author

fabnicol commented Sep 16, 2018

Follow-up on this issue and related bug (in my opinion same cause).
Commit: faeae2e
Same issue with 1.11.4.

Reproducible examples:

A <- data.table::fread(input="Année;Mois\n2011;1", sep=";", encoding = "Latin-1")
A[, Année]
# Error in `[.data.table`(A, , Année) :
#   j (the 2nd argument inside [...]) is a single symbol but column name 'Année' is not found. Perhaps you intended DT[, ..Année]. This difference to data.frame is deliberate and explained in FAQ 1.1.

system("echo 'A[, Année][]' > a.R && iconv -f UTF-8 -t ISO-8859-1 a.R > b.R")
source("b.R", encoding="ISO-8859-1")
# same error message

@fabnicol
Author

fabnicol commented Sep 16, 2018

As this nagging issue does not seem to be documented and is very annoying for coders who do not work only in English, I would advise adding a prominent warning to the official documentation, to the effect that Latin-1 data should not have non-ASCII column names (accented values in the rows are OK).
Below is a hack that may come in handy for users and give ideas to devs:

names(A) <- iconv(names(A), to = "UTF-8")
A[, Année]
# [1] 2011

I usually work around the problem in this (not ideal) way.
Pending deeper fixes, fread could be patched as follows:

fread_ <- function(...) {
  DT <- data.table::fread(...)
  # Re-encode the column names to UTF-8 if any of them is marked as Latin-1
  if (any(Encoding(names(DT)) == "latin1")) names(DT) <- iconv(names(DT), to = "UTF-8")
  DT
}

A <- fread_(input="Année;Mois\n2011;1", sep=";", encoding = "Latin-1")
A[, Année]

yields the expected 2011
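
A variant of the same idea, as a sketch only: enc2utf8() is documented to take each string's marked encoding into account, so it might depend less on the session locale than iconv() with its default `from`. The helper name `utf8_names` is hypothetical, and a plain data.frame stands in for the fread result so that the example needs only base R. Whether this restores A[, Année] in a given data.table version is not verified here.

```r
# Hypothetical helper: re-encode column names to UTF-8 according to their marks
utf8_names <- function(df) {
  names(df) <- enc2utf8(names(df))
  df
}

# Stand-in for a table whose first column name is marked latin1
df <- data.frame(a = 2011, Mois = 1)
names(df)[1] <- iconv("Ann\u00e9e", from = "UTF-8", to = "latin1")
Encoding(names(df))   # marks before normalisation

df <- utf8_names(df)
Encoding(names(df))   # marks after normalisation
```

ASCII-only names such as Mois keep the mark "unknown", since they are identical in both encodings.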

@jangorecki
Member

I would say non-ascii names should be avoided in the first place, see #4351

@jangorecki jangorecki added the encoding issues related to Encoding label May 26, 2020
@fabnicol
Author

> I would say non-ascii names should be avoided in the first place, see #4351

I rather disagree with this. The point of this issue is that prior to commit c250e9f, accented column names were entirely OK. They are also OK with base R, at least for Western Latin-script languages of the ISO-8859-1x family. So this cannot be an R problem, contrary to what is written in the comments on issue #4351.

@jangorecki
Member

jangorecki commented May 27, 2020

@fabnicol Thanks for following up. Could you then test whether the following yields the expected results on the "OK" version?

dt[ , Année := 1L]
dt[ , "Année2" := 2L]

AFAIU non-ascii names work in many places, but not in all.
In that case I would lean towards Tomas Kalibera's advice.

@fabnicol
Author

I have rerun my reproducible test from the post above with R version 4.0.0 (2020-04-24) -- "Arbor Day" under Windows 10.
The result is OK now: A[, Année] gives the expected value 2011.
Your assignment tests are OK too: A[, Année := 1L] yields 1, and changing the variable name to Année2 makes no difference.
So it looks like the bug introduced after commit c250e9f was fixed somewhere along the way.
I would suggest closing the issue.

@jangorecki
Member

jangorecki commented May 27, 2020

It seems that Année is UTF-8 but not ASCII. If you tried to use a non-UTF-8 string as a column name, you would run into trouble. If it is fixed, then before closing we should add a unit test so that we are informed if the behavior changes.

@fabnicol
Author

fabnicol commented May 27, 2020

Année cannot be ASCII, although it can be Latin-1 (i.e. ISO-8859-1) or UTF-8, as there are no accented vowels in the ASCII table.
In my fread example above, quoted from a 2016 post, you will notice that the string is imported as Latin-1, not UTF-8. So apparently the issue is closed. The above test stands as a unit test for me.
I've also tried with a real ISO-8859-1 csv input file, with or without the encoding parameter, and it makes no difference under Windows 10.
It would be interesting to pinpoint precisely which commit solved the issue, based on the simple test above.

@fabnicol
Author

The issue is currently closed, as the bug is now fixed with R 4.0.2:

A <- data.table::fread(input="Année;Mois\n2011;1", sep=";", encoding = "Latin-1")

A[ , Année]
# [1] 2011

@jangorecki
Member

I think it makes sense to add a test for that. We can also skip that test on older versions of R.

@jangorecki jangorecki reopened this Aug 19, 2020
@fabnicol
Author

An interesting side issue is that with the current R-devel-win branch for Windows UTF-8, the issue remains if the encoding of the table is Latin-1, but not if it is UTF-8.
This suggests that data.table relies on R's default system encoding of strings, whereas it should process the input taking into account both the system encoding and the value of the encoding parameter.
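
When reproducing this on different platforms, it may help to record which native encoding the R session uses, since the behaviour described above appears to depend on it. A small base-R sketch (the variable name is illustrative):

```r
# Report the native encoding of the current session
info <- l10n_info()
info[["UTF-8"]]             # TRUE when the session runs in a UTF-8 locale
info[["Latin-1"]]           # TRUE when the session runs in a Latin-1 locale
Sys.getlocale("LC_CTYPE")   # full locale string, useful in a bug report
```

Including this output next to sessionInfo() makes reports from Windows and Linux easier to compare.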
