Weird issue related to index and non-ASCII character #1826

shrektan · 2016-08-25T07:48:47Z

Hi, I want to report an issue with non-ASCII characters when joining on an index or key. It's complicated to explain in words. Luckily, I have a reproducible example, shown below (it took me 3 hours to find T.T):

Under current dev version (1.9.7) of `data.table`

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##             PL_Type HS_Port_Code
## 1: 公允价值变动损益           NA
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                                              
##  version  R version 3.3.1 (2016-06-21)                       
##  system   i386, mingw32                                      
##  ui       RTerm                                              
##  language (EN)                                               
##  collate  Chinese (Simplified)_People's Republic of China.936
##  tz       Asia/Taipei                                        
##  date     2016-08-25
## Packages ------------------------------------------------------------------
##  package    * version date       source        
##  data.table * 1.9.7   2016-08-25 local         
##  devtools     1.12.0  2016-06-24 CRAN (R 3.3.1)
##  digest       0.6.10  2016-08-02 CRAN (R 3.3.1)
##  evaluate     0.9     2016-04-29 CRAN (R 3.2.5)
##  htmltools    0.3.5   2016-03-21 CRAN (R 3.2.4)
##  knitr        1.14    2016-08-13 CRAN (R 3.3.1)
##  magrittr     1.5     2014-11-22 CRAN (R 3.1.2)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.2.3)
##  Rcpp         0.12.4  2016-03-26 CRAN (R 3.2.4)
##  rmarkdown    1.0     2016-07-08 CRAN (R 3.3.1)
##  stringi      1.1.1   2016-05-27 CRAN (R 3.2.5)
##  stringr      1.1.0   2016-08-19 CRAN (R 3.3.1)
##  withr        1.0.2   2016-06-20 CRAN (R 3.2.5)

Under CRAN version (1.9.6) of `data.table`

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
## Warning in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends,
## nomatch, : A known encoding (latin1 or UTF-8) was detected in a join
## column. data.table compares the bytes currently, so doesn't support *mixed*
## encodings well; i.e., using both latin1 and UTF-8, or if any unknown
## encodings are non-ascii and some of those are marked known and others
## not. But if either latin1 or UTF-8 is used exclusively, and all unknown
## encodings are ascii, then the result should be ok. In future we will check
## for you and avoid this warning if everything is ok. The tricky part is
## doing this without impacting performance for ascii-only cases.
##             PL_Type HS_Port_Code
## 1: 公允价值变动损益           NA
devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                                              
##  version  R version 3.3.1 (2016-06-21)                       
##  system   i386, mingw32                                      
##  ui       RTerm                                              
##  language (EN)                                               
##  collate  Chinese (Simplified)_People's Republic of China.936
##  tz       Asia/Taipei                                        
##  date     2016-08-25
## Packages ------------------------------------------------------------------
##  package    * version date       source        
##  chron        2.3-47  2015-06-24 CRAN (R 3.2.1)
##  data.table * 1.9.6   2015-09-19 CRAN (R 3.3.1)
##  devtools     1.12.0  2016-06-24 CRAN (R 3.3.1)
##  digest       0.6.10  2016-08-02 CRAN (R 3.3.1)
##  evaluate     0.9     2016-04-29 CRAN (R 3.2.5)
##  htmltools    0.3.5   2016-03-21 CRAN (R 3.2.4)
##  knitr        1.14    2016-08-13 CRAN (R 3.3.1)
##  magrittr     1.5     2014-11-22 CRAN (R 3.1.2)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.2.3)
##  Rcpp         0.12.6  2016-07-19 CRAN (R 3.3.1)
##  rmarkdown    1.0     2016-07-08 CRAN (R 3.3.1)
##  stringi      1.1.1   2016-05-27 CRAN (R 3.2.5)
##  stringr      1.1.0   2016-08-19 CRAN (R 3.3.1)
##  withr        1.0.2   2016-06-20 CRAN (R 3.2.5)

Note

As you can see, the behavior differs between the two versions of data.table, and I can't reproduce the example without the csv file. I'm not sure whether it only occurs when the data is read from a csv file or from a database... In my real cases, it goes like "at first it's OK, but when I set the encoding to native, it stops working. Then I set it to UTF-8 and it's OK. Then I set it to native again, and it works~"...

I strongly suspect it's related to commits from the last 3 months, because I update the dev version of data.table fairly regularly.

BTW, I installed the dev version of data.table following the instructions at https://github.com/Rdatatable/data.table/wiki/Installation:

remove.packages("data.table")                         # First remove the current version
install.packages("data.table", type = "source",
    repos = "http://Rdatatable.github.io/data.table") # Then install devel version

MichaelChirico · 2016-08-25T14:56:01Z

One shortcoming seems to be the limited encoding parameter of fread, which cannot accept "GB2312" at the moment.

However, the following seems to work:

DT <- fread("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
            encoding = "unknown")

DT[ , PL_Type := iconv(PL_Type, "GB2312", "UTF-8")]
setkey(DT, PL_Type)
DT[J("公允价值变动损益")]
#              PL_Type HS_Port_Code
#  1: 公允价值变动损益         2042
#  2: 公允价值变动损益         2013
#  3: 公允价值变动损益         2032
#  4: 公允价值变动损益         2052
#  5: 公允价值变动损益         2035
#  6: 公允价值变动损益         2022
#  7: 公允价值变动损益         2015
#  8: 公允价值变动损益         2025
#  9: 公允价值变动损益         2023
# 10: 公允价值变动损益         2012
# 11: 公允价值变动损益         2055
# 12: 公允价值变动损益         8212
# 13: 公允价值变动损益         8222
# 14: 公允价值变动损益         2045

The iconv part doesn't seem too expensive -- perhaps we can just have fread do this iconv step under the hood if the encoding is something atypical?

DT <- fread("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
            encoding = "unknown")

DTN <- rbindlist(lapply(integer(1e5), function(...) DT))

system.time(DTN[ , PL_Type := iconv(PL_Type, "GB2312", "UTF-8")])
#    user  system elapsed 
#   3.056   0.000   3.055

In fact, read.csv's handling of fileEncoding appears tightly related to iconv; from ?read.table:

The encoding of the input/output stream of a connection can be specified by name in the same way as it would be given to iconv

iconvlist() will also be helpful for flagging inappropriate inputs.

Barring that implementation, a note in ?fread regarding strange encodings and the utility of iconv could suffice.
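
The "iconv step" discussed above could look roughly like the sketch below. It is only an illustration in base R, not something fread does: `to_utf8_columns` is a hypothetical helper name, and the check against iconvlist() assumes that the platform's list includes the requested encoding.

```r
# Hypothetical helper (not part of data.table): convert every character
# column of a data.frame from a declared source encoding to UTF-8.
to_utf8_columns <- function(df, from) {
  # iconvlist() can fail on some platforms, so treat it as an optional check.
  known <- tryCatch(iconvlist(), error = function(e) character())
  if (length(known) && !toupper(from) %in% toupper(known)) {
    stop("encoding not listed by iconvlist(): ", from)
  }
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = "UTF-8")
  df
}

# Usage: build a column holding GB2312 bytes, then normalise it.
gb <- iconv("\u516c\u5141", "UTF-8", "GB2312")  # bytes are GB2312, unmarked
df <- data.frame(PL_Type = gb, HS_Port_Code = 2042L, stringsAsFactors = FALSE)
out <- to_utf8_columns(df, from = "GB2312")
Encoding(out$PL_Type)
```

After the conversion the column is marked as UTF-8, so it no longer depends on the session's native encoding.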

shrektan · 2016-08-25T16:19:52Z

@MichaelChirico Sorry, I don't understand why this issue has any relation to fread... BTW, as I mentioned above, the same issue happens when fetching from a database...

MichaelChirico · 2016-08-25T16:40:22Z

Because ideally, the encoding issue would be handled immediately upon incorporating the data into R.

To me, an ideal workflow for this would be:

DT <- fread("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
            encoding = "GB2312", key = "PL_Type")
DT[.("公允价值变动损益")]
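
The call above is a wish rather than working code, since fread does not accept encoding = "GB2312". A rough two-step approximation is sketched below, assuming a tiny temporary file as a stand-in for the real csv; it reads the raw bytes with fread and converts the key column with iconv before keying, and it also converts the lookup value so both sides of the join are UTF-8.

```r
library(data.table)

# Write a small GB2312-encoded CSV to a temporary file (made-up stand-in data).
txt <- "PL_Type,HS_Port_Code\n\u516c\u5141,2042\n\u7ea2\u5229,2013\n"
path <- tempfile(fileext = ".csv")
writeBin(iconv(txt, "UTF-8", "GB2312", toRaw = TRUE)[[1]], path)

# Step 1: read the bytes as-is, without declaring an encoding.
DT <- fread(path, encoding = "unknown")

# Step 2: convert the key column to UTF-8, then set the key.
DT[, PL_Type := iconv(PL_Type, "GB2312", "UTF-8")]
setkey(DT, PL_Type)

# Convert the lookup value too, so the join compares UTF-8 with UTF-8.
res <- DT[.(enc2utf8("\u516c\u5141"))]
res
```

Converting before setkey() matters here because the key order is computed from the column as it is stored at that moment.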

shrektan · 2016-08-26T01:27:59Z

@MichaelChirico Yes, for csv files, fread lacks read.csv's support for arbitrary encodings.

However, I don't think this issue itself is directly related to fread, because it happens not only with csv files but also with data read from a database, which fread cannot handle... Also, the case itself is rather complicated, as I described above. It seems like something went wrong when implementing support for indexing character columns in different encodings, as in 03cd45f .

But I don't think it's this commit 03cd45f that causes this case, since it was committed in Jan. So, my guess is it's related to some internal changes for encoding after that.

@arunsrinivasan Please take a look at this issue... Thanks.

arunsrinivasan · 2016-08-26T17:03:13Z

On OS X 10.11.6, Ubuntu 14 and 16, I get this:

> library(data.table)
data.table 1.9.7 IN DEVELOPMENT built 2016-08-26 16:52:35 UTC
For help type ?data.table or https://github.com/Rdatatable/data.table/wiki
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
> dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
+                stringsAsFactors = FALSE, fileEncoding = "GB2312")
> setDT(dt)
> setkey(dt, PL_Type)
> dt[J("公允价值变动损益")]
             PL_Type HS_Port_Code
 1: 公允价值变动损益         2042
 2: 公允价值变动损益         2013
 3: 公允价值变动损益         2032
 4: 公允价值变动损益         2052
 5: 公允价值变动损益         2035
 6: 公允价值变动损益         2022
 7: 公允价值变动损益         2015
 8: 公允价值变动损益         2025
 9: 公允价值变动损益         2023
10: 公允价值变动损益         2012
11: 公允价值变动损益         2055
12: 公允价值变动损益         8212
13: 公允价值变动损益         8222
14: 公允价值变动损益         2045

Could you please replace your session_info() output with the normal sessionInfo() output, which gives nicer platform, "running under" and locale info?

MichaelChirico · 2016-08-26T17:06:28Z

Hmm that's odd. I swear when this was posted I was getting the same thing as OP on Linux Mint (over Ubuntu 14.04)... don't think I've touched my install since then (startup says built 2016-08-20 14:14:50 UTC)...

shrektan · 2016-08-28T03:12:13Z

@arunsrinivasan Sorry, I didn't make it clear. I tested it under win7. Below is the new test code using sessionInfo

Under current dev version (1.9.7) of data.table

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##             PL_Type HS_Port_Code
## 1: 公允价值变动损益           NA
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.7
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    tools_3.3.1     htmltools_0.3.5 Rcpp_0.12.4    
##  [5] stringi_1.1.1   rmarkdown_1.0   knitr_1.14      stringr_1.1.0  
##  [9] digest_0.6.10   evaluate_0.9

Under CRAN version (1.9.6) of data.table

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
## Warning in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends,
## nomatch, : A known encoding (latin1 or UTF-8) was detected in a join
## column. data.table compares the bytes currently, so doesn't support *mixed*
## encodings well; i.e., using both latin1 and UTF-8, or if any unknown
## encodings are non-ascii and some of those are marked known and others
## not. But if either latin1 or UTF-8 is used exclusively, and all unknown
## encodings are ascii, then the result should be ok. In future we will check
## for you and avoid this warning if everything is ok. The tricky part is
## doing this without impacting performance for ascii-only cases.
##             PL_Type HS_Port_Code
## 1: 公允价值变动损益           NA
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.6
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    tools_3.3.1     htmltools_0.3.5 Rcpp_0.12.6    
##  [5] stringi_1.1.1   rmarkdown_1.0   knitr_1.14      stringr_1.1.0  
##  [9] digest_0.6.10   chron_2.3-47    evaluate_0.9

shrektan · 2016-08-28T03:21:42Z

Also, the code produces a different output on my Mac. I guess it's because the native encoding is not UTF-8 on Windows.
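
One way to check that guess on a given machine is base R's l10n_info(), which reports whether the session's native encoding is UTF-8. The snippet is only a diagnostic sketch; the advice in the message reflects the enc2utf8() workaround shown earlier in this thread, not an official recommendation.

```r
# Ask R whether the current locale's native encoding is UTF-8.
info <- l10n_info()
native_is_utf8 <- isTRUE(info[["UTF-8"]])

if (native_is_utf8) {
  message("Native encoding is UTF-8: native strings already use UTF-8 bytes.")
} else {
  message("Native encoding is not UTF-8: consider enc2utf8() on key columns before setkey().")
}
```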

Result in Mac OS X with data.table 1.9.7

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12 (Sierra)
## 
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.7
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    tools_3.3.1     htmltools_0.3.5 Rcpp_0.12.6    
##  [5] stringi_1.1.1   rmarkdown_1.0   knitr_1.14      stringr_1.1.0  
##  [9] digest_0.6.10   evaluate_0.9

shrektan · 2016-08-30T14:55:08Z

Sorry, but can anyone reproduce this?

shrektan · 2016-09-09T01:17:34Z

@arunsrinivasan Sorry, I know you're very busy. However, I personally think it's an important issue for data.table users who use non-ASCII characters on Windows, since all of them will run into the same issue...

I don't have the expertise to fix the problem... So, when you're available, please take a look at this...

Thanks so much.

jangorecki · 2016-09-09T02:10:24Z

On Ubuntu and 3.3.1 it works as on Mac, so it is probably something Windows-related

library(data.table)
#data.table 1.9.7 IN DEVELOPMENT built 2016-09-01 21:24:37 UTC; travis
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
#             PL_Type HS_Port_Code
# 1: 公允价值变动损益         2042
# 2: 公允价值变动损益         2013
# 3: 公允价值变动损益         2032
# 4: 公允价值变动损益         2052
# 5: 公允价值变动损益         2035
# 6: 公允价值变动损益         2022
# 7: 公允价值变动损益         2015
# 8: 公允价值变动损益         2025
# 9: 公允价值变动损益         2023
#10: 公允价值变动损益         2012
#11: 公允价值变动损益         2055
#12: 公允价值变动损益         8212
#13: 公允价值变动损益         8222
#14: 公允价值变动损益         2045

MichaelChirico · 2016-09-09T02:39:33Z

Yes, please be sure to report your system specs if you can't get it to work on development.

shrektan · 2016-09-20T05:15:23Z

@arunsrinivasan @jangorecki @MichaelChirico I have put my system session info above #1826 (comment)

I tried installing different commits of data.table to see which one causes this bug, and found that it is 03cd45f

Please help me when you have the time (and please remove the label not reproducible, since I can reproduce it on my colleague's computer... I'm pretty sure it can be reproduced on every Windows machine as long as the default encoding is not UTF-8). I really appreciate that! Thanks again.

Here's the session info again:

d> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.7

loaded via a namespace (and not attached):
[1] rsconnect_0.4.3 tools_3.3.1     withr_1.0.2     memoise_1.0.0  
[5] digest_0.6.10   devtools_1.12.0

d> devtools::session_info()
Session info -------------------------------------------------------------
 setting  value                                              
 version  R version 3.3.1 (2016-06-21)                       
 system   i386, mingw32                                      
 ui       RStudio (0.99.902)                                 
 language (EN)                                               
 collate  Chinese (Simplified)_People's Republic of China.936
 tz       Asia/Taipei                                        
 date     2016-09-20                                         

Packages -----------------------------------------------------------------
 package    * version date       source        
 data.table * 1.9.7   2016-09-09 local         
 devtools     1.12.0  2016-06-24 CRAN (R 3.3.1)
 digest       0.6.10  2016-08-02 CRAN (R 3.3.1)
 memoise      1.0.0   2016-01-29 CRAN (R 3.2.3)
 rsconnect    0.4.3   2016-05-02 CRAN (R 3.2.5)
 withr        1.0.2   2016-06-20 CRAN (R 3.2.5)

shrektan · 2016-09-20T05:24:26Z

BTW, the R version might matter, since the function base::match was modified in R3.3.0 and there's a bug that has been fixed in R3.3.1 (see https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885)

shrektan · 2016-09-25T07:16:12Z

Well, can anybody take a look at this? Or should I open a new issue?

jangorecki · 2016-09-25T12:10:19Z

I doubt opening another issue will help, at least as long as this one is still open. If you are in a rush, you can always use a fork until it is resolved in master. This is common practice, not just in R but in open source projects generally. There is nothing wrong with it. Many companies modify open source projects to better fit their needs.

shrektan · 2016-09-25T12:21:13Z

@jangorecki thanks for the advice. 👍

shrektan · 2017-03-14T02:38:49Z

@arunsrinivasan @jangorecki First of all, thanks for your attention to this issue. After some experiments, I think I have located the root cause of this issue.

It's because the same strings can sort in a different order under different encodings.

See the example below:

library(data.table)
dtRaw <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dtRaw)
dt <- data.table(CN = unique(dtRaw$PL_Type), VALUE = seq_len(uniqueN(dtRaw$PL_Type)))
setkey(dt, CN)

dt2 <- data.table(CN = enc2utf8(unique(dtRaw$PL_Type)), VALUE = seq_len(uniqueN(dtRaw$PL_Type)))
setkey(dt2, CN)

print(dt)
##                   CN VALUE
##  1: 公允价值变动损益     1
##  2:         红利收入     2
##  3:         汇兑损益    10
##  4:         价差收入     3
##  5:         交易费用     9
##  6:         利息支出     5
##  7:         利息收入     4
##  8:     其他业务支出     7
##  9:   营业税金及附加     6
## 10:     资产减值损失     8
print(dt2)
##                   CN VALUE
##  1:         交易费用     9
##  2:         价差收入     3
##  3: 公允价值变动损益     1
##  4:     其他业务支出     7
##  5:         利息支出     5
##  6:         利息收入     4
##  7:         汇兑损益    10
##  8:         红利收入     2
##  9:   营业税金及附加     6
## 10:     资产减值损失     8
dt[J("公允价值变动损益")]
##                  CN VALUE
## 1: 公允价值变动损益    NA
dt2[J("公允价值变动损益")]
##                  CN VALUE
## 1: 公允价值变动损益     1

Here is the sessionInfo. I strongly believe this only occurs on the Windows platform:

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936  LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] tools_3.3.2          withr_1.0.2          memoise_1.0.0        digest_0.6.12        devtools_1.12.0.9000

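What data.table currently does, as far as I can tell, is compare strings in UTF-8 encoding during the binary search, while the key was sorted by the strings' raw (native) encoding, so the search can miss rows that are actually there...

So I guess the fix should be: when setkey() is called on a data.table, order the strings by their UTF-8 encoding instead of their raw encoding.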
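
The ordering difference can be seen without data.table at all. The base R sketch below orders three of the leading characters from the table above (公, 红, 交, written as Unicode escapes) by their raw bytes in each encoding; `byte_order` is a hypothetical helper written for this illustration, and it assumes the platform's iconv supports "GB2312".

```r
# Three single-character strings as Unicode escapes: 公, 红, 交.
x <- c("\u516c", "\u7ea2", "\u4ea4")

# Hypothetical helper: order strings by their raw bytes in a given encoding.
byte_order <- function(s, encoding) {
  raws <- iconv(enc2utf8(s), "UTF-8", encoding, toRaw = TRUE)
  # Render each byte sequence as a hex string so it can be ordered as text.
  hex <- vapply(raws, function(r) paste(as.character(r), collapse = ""),
                character(1))
  order(hex, method = "radix")  # radix ordering always uses the C locale
}

ord_gb   <- byte_order(x, "GB2312")
ord_utf8 <- byte_order(x, "UTF-8")

print(x[ord_gb])
print(x[ord_utf8])
identical(ord_gb, ord_utf8)
```

The GB2312 byte order keeps the strings as 公, 红, 交, matching the row order of dt, while the UTF-8 byte order gives 交, 公, 红, matching dt2. A binary search that assumes one order over data sorted in the other can therefore land in the wrong place.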

UPDATE: Actually, after running the script above, if you call setkey(dt, CN) again, you get this warning:

Warning message:
In setkeyv(x, cols, verbose = verbose, physical = physical) :
  Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

And I thought it should have been fixed by 409d709

But it didn't... I have no clue now... 😭

Thanks.

shrektan · 2017-05-02T13:59:02Z

I know that you guys are very busy. However, can anyone take a look? Maybe you can give me some hints so that I can help solve this issue? I would really appreciate it. Thanks.

shrektan · 2017-11-03T04:29:14Z

Closing because the conversation here has become quite confusing.

arunsrinivasan added the not reproducible label Aug 26, 2016

arunsrinivasan removed the not reproducible label Sep 25, 2016

arunsrinivasan added this to the v2.0.0 milestone Sep 25, 2016

shrektan closed this as completed Nov 3, 2017

shrektan mentioned this issue Jan 13, 2018

fix the bug when keys contain non UTF8 strings #2566

Merged

mattdowle modified the milestones: Candidate, v1.11.0 May 10, 2018

shrektan added the encoding issues related to Encoding label Sep 9, 2019
