Weird issue related to index and non-ASCII character #1826
Comments
Seems like a shortcoming is the limits to the However, the following seems to work:
The
In fact,
Barring that implementation, a note in |
@MichaelChirico Sorry, I don't understand why this issue has any relation to |
Because ideally, the encoding issue would be handled immediately upon incorporating the data into R. To me, an ideal workflow for this would be:
|
@MichaelChirico Yes, for csv the However, I don't think this issue itself is directly related to But I don't think it's this commit 03cd45f that causes this case, since it was committed in January. So my guess is that it's related to some internal encoding changes made after that. @arunsrinivasan Please take a look at this issue... Thanks. |
On OS X 10.11.6, Ubuntu 14 and 16, I get this:
> library(data.table)
data.table 1.9.7 IN DEVELOPMENT built 2016-08-26 16:52:35 UTC
For help type ?data.table or https://github.com/Rdatatable/data.table/wiki
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
> dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
+ stringsAsFactors = FALSE, fileEncoding = "GB2312")
> setDT(dt)
> setkey(dt, PL_Type)
> dt[J("公允价值变动损益")]
PL_Type HS_Port_Code
1: 公允价值变动损益 2042
2: 公允价值变动损益 2013
3: 公允价值变动损益 2032
4: 公允价值变动损益 2052
5: 公允价值变动损益 2035
6: 公允价值变动损益 2022
7: 公允价值变动损益 2015
8: 公允价值变动损益 2025
9: 公允价值变动损益 2023
10: 公允价值变动损益 2012
11: 公允价值变动损益 2055
12: 公允价值变动损益 8212
13: 公允价值变动损益 8222
14: 公允价值变动损益 2045
Could you please edit your |
Hmm that's odd. I swear when this was posted I was getting the same thing as OP on Linux Mint (over Ubuntu 14.04)... don't think I've touched my install since then (startup says |
@arunsrinivasan Sorry, I didn't make it clear: I tested it under Windows 7. Below is the new test code using
Under current dev version (1.9.7) of data.table
library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
## PL_Type HS_Port_Code
## 1: 公允价值变动损益 NA
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
## PL_Type HS_Port_Code
## 1: 公允价值变动损益 2042
## 2: 公允价值变动损益 2013
## 3: 公允价值变动损益 2032
## 4: 公允价值变动损益 2052
## 5: 公允价值变动损益 2035
## 6: 公允价值变动损益 2022
## 7: 公允价值变动损益 2015
## 8: 公允价值变动损益 2025
## 9: 公允价值变动损益 2023
## 10: 公允价值变动损益 2012
## 11: 公允价值变动损益 2055
## 12: 公允价值变动损益 8212
## 13: 公允价值变动损益 8222
## 14: 公允价值变动损益 2045
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.9.7
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 tools_3.3.1 htmltools_0.3.5 Rcpp_0.12.4
## [5] stringi_1.1.1 rmarkdown_1.0 knitr_1.14 stringr_1.1.0
## [9] digest_0.6.10 evaluate_0.9
Under CRAN version (1.9.6) of data.table
library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
## PL_Type HS_Port_Code
## 1: 公允价值变动损益 2042
## 2: 公允价值变动损益 2013
## 3: 公允价值变动损益 2032
## 4: 公允价值变动损益 2052
## 5: 公允价值变动损益 2035
## 6: 公允价值变动损益 2022
## 7: 公允价值变动损益 2015
## 8: 公允价值变动损益 2025
## 9: 公允价值变动损益 2023
## 10: 公允价值变动损益 2012
## 11: 公允价值变动损益 2055
## 12: 公允价值变动损益 8212
## 13: 公允价值变动损益 8222
## 14: 公允价值变动损益 2045
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
## Warning in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends,
## nomatch, : A known encoding (latin1 or UTF-8) was detected in a join
## column. data.table compares the bytes currently, so doesn't support *mixed*
## encodings well; i.e., using both latin1 and UTF-8, or if any unknown
## encodings are non-ascii and some of those are marked known and others
## not. But if either latin1 or UTF-8 is used exclusively, and all unknown
## encodings are ascii, then the result should be ok. In future we will check
## for you and avoid this warning if everything is ok. The tricky part is
## doing this without impacting performance for ascii-only cases.
## PL_Type HS_Port_Code
## 1: 公允价值变动损益 NA
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.9.6
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 tools_3.3.1 htmltools_0.3.5 Rcpp_0.12.6
## [5] stringi_1.1.1 rmarkdown_1.0 knitr_1.14 stringr_1.1.0
## [9] digest_0.6.10 chron_2.3-47 evaluate_0.9 |
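The `enc2utf8()` step in the transcripts above can be pulled into a small helper that runs right after import, before `setDT()` and `setkey()`. This is only a sketch: `normalize_utf8` is a hypothetical name, not a data.table function, and it assumes a plain data.frame as input. As the 1.9.6 transcript shows, converting to UTF-8 did not fix the lookup on every version, so treat it as a way to make the encoding state explicit rather than a guaranteed fix.

```r
# Hypothetical helper (not part of data.table): convert every character
# column of a plain data.frame to UTF-8 before it becomes a data.table.
normalize_utf8 <- function(df) {
  # `[<-` with a logical index selects rows on a data.table, so this
  # sketch only accepts plain data.frames.
  stopifnot(is.data.frame(df), !inherits(df, "data.table"))
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], enc2utf8)
  }
  df
}

# Small self-contained demo using a latin1-marked string.
s <- "fa\xe7ile"
Encoding(s) <- "latin1"
df <- data.frame(label = s, value = 1L, stringsAsFactors = FALSE)
out <- normalize_utf8(df)
print(Encoding(out$label))
```

With the data from this issue, the call would sit between `read.csv()` and `setDT()`, so that the key is built on strings that all carry the same encoding mark.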
Also, the code produces different output on my Mac. I guess it's because the native encoding is not
Result in Mac OS X with data.table 1.9.7
library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
## PL_Type HS_Port_Code
## 1: 公允价值变动损益 2042
## 2: 公允价值变动损益 2013
## 3: 公允价值变动损益 2032
## 4: 公允价值变动损益 2052
## 5: 公允价值变动损益 2035
## 6: 公允价值变动损益 2022
## 7: 公允价值变动损益 2015
## 8: 公允价值变动损益 2025
## 9: 公允价值变动损益 2023
## 10: 公允价值变动损益 2012
## 11: 公允价值变动损益 2055
## 12: 公允价值变动损益 8212
## 13: 公允价值变动损益 8222
## 14: 公允价值变动损益 2045
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
## PL_Type HS_Port_Code
## 1: 公允价值变动损益 2042
## 2: 公允价值变动损益 2013
## 3: 公允价值变动损益 2032
## 4: 公允价值变动损益 2052
## 5: 公允价值变动损益 2035
## 6: 公允价值变动损益 2022
## 7: 公允价值变动损益 2015
## 8: 公允价值变动损益 2025
## 9: 公允价值变动损益 2023
## 10: 公允价值变动损益 2012
## 11: 公允价值变动损益 2055
## 12: 公允价值变动损益 8212
## 13: 公允价值变动损益 8222
## 14: 公允价值变动损益 2045
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12 (Sierra)
##
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.9.7
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 tools_3.3.1 htmltools_0.3.5 Rcpp_0.12.6
## [5] stringi_1.1.1 rmarkdown_1.0 knitr_1.14 stringr_1.1.0
## [9] digest_0.6.10 evaluate_0.9 |
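Since the results differ between a UTF-8 session (Mac, Ubuntu) and a code page 936 session (Windows), it may help to report the encoding state alongside `sessionInfo()`. Below is a rough diagnostic in base R. `encoding_report` is a made-up helper name, and the choice of fields is my own assumption about what is useful here.

```r
# Hypothetical helper: summarise the encoding marks of the non-ASCII
# strings in a character vector, plus whether the session is UTF-8.
encoding_report <- function(x) {
  stopifnot(is.character(x))
  # A string is non-ASCII if any of its bytes is above 127.
  has_non_ascii <- vapply(
    x,
    function(s) !is.na(s) && any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  list(
    utf8_session = isTRUE(l10n_info()[["UTF-8"]]),
    n_non_ascii  = sum(has_non_ascii),
    marks        = table(Encoding(x[has_non_ascii]))
  )
}

x <- c("abc", "fa\u00e7ile")
report <- encoding_report(x)
print(report)
```

Running `encoding_report(dt$PL_Type)` before and after the `enc2utf8()` step would show whether the key column holds non-ASCII strings marked "unknown", which is the situation the join warning in 1.9.6 describes.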
Sorry, but can anyone reproduce this? |
@arunsrinivasan Sorry, I know you're very busy. However, I personally think it's an important issue for I don't have the expertise to fix the problem... So, when you're available, please take a look at this... Thanks so much. |
On Ubuntu with R 3.3.1 it works as on Mac, so this is probably something Windows-related:
library(data.table)
#data.table 1.9.7 IN DEVELOPMENT built 2016-09-01 21:24:37 UTC; travis
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
# PL_Type HS_Port_Code
# 1: 公允价值变动损益 2042
# 2: 公允价值变动损益 2013
# 3: 公允价值变动损益 2032
# 4: 公允价值变动损益 2052
# 5: 公允价值变动损益 2035
# 6: 公允价值变动损益 2022
# 7: 公允价值变动损益 2015
# 8: 公允价值变动损益 2025
# 9: 公允价值变动损益 2023
#10: 公允价值变动损益 2012
#11: 公允价值变动损益 2055
#12: 公允价值变动损益 8212
#13: 公允价值变动损益 8222
#14: 公允价值变动损益 2045 |
Yes, please be sure to report your system specs if you can't get it to work. |
@arunsrinivasan @jangorecki @MichaelChirico I have put my system session info above in #1826 (comment). I tried to install different commit of Please help me when you have the time. (And please remove the "not reproducible" label, since I can reproduce it on my colleague's computer... I'm pretty sure it can be reproduced on every Windows machine as long as the default encoding is not UTF-8.) I really appreciate that! Thanks again.
Here's the session info again:
d> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.7
loaded via a namespace (and not attached):
[1] rsconnect_0.4.3 tools_3.3.1 withr_1.0.2 memoise_1.0.0
[5] digest_0.6.10 devtools_1.12.0
d> devtools::session_info()
Session info -------------------------------------------------------------
setting value
version R version 3.3.1 (2016-06-21)
system i386, mingw32
ui RStudio (0.99.902)
language (EN)
collate Chinese (Simplified)_People's Republic of China.936
tz Asia/Taipei
date 2016-09-20
Packages -----------------------------------------------------------------
package * version date source
data.table * 1.9.7 2016-09-09 local
devtools 1.12.0 2016-06-24 CRAN (R 3.3.1)
digest 0.6.10 2016-08-02 CRAN (R 3.3.1)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
rsconnect 0.4.3 2016-05-02 CRAN (R 3.2.5)
withr 1.0.2 2016-06-20 CRAN (R 3.2.5) |
BTW, the R version might matter, since the function |
Well, can anybody take a look at this? Or should I open a new issue? |
I doubt opening another issue will help, at least as long as this one is still open. If you are in a rush, you can always use a fork until it is resolved in master; this is common practice, not just in R but in open source projects generally. There is nothing wrong with it. Many companies modify open source projects to better fit their needs. |
@jangorecki Thanks for the advice. 👍 |
@arunsrinivasan @jangorecki First of all, thanks for your attention to this issue. After some experiments, I think I have located the root cause: the strings can sort in a different order under different encodings. See the example below:
library(data.table)
dtRaw <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dtRaw)
dt <- data.table(CN = unique(dtRaw$PL_Type), VALUE = seq_len(uniqueN(dtRaw$PL_Type)))
setkey(dt, CN)
dt2 <- data.table(CN = enc2utf8(unique(dtRaw$PL_Type)), VALUE = seq_len(uniqueN(dtRaw$PL_Type)))
setkey(dt2, CN)
print(dt)
## CN VALUE
## 1: 公允价值变动损益 1
## 2: 红利收入 2
## 3: 汇兑损益 10
## 4: 价差收入 3
## 5: 交易费用 9
## 6: 利息支出 5
## 7: 利息收入 4
## 8: 其他业务支出 7
## 9: 营业税金及附加 6
## 10: 资产减值损失 8
print(dt2)
## CN VALUE
## 1: 交易费用 9
## 2: 价差收入 3
## 3: 公允价值变动损益 1
## 4: 其他业务支出 7
## 5: 利息支出 5
## 6: 利息收入 4
## 7: 汇兑损益 10
## 8: 红利收入 2
## 9: 营业税金及附加 6
## 10: 资产减值损失 8
dt[J("公允价值变动损益")]
## CN VALUE
## 1: 公允价值变动损益 NA
dt2[J("公允价值变动损益")]
## CN VALUE
## 1: 公允价值变动损益 1
The sessionInfo (I strongly believe this only occurs on the Windows platform):
What data.table currently does is compare strings in UTF-8 encoding, after the key has been set, using binary search, so... So the fix I guess should be: When
UPDATED Actually after the script above, if you do
And I thought it should have been fixed by 409d709, but it wasn't... I have no clue now... 😭 Thanks. |
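The ordering difference can be checked without data.table or the csv file. The sketch below takes the first two characters of three of the values above, converts them to GBK, and compares the byte-wise order in each encoding. It assumes the local `iconv` build supports the name "GBK", and it uses a hex dump of the bytes as a stand-in for byte-wise comparison; it is not how data.table sorts internally.

```r
# Illustrative only: compare the byte-wise order of the same strings
# when stored as UTF-8 and when stored as GBK.
x_utf8 <- c("\u516c\u5141", "\u7ea2\u5229", "\u4ea4\u6613")  # 公允, 红利, 交易
x_gbk <- iconv(x_utf8, from = "UTF-8", to = "GBK")
if (anyNA(x_gbk)) stop("this iconv build cannot convert to GBK")

# Lower-case hex of the raw bytes; sorting these ASCII strings in the
# C locale is equivalent to sorting the original byte sequences.
to_hex <- function(s) paste(as.character(charToRaw(s)), collapse = "")
hex_utf8 <- vapply(x_utf8, to_hex, character(1), USE.NAMES = FALSE)
hex_gbk  <- vapply(x_gbk, to_hex, character(1), USE.NAMES = FALSE)

ord_utf8 <- order(hex_utf8, method = "radix")
ord_gbk  <- order(hex_gbk, method = "radix")
print(ord_utf8)
print(ord_gbk)
print(identical(ord_utf8, ord_gbk))
```

If the two orders differ, a key sorted on the bytes of one encoding cannot be searched reliably with a binary search that compares bytes in the other, which matches the difference between `print(dt)` and `print(dt2)` above.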
I know that you guys are very busy. However, can anyone take a look? Maybe you can give me some hints so that I can help solve this issue? I would really appreciate it. Thanks. |
Closing because the conversation here has become quite confused. |
Hi, I want to report an issue with non-ASCII characters when a join uses the index or key. It's complicated to explain in words. Luckily, I have a reproducible example, as follows (it took me 3 hours to find the example T.T):
Under current dev version (1.9.7) of data.table
Under CRAN version (1.9.6) of data.table
Note
As you can see, the behavior changes between the different versions of data.table. And I can't reproduce the example without the csv file. I'm not sure if it only occurs when the data is read from a csv file or from the database... And in my real cases, the thing happens like "at first it's ok, but when I set the encoding to native, it won't work. And then I set to UTF-8, it's ok. And then I set to native again, it works~"... I strongly suspect it's an issue related to commits within the last 3 months, because I update the dev version of data.table fairly regularly.
BTW, I installed the dev version of data.table following the instructions at https://github.com/Rdatatable/data.table/wiki/Installation: