[R-Forge #4842] setkey to sort using session's locale not C locale #565

Closed
arunsrinivasan opened this issue Jun 8, 2014 · 2 comments

arunsrinivasan commented Jun 8, 2014

Submitted by: Edgaras Dunajevas; Assigned to: Nobody; R-Forge link

data.table sorts strings in the C locale, whereas base R uses the session locale (English_United States.1252 in this case). Here is a reproducible example.

require(data.table)

d = data.frame(cn=c("USA","Ubuntu","Uzbekistan"), stringsAsFactors=FALSE)
# base R: order() collates character vectors using the session locale
d[order(d$cn),,drop=FALSE]
#           cn
#2     Ubuntu
#1        USA
#3 Uzbekistan

# data.table: setting a key sorts the rows in C locale
dt = data.table(d, key="cn")
dt
#            cn
#1:        USA
#2:     Ubuntu
#3: Uzbekistan

Originally reported on Stack Overflow.
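
To see why the C locale puts USA first, it may help to look at the character codes involved. The following is a minimal base R sketch, not data.table's implementation: it assumes plain ASCII input, and the variable names are illustrative only. All three strings start with "U", so the second character decides the order, and uppercase "S" has a smaller code than lowercase "b".

```r
x <- c("USA", "Ubuntu", "Uzbekistan")

# All strings share the first character "U", so compare the second one.
second_chars <- substr(x, 2, 2)

# Code points of "S", "b" and "z": uppercase letters come before lowercase in ASCII.
codes <- vapply(second_chars, utf8ToInt, integer(1), USE.NAMES = FALSE)
print(codes)

# Ordering by those codes reproduces the C-locale order.
by_code <- x[order(codes)]
print(by_code)

# Base R's radix method also sorts character vectors in C locale.
radix_sorted <- sort(x, method = "radix")
print(radix_sorted)
```

Both `by_code` and `radix_sorted` come out as USA, Ubuntu, Uzbekistan, matching the keyed data.table in the report.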

mattdowle commented Sep 26, 2017

data.table always sorts in the C locale. This is now more clearly documented by PR #2387.

There are two reasons:

  • consistency: results are not affected by the environment that R is started in, e.g. servers and services
  • speed: the locale-aware C library calls are slower than data.table's ASCII sort (i.e. C locale)
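
The consistency point can be illustrated in base R alone. This is a rough sketch rather than anything taken from data.table: the default `sort()` on character vectors follows the session's LC_COLLATE setting, so its result can change with the environment, while `method = "radix"` sorts in C locale regardless. The variable names are illustrative.

```r
x <- c("USA", "Ubuntu", "Uzbekistan")

# Remember the current collation setting so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Depends on the locale R was started in; may differ between machines.
session_sorted <- sort(x)

# Force C collation for the default sort method.
Sys.setlocale("LC_COLLATE", "C")
c_collate_sorted <- sort(x)

# Restore the original collation.
Sys.setlocale("LC_COLLATE", old_collate)

# The radix method does not consult LC_COLLATE for character vectors.
radix_sorted <- sort(x, method = "radix")

print(session_sorted)
print(identical(c_collate_sorted, radix_sorted))
```

The last line prints TRUE, whereas `session_sorted` is whatever the host's locale dictates.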

Even if we found an efficient way to offer a locale-sort option, suppose a key were set on column cn in your example. The key would have to record which collation was used to create it (i.e. where USA sorts to), because binary search needs to compare values in the same order the key was built in. Some keys in some tables could have been created with the option set, while other keys in other tables were created later, or loaded from disk, without it, and this could lead to bugs. Sorting is one main reason for data.table's speed, and that theme runs through the whole code base. Allowing a locale-sort option would be too risky for a low benefit.
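
The kind of bug described here can be sketched with a toy binary search in base R. This is only an illustration under stated assumptions: `c_compare()` and `binary_search()` are hypothetical helpers written for this example, not data.table internals, and the "locale" ordering is simply hard-coded from the base R output in the original report. The search compares in C-locale order, so it works on a C-sorted vector but misses a value that is present in a vector sorted under a different collation.

```r
# Hypothetical helper: three-way comparison of two strings in C-locale order.
c_compare <- function(a, b) {
  if (identical(a, b)) return(0L)
  if (sort(c(a, b), method = "radix")[1L] == a) -1L else 1L
}

# Hypothetical helper: binary search that assumes `sorted` is in C-locale order.
# Returns the matching position, or NA if the target is not found.
binary_search <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    cmp <- c_compare(target, sorted[mid])
    if (cmp == 0L) return(mid)
    if (cmp < 0L) hi <- mid - 1L else lo <- mid + 1L
  }
  NA_integer_
}

# Order taken from the base R (English locale) output in the report.
locale_sorted <- c("Ubuntu", "USA", "Uzbekistan")
# The same values in C-locale order.
c_sorted <- sort(locale_sorted, method = "radix")

hit <- binary_search(c_sorted, "Ubuntu")        # search order matches sort order
miss <- binary_search(locale_sorted, "Ubuntu")  # search order does not match

print(hit)
print(miss)
```

Here `hit` is 2, while `miss` is NA even though "Ubuntu" is in the vector: the search steps right past "USA" and then left past "Uzbekistan", because the data was not sorted in the order the comparison assumes.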

See PR #2387 for several new sentences in the documentation.

@MichaelChirico

Excellent explanation here @mattdowle, I think it is worth adding to the FAQ.
