Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Documentation] clarifies 'in C-locale' #2387

Merged
merged 9 commits into from
Sep 27, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,8 @@

7. Added some clarification about the usage of `on` to `?data.table`, [#2383](https://github.com/Rdatatable/data.table/issues/2383). Thanks to @peterlittlejohn for volunteering his confusion and @MichaelChirico for brushing things up.

8. Clarified that "data.table always sorts in `C-locale`" means that upper-case letters are sorted before lower-case letters by ordering in data.table (e.g. `setorder`, `setkey`, `DT[order(...)]`). Thanks to @hughparsonage for the pull request editing the documentation. Note this makes no difference in most cases of data; e.g. ids where only uppercase or lowercase letters are used (`"AB123"<"AC234"` is always true, regardless), or country names and words which are consistently capitalized. For example, `"America" < "Brazil"` is not affected (it's always true), and neither is `"america" < "brazil"` (always true too); since the first letter is consistently capitalized. But, whether `"america" < "Brazil"` (the words are not consistently capitalized) is true or false in base R depends on the locale of your R session. In America it is true by default and false if you i) type `Sys.setlocale(locale="C")`, ii) the R session has been started in a C locale for you which can happen on servers/services (the locale comes from the environment the R session is started in). However, `"america" < "Brazil"` is always, consistently false in data.table which can be a surprise because it differs to base R by default in most regions. It is false because `"B"<"a"` is true because all upper-case letters come first, followed by all lower case letters (the ascii number of each letter determines the order, which is what is meant by `C-locale`).


### Changes in v1.10.4 (on CRAN 01 Feb 2017)

Expand Down
22 changes: 20 additions & 2 deletions man/setorder.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,8 @@ based on the columns (and column order) provided. It reorders the table
\emph{by reference} and is therefore very memory efficient.

Also \code{x[order(.)]} is now optimised internally to use data.table's fast
order by default. data.table always reorders in C-locale. To sort by session
locale, use \code{x[base::order(.)]} instead.
order. data.table always reorders in "C-locale" (see Details). To sort by session
locale, use \code{x[base::order(.)]}.

\code{bit64::integer64} type is also supported for reordering rows of a
\code{data.table}.
Expand Down Expand Up @@ -75,6 +75,24 @@ Only \code{x[order(.)]} can have \code{na.last = NA} as it's a subset operation
as opposed to \code{setorder} or \code{setorderv} which reorders the data.table
by reference.

\code{data.table} always reorders in "C-locale".
As a consequence, the ordering may be different to that obtained by \code{base::order}.
In English locales, for example, sorting is case-sensitive in C-locale.
Thus, sorting \code{c("c", "a", "B")} returns \code{c("B", "a", "c")} in \code{data.table}
but \code{c("a", "B", "c")} in \code{base::order}. Note this makes no difference in most cases
of data; both return identical results on ids where only upper-case or lower-case letters are present (\code{"AB123" < "AC234"}
is true in both), or on country names and other proper nouns which are consistently capitalized.
For example, neither \code{"America" < "Brazil"} nor
\code{"america" < "brazil"} are affected since the first letter is consistently
capitalized.

Using C-locale makes the behaviour of sorting in \code{data.table} more consistent across sessions and locales.
The behaviour of \code{base::order} depends on assumptions about the locale of the R session.
In English locales, \code{"america" < "BRAZIL"} is true by default
but false if you either type \code{Sys.setlocale(locale="C")} or the R session has been started in a C locale
for you -- which can happen on servers/services since the locale comes from the environment the R session
was started in. By contrast, \code{"america" < "BRAZIL"} is always false in \code{data.table} regardless of the way your R session was started.

If \code{setorder} results in reordering of the rows of a keyed \code{data.table},
then it's key will be set to \code{NULL}.
}
Expand Down