If you are viewing this file on CRAN, please check latest news on GitHub where the formatting is also better.
data.table v1.15.99 (in development)
-
`droplevels(in.place=TRUE)` is deprecated in favor of calling `setdroplevels()`, #6014. Given the associated risks/pain points, we strongly prefer that all in-place/by-reference behavior within data.table come from `set*` functions (and `:=`) to make it as clear as possible that inputs are mutable. See below and `?setdroplevels` for more.
-
`[.data.table` is un-exported again. This was exported to support an experimental feature (the `DT()` functional form of `[`) that never made it to release, but we forgot to claw back this export in the NAMESPACE; sorry about that. We didn't find anyone calling the method directly (which is inadvisable to begin with).
-
We continue to use user feedback to prioritize development. See #3189 for the current list of most-requested issues. In this release we add five highly-requested features:
a. Using `dt[, names(.SD) := lapply(.SD, fx)]` now works to update all columns, #795. Thanks to @brodieG for the report, 20 or so others for chiming in, and @ColeMiller1 for the PR.
b. `fread` now supports automatic detection of `dec` (as either `.` or `,`, the latter being common in many places in Europe, Africa, and South America); this behavior is now the default, i.e. `dec='auto'`, #2431. Thanks @mattdowle for the original issue, 50 or more others for expressing support, and @MichaelChirico for the fix.
c. `fcase()` supports scalars in conditions (e.g. supplying just `TRUE`), vectors in `default=` (so the default can vary by row), and `default=` is now lazily evaluated, #4258. Thanks @sindribaldur for the feature request, @shrektan for doing most of the implementation, and @MichaelChirico for sewing things up.
d. `[.data.table` gains argument `showProgress`, allowing users to toggle progress printing for large "by" operations, #3060. The progress bar reports information such as the number of groups processed, total groups, total time elapsed and estimated time until completion. This feature doesn't apply to `GForce` optimized operations. Thanks to @eatonya and @zachmayer for filing FRs, and to everyone else that up-voted/chimed in on the issue. Thanks to @joshhwuu for the PR.
e. `rbindlist(l, use.names=TRUE)` and `rbind` now work correctly on columns with different class attributes across the inputs for certain classes such as `Date`, `IDate`, `ITime`, `POSIXct` and `AsIs` with matched columns of similar classes, e.g., `rbind(data.table(d = Sys.Date()), data.table(d = as.IDate(Sys.Date()-1)))`. The conversion is done automatically and the class attribute of the final column is determined by the first class attribute encountered in the binding list, #5309, #4934, #5391. `rbindlist(l, ignore.attr=TRUE)` and `rbind` also gain argument `ignore.attr` (default `FALSE`) to manually deactivate the safety net preventing binding columns with different column classes, #3911, #5542. Thanks to @dcaseykc, @fox34, @adrian-quintario, @berg-michael, @arunsrinivasan, @statquant, @pkress, @jrausch12, @therosko, @OfekShilon, @iMissile, @tdhock for the request and @ben-schwen for the PR.
-
`print.data.table()` shows empty (`NULL`) list column entries as `[NULL]` for emphasis. Previously they would just print nothing (same as for empty string). Part of #4198. Thanks @sritchie73 for the proposal and fix.
```r
data.table(a=list(NULL, ""))
#         a
#    <list>
# 1: [NULL]
# 2:
```
-
`cedta()` now returns `FALSE` if `.datatable.aware = FALSE` is set in the calling environment, #5654. Thanks @dvg-p4 for the request and PR.
-
The `split()` method for `data.table`s is more consistent with that for base methods:
a. `f` can be a formula, #5392, mirroring the same in `base::split.data.frame` since R 4.1.0 (May 2021). Thanks to @XiangyunHuang for the request, and @ben-schwen for the PR.
b. `sep=` is recognized when splitting with `by=`, just like the default and data.frame methods, #5417. Thanks @MichaelChirico for the request and PR.
-
Namespace-qualifying `data.table::shift()`, `data.table::first()`, or `data.table::last()` will not deactivate GForce, #5942. Thanks @MichaelChirico for the proposal and fix. Namespace-qualifying other calls like `stats::sum()`, `base::prod()`, etc., continue to work as an escape valve to avoid GForce, e.g. to ensure S3 method dispatch.
-
`transpose` gains `list.cols=` argument (default `FALSE`), #5639. Use this to return output with list columns and avoid type promotion (an exception is `factor` columns which are promoted to `character` for consistency between `list.cols=TRUE` and `list.cols=FALSE`). This is convenient for creating a row-major representation of a table. Thanks to @MLopez-Ibanez for the request, and @ben-schwen for the PR.
-
`fread`'s `fill` argument now also accepts an `integer` in addition to boolean values, giving an upper bound on the number of columns in the file. `fread` always guesses the number of columns based on reading a sample of rows in the file. When `fill=TRUE`, `fread` stops reading and ignores subsequent rows when this estimate winds up too low, e.g. when the sampled rows happen to exclude some rows that are even wider, #2691, #4130, #3436, #1812 and #5378. The suggestion for `fill` to allow a manual estimate of the number of columns instead comes from #2727. Using `fill=Inf` reads the full file for estimating the number of columns. Thanks to @jangorecki, @christellacaze, @Yiguan, @alexdthomas, @ibombonato, @Befrancesco, @TobiasGold for reporting/requesting, and @ben-schwen for the PR.
-
Computations in `j` can return a matrix or array if it is one-dimensional, e.g. a row or column vector, when `j` is a list of columns during grouping, #783. Previously a matrix could be provided in `DT[, expr, by]` form, but not in `DT[, list(expr), by]` form; this resolves that inconsistency. It is still an error to return a "true" array, e.g. a `2x3` matrix.
-
`measure` now supports user-specified `cols` argument, which can be useful to specify a subset of columns to `melt`, without having to use a regex, #5063. Thanks to @UweBlock and @Henrik-P for reporting, and @tdhock for the PR.
-
`setDT` is faster for data with many columns, thanks @MichaelChirico for reporting and fixing the issue, #5426.
-
`dcast` gains `value.var.in.dots`, `value.var.in.LHSdots` and `value.var.in.RHSdots` arguments, #5824. This allows the `value.var` variable(s) in `dcast` to be represented by `...` in the formula (if not otherwise mentioned). Thanks to @iago-pssjd for the report and PR.
-
`fread` loads `.bgz` files directly, #5461. Thanks to @TMRHarrison for the request with proposed fix, and @ben-schwen for the PR.
-
New `setdroplevels()` as a by-reference version of the `droplevels()` method (which returns a copy of its input), #6014. Thanks @MichaelChirico for the suggestion and implementation.
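As a rough sketch of two of the features above, the following updates every column through `names(.SD)` and then drops unused factor levels by reference. It assumes a data.table build that already includes these changes (this development version or later); the tables and column names are invented for illustration.

```r
library(data.table)

# Hypothetical example table; names are illustrative only
DT = data.table(a = 1:3, b = 4:6)

# Update all columns in place via names(.SD) on the left-hand side of := (#795)
DT[, names(.SD) := lapply(.SD, function(x) x * 2L)]

# A factor column whose subset still carries the unused level "c"
DF = data.table(f = factor(c("a", "b", "c")))[1:2]
lv_before = levels(DF$f)

# droplevels() returns a modified copy; setdroplevels() modifies DF itself
DF_copy = droplevels(DF)
setdroplevels(DF)
lv_after = levels(DF$f)
```

The `set` prefix is the signal that `DF` is changed in place, which is the convention the deprecation of `droplevels(in.place=TRUE)` is meant to reinforce.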
-
`unique()` returns a copy in the case when `nrow(x) <= 1` instead of a mutable alias, #5932. This is consistent with existing `unique()` behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix.
-
`dcast` handles coercion of `fill` to `integer64` correctly, #4561. Thanks to @emallickhossain for the bug report and @MichaelChirico for the fix.
-
Optimized `shift` per group produced wrong results when simultaneously subsetting, for example, `DT[i==1L, shift(x), by=group]`, #5962. Thanks to @renkun-ken for the report and @ben-schwen for the fix.
-
`dcast(fill=NULL)` only computes default fill value if necessary, which eliminates some previous warnings which were potentially confusing (for example, when `fun.aggregate=min` or `max`, the warning was "NAs introduced by coercion to integer range"), #5512, #5390. Thanks to @tdhock for the report and fix.
-
`fwrite(x, row.names=TRUE)` with `x` a `matrix` writes `row.names` when present, not row numbers, #5315. Thanks to @Liripo for the report, and @ben-schwen for the fix.
-
`patterns()` helper for `.SDcols` now accepts arguments `ignore.case`, `perl`, `fixed`, and `useBytes`, which are passed to `grep`, #5387. Thanks to @iago-pssjd for the feature request, and @tdhock for the implementation.
-
Adding a list column to an empty `data.table` works consistently with other column types, #5738. Thanks to Benjamin Schwendinger for the report and the fix.
-
In `DT[,j,by]`, `by` retains its attributes (e.g. class) when `j` is GForce optimized, #5567. Thanks to @danwwilson for the report, and @ben-schwen for the PR.
-
`dt[,,by=año]` (i.e., using a column name containing a non-ASCII character in `by` as a plain symbol) no longer errors with "object 'año' not found", #4708. Thanks @pfv07 for the report, and @MichaelChirico for the fix.
-
Fixed some memory management issues in the C routines backing `melt()`, `froll()`, and GForce `mean()`, as identified by `rchk`. Thanks Tomas Kalibera and the CRAN team for setting up the `rchk` system, and @MichaelChirico for the fix.
-
data.table's `all.equal()` method now dispatches to each column's own `all.equal()` method as appropriate, #4543. Thanks @MichaelChirico for the report and fix. Note that this had two noteworthy changes to data.table's own test suite that might affect you: (1) comparisons of POSIXct columns compare absolute, not relative differences, meaning that millisecond-scale differences might trigger a "not equal" report that was hidden before; and (2) comparisons of integer64 columns could be totally wrong since they were being compared on the basis of their representation as doubles, not long integers. The former might be a matter of preference requiring you to specify a different `tolerance=`, while the latter was clearly a bug.
-
`rbindlist` and `shift` could lead to a protection stack overflow when applied to a list containing many nested lists exceeding the pointer protection stack size, #4536. Thanks to @ProfFancyPants for reporting, and Benjamin Schwendinger (`rbindlist`) and @MichaelChirico (`shift`) for the fix.
-
`fread(x, colClasses="POSIXct")` now also works for columns containing only `NA` values, #6208. Thanks to @markus-schaffer for the report, and @ben-schwen for the fix.
-
`fread()` is more careful about detecting that a file is compressed in bzip2 format, #6304. In particular, we also check the 4th byte of the file is a digit; in rare cases, a legitimate uncompressed CSV file could match 'BZh' as the first 3 bytes. We think an uncompressed CSV file matching 'BZh[1-9]' is all the more rare and unlikely to be encountered in "real" examples. Other formats (zip, gzip) are friendly enough to use non-printable characters in their magic numbers. Thanks @grainnemcguire for the report and @MichaelChirico for the fix.
-
Selecting the key column like `DT[, .(key1, key2)]` will retain the key without a performance penalty, #4498. Thanks to @user9439449 on StackOverflow for the report and @MichaelChirico for the fix.
-
Passing functions programmatically with `env=` no longer produces an opaque error, e.g. `DT[, f(b), env = list(f=sum)]`, #6026. Note that it's much better to pass functions by name, like `f="sum"`, instead. Thanks to @MichaelChirico for the bug report and fix.
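A minimal sketch of the recommended form, passing the function by name through `env=`. It assumes data.table 1.15.0 or later (where `env=` is available); the table and the placeholder `f` are illustrative.

```r
library(data.table)

# Hypothetical one-column table
DT = data.table(b = 1:3)

# Preferred: name the function as a string; it is substituted for `f` in j
res_name = DT[, f(b), env = list(f = "sum")]

# res_name holds the same value as DT[, sum(b)]
```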
-
`transform` method for data.table is sped up substantially when creating new columns on large tables. Thanks to @OfekShilon for the report and PR. The implemented solution was proposed by @ColeMiller1.
-
The documentation for the `fill` argument in `rbind()` and `rbindlist()` now notes the expected behaviour for missing `list` columns when `fill=TRUE`, namely to use `NULL` (not `NA`), #4198. Thanks @sritchie73 for the proposal and fix.
-
data.table now depends on R 3.3.0 (2016) instead of 3.1.0 (2014). Recent versions of R have good features that we would gradually like to incorporate, and we see next to no usage of these very old versions of R. We originally attempted to bump only to R 3.2.0 in this release, but {knitr} requiring 3.3.0 and `R CMD check` lacking an `--ignore-vignettes` option until 3.3.0 essentially forced our hands.
-
Erroneous assignment calls in `[` with a trailing comma (e.g. ``DT[, `:=`(a = 1, b = 2,)]``) get a friendlier error since this situation is common during refactoring and easy to miss visually. Thanks @MichaelChirico for the fix.
-
Input files are now kept open during `mmap()` when running under Emscripten, emscripten-core/emscripten#20459. This avoids an error in `fread()` when running in WebAssembly, #5969. Thanks to @maek-ies for the report and @georgestagg for the PR.
-
`dcast()` improves behavior for the situation that the `fun.aggregate` value of `length()` is used but not provided by the user.
a. This now triggers a warning, not a message, since relying on this default often signals unexpected duplicates in the data, #5386. The warning is classed as `dt_missing_fun_aggregate_warning`, allowing for more targeted handling in user code. Thanks @MichaelChirico for the suggestion and @Nj221102 for the fix.
b. The warning itself does a better job of explaining the behavior and suggesting alternatives, #5217. Thanks @MichaelChirico for the suggestion and @Nj221102 for the fix.
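The sketch below shows two ways user code might respond to this change: supply `fun.aggregate` explicitly, or handle the `dt_missing_fun_aggregate_warning` class. It assumes an installed data.table that emits the classed warning described above (on older versions the handler is simply never triggered); the data are made up.

```r
library(data.table)

# Hypothetical data with a duplicated (id, k) combination
DT = data.table(id = c(1, 1, 2), k = c("a", "a", "b"), v = 1:3)

# Option 1: state the aggregation explicitly, so no default is assumed
wide_explicit = dcast(DT, id ~ k, value.var = "v", fun.aggregate = length)

# Option 2: rely on the default but handle the classed warning specifically
seen = FALSE
wide_default = withCallingHandlers(
  dcast(DT, id ~ k, value.var = "v"),
  dt_missing_fun_aggregate_warning = function(w) {
    seen <<- TRUE                    # record that the default was used
    invokeRestart("muffleWarning")   # silence only this class of warning
  }
)
```

Option 1 is usually the clearer choice, since it documents the intended aggregation; Option 2 is mainly useful when the call lives in code you cannot easily change.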
-
Updated a test relying on operator `>` working for comparing language objects to a string, which will be deprecated by R, #5977; no user-facing effect. Thanks to R-core for continuously improving the language.
-
OpenMP detection when building from source on Mac is improved, #4348. Thanks @jameshester and @kevinushey for the request and @kevinushey for the PR, @jameslamb for the advice and @s-u of R-core for ensuring CRAN machines are configured to support the expected setup.
-
`print.data.table`:
a. Now handles combination multibyte characters correctly when truncating wide string entries, #5096. Thanks to @MichaelChirico for the report and @joshhwuu for the fix.
b. Prints the indicator `---` in every value column when truncation is needed and `row.names = FALSE` instead of adding a blank column where the `rownames` would have been just to include `---`, #4083. Thanks @MichaelChirico for the report and @joshhwuu for the fix.
c. Honors `na.print`, as seen in `print.default`, allowing for string replacement of `NA` values when printing. Thanks @HughParsonage for the report and @joshhwuu for the fix.
d. Gains new argument `show.indices` and option `datatable.show.indices` that allows the user to print a `data.table`'s indices as columns without having to modify the `data.table` itself. Thanks @MichaelChirico for the report and @joshhwuu for the PR.
e. Displays `integer64` columns well even if {bit64} has not yet been loaded, #6224. Thanks @renkun-ken for the report and @MichaelChirico for the fix.
-
`test.data.table()` runs robustly:
- In sessions where the `digits` or `warn` options are not their defaults (`7` and `0`, respectively), #5285. Thanks @OfekShilon for the report and suggested fix and @MichaelChirico for the PR.
- In locales where `letters != sort(letters)`, e.g. Latvian, #3502. Thanks @minemR for the report and @MichaelChirico for the fix.
- Initialises the numeric rounding value to 0 using `setNumericRounding(0)` to avoid failed tests if the user has set a different value, #6082. The user's value is restored on exit. Thanks to @MichaelChirico for the report and for describing the solution, and @markseeto for implementing. To enable this, `setNumericRounding()` now invisibly returns the old rounding value instead of `NULL`, which is consistent with similar behavior by `setwd()`, `options()`, etc. Thanks @MichaelChirico for the report and @joshhwuu for the fix.
-
The `measure` and `patterns` functions are now exported for use within `[` and `melt()` to ensure consistency with other non-standard evaluation (NSE) exports like `.N` and `:=`. This change addresses #5604, allowing package developers to import these names and avoid `R CMD check` `NOTE`s about undefined variables. Thanks to @MichaelChirico and @ylelkes for their suggestions, and to @Nj221102 for the implementation.
We plan to export similar placeholders for `.` and `J` in roughly one year (e.g. data.table 1.18.0), but excluded them from this release to avoid back-compatibility issues. Specifically, some packages doing `import(plyr)` and `import(data.table)`, and/or with those packages in `Depends`, will error when data.table starts exporting `.` (and similarly for a potential conflict with `rJava::J()`). We discourage using data.table (or any package, really) in Depends; blanket `import()` of a package is also generally best avoided. See `vignette("datatable-importing")`.
-
`fwrite()` header names are no longer quoted automatically when the `na` argument is given, #2964. Thanks @jangorecki for the report and @joshhwuu for the fix.
-
Removed a warning about the now totally-obsolete option `datatable.CJ.names`, as discussed in previous releases.
-
Refactored some non-API calls in the package C code, #6180. There should be no user-visible change. Thanks to various R users, R core, and especially Luke Tierney for pushing to have a clearer definition of "API" for R and for offering clear documentation and suggested workarounds. Thanks @MichaelChirico and @TysonStanley for implementing changes for this release; more will follow.
-
C code is more unified in how failures to allocate memory (`malloc()`/`calloc()`) are handled, #1115. No OOM issues were reported, as these regions of code typically request relatively small blocks of memory, but it is good to handle memory pressure consistently. Thanks @elfring for the report and @MichaelChirico for the clean-up effort and future-proofing linter.
-
The internal routine for finding sort order (`forder`) will now re-use any existing index. A similar optimization was already present in R code, but this has now been pushed to C and covers a wider range of use cases and collects more statistics about its input (e.g. whether any infinite entries were found), opening the possibility for more optimizations in other functions.
Functions `setindex` (and `setindexv`) will now compute groups' positions as well. `setindex()` also collects the extra statistics alluded to above.
Finding sort order in other routines (for example subset `d2[id==1L]`) does not include those extra statistics so as not to impose a slowdown.
```r
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1
# - attr(*, "anyna")= int 0
# - attr(*, "anyinfnan")= int 0
# - attr(*, "anynotascii")= int 0
# - attr(*, "anynotutf8")= int 0

d2 = data.table(id=2:1, v2=1:2)
invisible(d2[id==1L])
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
```
This feature also enables re-use of sort index during joins, in cases where one of the calls to find sort order is made from C code.
```r
d1 = data.table(id=1:2, v1=1:2)
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
d1[d2, on="id", verbose=TRUE]
#...
#Starting bmerge ...
#forderReuseSorting: using existing index: __id
#forderReuseSorting: opt=2, took 0.000s
#...
```
This feature resolves #4387, #2947, #4380, and #1321. Thanks to @jangorecki, @jan-glx, and @MichaelChirico for the reports and @jangorecki for implementing.
-
`set()` now adds new columns even if no rows are updated, #5409. This behavior is now consistent with `:=`. Thanks to @mb706 for the report and @joshhwuu for the fix.
-
The internal `init()` function in the `fread.c` module has been marked as `static`, #6328. This is to avoid name collisions, and the resulting segfaults, with other libraries that might expose the same symbol name and already be loaded by the R process. This was observed in Cray HPE environments where the `libsci` library providing LAPACK to R already has an `init` symbol. Thanks to @rtobar for the report and fix.
-
`?melt` has long documented that the returned `variable` column should contain integer column indices when `measure.vars` is a list, but when the list length is 1, `variable` is actually a character column name, which is inconsistent with the documentation, #5209. To increase consistency in the next release, we plan to change `variable` to integer, so users who were relying on this behavior should change `measure.vars=list("col_name")` (output `variable` is column name, will be column index/integer) to `measure.vars="col_name"` (`variable` is column name before and after the planned change). For now, relying on this undocumented behavior throws a new warning.
-
`dcast()` docs have always required `fun.aggregate` to return a single value, and when `fill=NULL`, `dcast` would indeed error if a vector with `length!=1` was returned, but silently return an undefined result when fill is not `NULL`. Now `dcast` will additionally warn that this is undefined behavior, when fill is not `NULL`, #6032. In particular, this will warn for `fun.aggregate=identity`, which was observed in several revdeps. We may change this to an error in a future release, so revdeps should fix their code as soon as possible. Thanks to Toby Dylan Hocking for the PR, and Michael Chirico for analysis of GitHub revdeps.
-
Fix a typo in a Mandarin translation of an error message that was hiding the actual error message, #6172. Thanks @trafficfan for the report and @MichaelChirico for the fix.
-
data.table is now translated into Brazilian Portuguese (`pt_BR`) and Spanish (`es`) as well as Mandarin (`zh_CN`). Thanks to the new translation teams consisting initially of @rffontenelle, @leofontenelle, and @italo-07 for Portuguese; and @rikivillalba, @rivaquiroga, and @MaraDestefanis for Spanish. The teams are open if you'd also like to join and support maintenance of these translations.
-
A more helpful error message for using `:=` inside the first argument (`i`) of `[.data.table` is now available in translation, #6293. Previously, the code to display this assumed an earlier message was printed in English. The solution is for calling `:=` directly (i.e., outside the second argument `j` of `[.data.table`) to throw an error of class `dt_invalid_let_error`. Thanks to Spanish translator @rikivillalba for spotting the issue and @MichaelChirico for the fix.
data.table v1.15.4 (27 March 2024)
- Optimized `shift` per group produced wrong results when simultaneously subsetting, for example, `DT[i==1L, shift(x), by=group]`, #5962. Thanks to @renkun-ken for the report and Benjamin Schwendinger for the fix.
- Updated a test relying on `>` working for comparing language objects to a string, which will be deprecated by R, #5977; no user-facing effect. Thanks to R-core for continuously improving the language.
data.table v1.15.2 (27 Feb 2024)
-
An error in `fwrite()` is more robust across platforms -- CRAN found the use of `PRId64` does not always match the output of `xlength()`, e.g. on some Mac M1 builds, #5935. Thanks CRAN for identifying the issue and @ben-schwen for the fix.
-
`shift()` of a vector in grouped queries (under GForce) returns a vector, consistent with `shift()` in other contexts, #5939. Thanks @shrektan for the report and @MichaelChirico for the fix.
data.table v1.15.0 (30 Jan 2024)
-
`shift` and `nafill` will now raise error `input must not be matrix or array` when `matrix` or `array` is provided on input, rather than giving a useless result, #5287. Thanks to @ethanbsmith for reporting.
-
`nafill()` now applies `fill=` to the front/back of the vector when `type="locf|nocb"`, #3594. Thanks to @ben519 for the feature request. It also now returns a named object based on the input names. Note that if you are considering joining and then using `nafill(...,type='locf|nocb')` afterwards, please review `roll=`/`rollends=` which should achieve the same result in one step more efficiently. `nafill()` is for when filling-while-joining (i.e. `roll=`/`rollends=`/`nomatch=`) cannot be applied.
-
`mean(na.rm=TRUE)` by group is now GForce optimized, #4849. Thanks to the h2oai/db-benchmark project for spotting this issue. The 1 billion row example in the issue shows 48s reduced to 14s. The optimization also applies to type `integer64` resulting in a difference to the `bit64::mean.integer64` method: `data.table` returns a `double` result whereas `bit64` rounds the mean to the nearest integer.
-
`fwrite()` now writes UTF-8 or native csv files by specifying the `encoding=` argument, #1770. Thanks to @shrektan for the request and the PR.
-
`data.table()` no longer fills empty vectors with `NA` with warning. Instead a 0-row `data.table` is returned, #3727. Since `data.table()` is used internally by `.()`, this brings the following examples in line with expectations in most cases. Thanks to @shrektan for the suggestion and PR.
```r
DT = data.table(A=1:3, B=letters[1:3])
DT[A>3, .(ITEM='A>3', A, B)]    # (1)
DT[A>3][, .(ITEM='A>3', A, B)]  # (2)
# the above are now equivalent as expected and return:
# Empty data.table (0 rows and 3 cols): ITEM,A,B
# Previously, (2) returned :
#      ITEM     A      B
#    <char> <int> <char>
# 1:    A>3    NA   <NA>
# Warning messages:
# 1: In as.data.table.list(jval, .named = NULL) :
#   Item 2 has 0 rows but longest item has 1; filled with NA
# 2: In as.data.table.list(jval, .named = NULL) :
#   Item 3 has 0 rows but longest item has 1; filled with NA
```
```r
DT = data.table(A=1:3, B=letters[1:3], key="A")
DT[.(1:3, double()), B]
# new result :
# character(0)
# old result :
# [1] "a" "b" "c"
# Warning message:
# In as.data.table.list(i) :
#   Item 2 has 0 rows but longest item has 3; filled with NA
```
-
`%like%` on factors with a large number of levels is now faster, #4748. The example in the PR shows 2.37s reduced to 0.86s on a factor length 100 million containing 1 million unique 10-character strings. Thanks to @statquant for reporting, and @shrektan for implementing.
-
`keyby=` now accepts `TRUE`/`FALSE` together with `by=`, #4307. The primary motivation is benchmarking where `by=` vs `keyby=` is varied across a set of queries. Thanks to Jan Gorecki for the request and the PR.
```r
DT[, sum(colB), keyby="colA"]
DT[, sum(colB), by="colA", keyby=TRUE]   # same
```
-
`fwrite()` gains a new `datatable.fwrite.sep` option to change the default separator, still `","` by default. Thanks to Tony Fischetti for the PR. As is good practice in R in general, we usually resist new global options for the reason that a user changing the option for their own code can inadvertently change the behaviour of any package using `data.table` too. However, in this case, the global option affects file output rather than code behaviour. In fact, the very reason the user may wish to change the default separator is that they know a different separator is more appropriate for their data being passed to the package using `fwrite` but cannot otherwise change the `fwrite` call within that package.
-
`melt()` now supports `NA` entries when specifying a list of `measure.vars`, which translate into runs of missing values in the output. Useful for melting wide data with some missing columns, #4027. Thanks to @vspinu for reporting, and @tdhock for implementing.
-
`melt()` now supports multiple output variable columns via the `variable_table` attribute of `measure.vars`, #3396 #2575 #2551, #4998. It should be a `data.table` with one row that describes each element of the `measure.vars` vector(s). These data/columns are copied to the output instead of the usual variable column. This is backwards compatible since the previous behavior (one output variable column) is used when there is no `variable_table`. New functions `measure()` and `measurev()` use either a separator or a regex to create a `measure.vars` list/vector with a `variable_table` attribute; they are useful for melting data that has several distinct pieces of information encoded in each column name. See new `?measure` and new section in reshape vignette. Thanks to Matthias Gomolka, Ananda Mahto, Hugh Parsonage, Mark Fairbanks for reporting, and to Toby Dylan Hocking for implementing. Thanks to @keatingw for testing before release, requesting `measure()` accept single groups too #5065, and Toby for implementing.
-
A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's `substitute`-like interface via a new `env` argument to `[.data.table`. For details see the new vignette programming on data.table, and the new `?substitute2` manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.
```r
DT = data.table(x = 1:5, y = 5:1)

# parameters
in_col_name = "x"
fun = "sum"
fun_arg1 = "na.rm"
fun_arg1val = TRUE
out_col_name = "sum_x"

# parameterized query
#DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val))]

# desired query
DT[, .(sum_x = sum(x, na.rm=TRUE))]

# new interface
DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val)),
  env = list(
    in_col_name = "x",
    fun = "sum",
    fun_arg1 = "na.rm",
    fun_arg1val = TRUE,
    out_col_name = "sum_x"
  )]
```
-
`DT[, if (...) .(a=1L) else .(a=1L, b=2L), by=group]` now returns a 1-column result with warning `j may not evaluate to the same number of columns for each group`, rather than error `'names' attribute [2] must be the same length as the vector`, #4274. Thanks to @robitalec for reporting, and Michael Chirico for the PR.
-
Typo checking in `i` available since 1.11.4 is extended to work in non-English sessions, #4989. Thanks to Michael Chirico for the PR.
-
`fifelse()` now coerces logical `NA` to other types and the `na` argument supports vectorized input, #4277 #4286 #4287. Thanks to @michaelchirico and @shrektan for reporting, and @shrektan for implementing.
-
`.datatable.aware` is now recognized in the calling environment in addition to the namespace of the calling package, dtplyr#184. Thanks to Hadley Wickham for the idea and PR.
-
New convenience function `%plike%` maps to `like(..., perl=TRUE)`, #3702. `%plike%` uses Perl-compatible regular expressions (PCRE) which extend TRE, and may be more efficient in some cases. Thanks @KyleHaynes for the suggestion and PR.
-
`fwrite()` now accepts `sep=""`, #4817. The motivation is an example where the result of `paste0()` needs to be written to file but `paste0()` takes 40 minutes due to constructing a very large number of unique long strings in R's global character cache. Allowing `fwrite(, sep="")` avoids the `paste0` and saves 40 mins. Thanks to Jan Gorecki for the request, and Ben Schwen for the PR.
-
`data.table` printing now supports customizable methods for both columns and list column row items, part of #1523. `format_col` is S3-generic for customizing how to print whole columns and by default defers to the S3 `format` method for the column's class if one exists; e.g. `format.sfc` for geometry columns from the `sf` package, #2273. Similarly, `format_list_item` is S3-generic for customizing how to print each row of list columns (which lack a format method at a column level) and also by default defers to the S3 `format` method for that item's class if one exists. Thanks to @mllg who initially filed #3338 with the seed of the idea, @franknarf1 who earlier suggested the idea of providing custom formatters, @fparages who submitted a patch to improve the printing of timezones for #2842, @RichardRedding for pointing out an error relating to printing wide `expression` columns in #3011, @JoshOBrien for improving the output for geometry columns, and @MichaelChirico for implementing. See `?print.data.table` for examples.
-
`tstrsplit(,type.convert=)` now accepts a named list of functions to apply to each part, #5094. Thanks to @Kamgang-B for the request and implementing.
-
`as.data.table(DF, keep.rownames=key='keyCol')` now works, #4468. Thanks to Michael Chirico for the idea and the PR.
-
`dcast()` now supports complex values in `value.var`, #4855. This extends earlier support for complex values in `formula`. Thanks Elio Campitelli for the request, and Michael Chirico for the PR.
-
`melt()` was pseudo generic in that `melt(DT)` would dispatch to the `melt.data.table` method but `melt(not-DT)` would explicitly redirect to `reshape2`. Now `melt()` is standard generic so that methods can be developed in other packages, #4864. Thanks to @odelmarcelle for suggesting and implementing.
-
`DT[i, nomatch=NULL]` where `i` contains row numbers now excludes `NA` and any outside the range [1,nrow], #3109 #3666. Before, `NA` rows were returned always for such values; i.e. `nomatch=0|NULL` was ignored. Thanks Michel Lang and Hadley Wickham for the requests, and Jan Gorecki for the PR. Using `nomatch=0` in this case when `i` is row numbers generates the warning `Please use nomatch=NULL instead of nomatch=0; see news item 5 in v1.12.0 (Jan 2019)`.
```r
DT = data.table(A=1:3)
DT[c(1L, NA, 3L, 5L)]  # default nomatch=NA
#        A
#    <int>
# 1:     1
# 2:    NA
# 3:     3
# 4:    NA
DT[c(1L, NA, 3L, 5L), nomatch=NULL]
#        A
#    <int>
# 1:     1
# 2:     3
```
-
`DT[, head(.SD,n), by=grp]` and `tail` are now optimized when `n>1`, #5060 #523. `n==1` was already optimized. Thanks to Jan Gorecki and Michael Young for requesting, and Benjamin Schwendinger for the PR. -
`setcolorder()` gains `before=` and `after=`, #4358. Thanks to Matthias Gomolka for the request, and both Benjamin Schwendinger and Xianghui Dong for implementing. Also thanks to Manuel López-Ibáñez for testing dev and mentioning needed documentation before release. -
`base::droplevels()` gains a fast method for `data.table`, #647. Thanks to Steve Lianoglou for requesting, Boniface Kamgang and Martin Binder for testing, and Jan Gorecki and Benjamin Schwendinger for the PR. `fdroplevels()` for use on vectors has also been added. -
`shift()` now also supports `type="cyclic"`, #4451. Elements that `type="lag"` or `type="lead"` would normally push out are instead re-introduced at the first/last positions. Thanks to @RicoDiel for requesting, and Benjamin Schwendinger for the PR.

    # Usage
    shift(1:5, n=-1:1, type="cyclic")
    # [[1]]
    # [1] 2 3 4 5 1
    #
    # [[2]]
    # [1] 1 2 3 4 5
    #
    # [[3]]
    # [1] 5 1 2 3 4

    # Benchmark
    x = sample(1e9) # 3.7 GB
    microbenchmark::microbenchmark(
      shift(x, 1, type="cyclic"),
      c(tail(x, 1), head(x,-1)),
      times = 10L,
      unit = "s"
    )
    # Unit: seconds
    #                          expr  min   lq mean median   uq  max neval
    #  shift(x, 1, type = "cyclic") 1.57 1.67 1.71   1.68 1.70 2.03    10
    #    c(tail(x, 1), head(x, -1)) 6.96 7.16 7.49   7.32 7.64 8.60    10
-
`fread()` now supports "0" and "1" in `na.strings`, #2927. Previously this was not permitted since "0" and "1" can be recognized as boolean values. Note that it is still not permitted to use "0" and "1" in `na.strings` in combination with `logical01 = TRUE`. Thanks to @msgoussi for the request, and Benjamin Schwendinger for the PR. -
`setkey()` now supports type `raw` as value columns (not as key columns), #5100. Thanks Hugh Parsonage for requesting, and Benjamin Schwendinger for the PR. -
`shift()` is now optimized by group, #1534. Thanks to Gerhard Nachtmann for requesting, and Benjamin Schwendinger for the PR. Thanks to @neovom for testing dev and filing a bug report, #5547, which was fixed before release. This also helped improve the logic for when to turn on optimization by group in general, making it more robust.

    N = 1e7
    DT = data.table(x=sample(N), y=sample(1e6,N,TRUE))
    shift_no_opt = shift  # different name not optimized as a way to compare
    microbenchmark(
      DT[, c(NA, head(x,-1)), y],
      DT[, shift_no_opt(x, 1, type="lag"), y],
      DT[, shift(x, 1, type="lag"), y],
      times=10L, unit="s")
    # Unit: seconds
    #                                       expr     min      lq    mean  median      uq     max neval
    #                DT[, c(NA, head(x, -1)), y]  8.7620  9.0240  9.1870  9.2800  9.3700  9.4110    10
    #  DT[, shift_no_opt(x, 1, type = "lag"), y] 20.5500 20.9000 21.1600 21.3200 21.4400 21.5200    10
    #         DT[, shift(x, 1, type = "lag"), y]  0.4865  0.5238  0.5463  0.5446  0.5725  0.5982    10

Example from Stack Overflow:

    set.seed(1)
    mg = data.table(expand.grid(year=2012:2016, id=1:1000),
                    value=rnorm(5000))
    microbenchmark(v1.9.4  = mg[, c(value[-1], NA), by=id],
                   v1.9.6  = mg[, shift_no_opt(value, n=1, type="lead"), by=id],
                   v1.14.4 = mg[, shift(value, n=1, type="lead"), by=id],
                   unit="ms")
    # Unit: milliseconds
    #     expr     min      lq    mean  median      uq    max neval
    #   v1.9.4  3.6600  3.8250  4.4930  4.1720  4.9490 11.700   100
    #   v1.9.6 18.5400 19.1800 21.5100 20.6900 23.4200 29.040   100
    #  v1.14.4  0.4826  0.5586  0.6586  0.6329  0.7348  1.318   100
-
`rbind()` and `rbindlist()` now support `fill=TRUE` with `use.names=FALSE` instead of issuing the warning `use.names= cannot be FALSE when fill is TRUE. Setting use.names=TRUE.`, #5444. Thanks to @sindribaldur, @dcaseykc, @fox34, @adrian-quintario and @berg-michael for testing dev and filing a bug report which was fixed before release.

    DT1
    #        A     B
    #    <int> <int>
    # 1:     1     5
    # 2:     2     6

    DT2
    #      foo
    #    <int>
    # 1:     3
    # 2:     4

    rbind(DT1, DT2, fill=TRUE)   # no change
    #        A     B   foo
    #    <int> <int> <int>
    # 1:     1     5    NA
    # 2:     2     6    NA
    # 3:    NA    NA     3
    # 4:    NA    NA     4

    rbind(DT1, DT2, fill=TRUE, use.names=FALSE)

    # was:
    #        A     B   foo
    #    <int> <int> <int>
    # 1:     1     5    NA
    # 2:     2     6    NA
    # 3:    NA    NA     3
    # 4:    NA    NA     4
    # Warning message:
    # In rbindlist(l, use.names, fill, idcol) :
    #   use.names= cannot be FALSE when fill is TRUE. Setting use.names=TRUE.

    # now:
    #        A     B
    #    <int> <int>
    # 1:     1     5
    # 2:     2     6
    # 3:     3    NA
    # 4:     4    NA
-
`fread()` already made a good guess as to whether column names are present by comparing the type of the fields in row 1 to the type of the fields in the sample. This guess is now improved when a column contains a string in row 1 (i.e. a potential column name) but is all blank in the sample rows, #2526. Thanks @st-pasha for reporting, and @ben-schwen for the PR. -
`fread()` can now read `.zip` and `.tar` directly, #3834. Moreover, if a compressed file name is missing its extension, `fread()` now attempts to infer the correct filetype from its magic bytes. Thanks to Michael Chirico for the idea, and Benjamin Schwendinger for the PR. -
`DT[, let(...)]` is a new alias for the functional form of `:=`; i.e. `DT[, ':='(...)]`, #3795. Thanks to Elio Campitelli for requesting, and Benjamin Schwendinger for the PR.

    DT = data.table(A=1:2)
    DT[, let(B=3:4, C=letters[1:2])]
    DT
    #        A     B      C
    #    <int> <int> <char>
    # 1:     1     3      a
    # 2:     2     4      b
-
`weighted.mean()` is now optimized by group, #3977. Thanks to @renkun-ken for requesting, and Benjamin Schwendinger for the PR. -
`as.xts.data.table()` now supports non-numeric xts coredata matrixes, #5268. Existing numeric only functionality is supported by a new `numeric.only` parameter, which defaults to `TRUE` for backward compatibility and the most common use case. To convert non-numeric columns, set this parameter to `FALSE`. Conversions of `data.table` columns to a `matrix` now use `data.table::as.matrix`, with all its performance benefits. Thanks to @ethanbsmith for the report and fix. -
`unique.data.table()` gains `cols` to specify a subset of columns to include in the resulting `data.table`, #5243. This saves the memory overhead of subsetting unneeded columns, and provides a cleaner API for a common operation previously needing more convoluted code. Thanks to @MichaelChirico for the suggestion & implementation. -
`:=` is now optimized by group, #1414. Thanks to Arun Srinivasan for suggesting, and Benjamin Schwendinger for the PR. Thanks to @clerousset, @dcaseykc, @OfekShilon, @SeanShao98, and @ben519 for testing dev and filing detailed bug reports which were fixed before release and their tests added to the test suite. -
`.I` is now available in `by` for rowwise operations, #1732. Thanks to Rafael H. M. Pereira for requesting, and Benjamin Schwendinger for the PR.

    DT
    #       V1    V2
    #    <int> <int>
    # 1:     3     5
    # 2:     4     6

    DT[, sum(.SD), by=.I]
    #        I    V1
    #    <int> <int>
    # 1:     1     8
    # 2:     2    10
-
New functions `yearmon()` and `yearqtr()` give a combined representation of `year()` and `month()`/`quarter()`. These and also `yday`, `wday`, `mday`, `week`, `month` and `year` are now optimized for memory and compute efficiency by removing the `POSIXlt` dependency, #649. Thanks to Matt Dowle for the request, and Benjamin Schwendinger for the PR. Thanks to @berg-michael for testing dev and filing a bug report for a special case of missing values which was fixed before release. -
New function `%notin%` provides a convenient alternative to `!(x %in% y)`, #4152. Thanks to Jan Gorecki for suggesting and Michael Czekanski for the PR. `%notin%` uses half the memory because it computes the result directly as opposed to `!` which allocates a new vector to hold the negated result. If `x` is long enough to occupy more than half the remaining free memory, this can make the difference between the operation working, or failing with an out-of-memory error. -
`tables()` is faster by default by excluding the size of character strings in R's global cache (which may be shared) and excluding the size of list column items (which also may be shared). `mb=` now accepts any function which accepts a `data.table` and returns a higher and better estimate of its size in bytes, albeit more slowly; e.g. `mb = utils::object.size`.
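As a rough illustration of the `mb=` argument described above, the sketch below compares the default size estimate with one computed by `utils::object.size`. The table `DT` and its columns are made up for the example, and the code assumes a data.table version in which `mb=` accepts a function.

```r
library(data.table)

# Hypothetical table with a character column and a list column
DT = data.table(id = 1:3, txt = c("a", "b", "c"), items = list(1:2, 3:4, 5:6))

# Default: fast estimate that skips shared strings and list column items
fast = tables(silent = TRUE)

# Slower estimate computed by a user-supplied function of the data.table
full = tables(mb = utils::object.size, silent = TRUE)

# Both calls return one row per data.table found in the calling environment
fast[NAME == "DT", .(NAME, NROW, NCOL)]
full[NAME == "DT", .(NAME, NROW, NCOL)]
```

For a table this small both estimates round to the same number of megabytes; the difference should only matter for tables holding many long strings or large list items.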
-
`by=.EACHI` when `i` is keyed but `on=` specifies different columns than `i`'s key could create an invalidly keyed result, #4603 #4911. Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries. -
`print(DT, trunc.cols=TRUE)` and the corresponding `datatable.print.trunc.cols` option (new feature 3 in v1.13.0) could incorrectly display an extra column, #4266. Thanks to @tdhock for the bug report and @MichaelChirico for the PR. -
`fread(..., nrows=0L)` now works as intended and the same as `nrows=0`; i.e. returning the column names and typed empty columns determined by the large sample, #4686, #4029. Thanks to @hongyuanjia and @michaelpaulhirsch for reporting, and Benjamin Schwendinger for the PR. Also thanks to @HughParsonage for testing dev and reporting a bug which was fixed before release. -
Passing `.SD` to `frankv()` with `ties.method='random'` or with `na.last=NA` failed with `.SD is locked`, #4429. Thanks @smarches for the report. -
Filtering a `data.table` using `which=NA` to return non-matching indices now works properly for non-optimized subsetting as well, closes #4411. -
When `j` returns an object whose class `"X"` inherits from `data.table`; i.e. class `c("X", "data.table", "data.frame")`, the derived class `"X"` is no longer incorrectly dropped from the class of the `data.table` returned, #4324. Thanks to @HJAllen for reporting and @shrektan for the PR. -
`as.data.table()` failed with `.subset2(x, i, exact = exact): attempt to select less than one element in get1index` when passed an object inheriting from `data.table` with a different `[[` method, such as the class `dfidx` from the `dfidx` package, #4526. Thanks @RicoDiel for the report, and Michael Chirico for the PR. -
`rbind()` and `rbindlist()` of length-0 ordered factors failed with `Internal error: savetl_init checks failed`, #4795 #4823. Thanks to @shrektan and @dbart79 for reporting, and @shrektan for fixing. -
`data.table(NULL)[, firstCol:=1L]` created `data.table(firstCol=1L)` ok but did not update the internal `row.names` attribute, causing `Error in '$<-.data.frame'(x, name, value) : replacement has 1 row, data has 0` when passed to packages like `ggplot` which use `DT` as if it is a `data.frame`, #4597. Thanks to Matthew Son for reporting, and Cole Miller for the PR. -
`X[Y, .SD, by=]` (joining and grouping in the same query) could segfault if i) `by=` is supplied custom data (i.e. not simple expressions of columns), and ii) some rows of `Y` do not match to any rows in `X`, #4892. Thanks to @Kodiologist for reporting, @ColeMiller1 for investigating, and @tlapak for the PR. -
Assigning a set of 2 or more all-NA values to a factor column could segfault, #4824. Thanks to @clerousset for reporting and @shrektan for fixing.
-
`as.data.table(table(NULL))` now returns `data.table(NULL)` rather than error `attempt to set an attribute on NULL`, #4179. The result differs slightly from `as.data.frame(table(NULL))` (0-row, 1-column) because 0-column works better with other `data.table` functions like `rbindlist()`. Thanks to Michael Chirico for the report and fix. -
`melt` with a list for `measure.vars` would output `variable` inconsistently between `na.rm=TRUE` and `FALSE`, #4455. Thanks to @tdhock for reporting and fixing. -
`by=...get()...` could fail with `object not found`, #4873 #4981. Thanks to @sindribaldur for reporting, and @OfekShilon for fixing. -
`print(x, col.names='none')` now removes the column names as intended for wide `data.table`s whose column names don't fit on a single line, #4270. Thanks to @tdhock for the report, and Michael Chirico for fixing. -
`DT[, min(colB), by=colA]` when `colB` is type `character` would miss blank strings (`""`) at the beginning of a group and return the smallest non-blank instead of blank, #4848. Thanks to Vadim Khotilovich for reporting and for the PR fixing it. -
Assigning a wrong-length or non-list vector to a list column could segfault, #4166 #4667 #4678 #4729. Thanks to @fklirono, Kun Ren, @kevinvzandvoort and @peterlittlejohn for reporting, and to Václav Tlapák for the PR.
-
`as.data.table()` on `xts` objects containing a column named `x` would return an `index` of type plain `integer` rather than `POSIXct`, #4897. Thanks to Emil Sjørup for reporting, and Jan Gorecki for the PR. -
A fix to `as.Date(c("", ...))` in R 4.0.3 (R bug report 17909) has been backported to `data.table::as.IDate()` so that it too now returns `NA` for the first item when it is blank, even in older versions of R back to 3.1.0, rather than the incorrect error `character string is not in a standard unambiguous format`, #4676. Thanks to Arun Srinivasan for reporting, and Michael Chirico both for the `data.table` PR and for submitting the patch to R that was accepted and included in R 4.0.3. -
`uniqueN(DT, by=character())` is now equivalent to `uniqueN(DT)` rather than internal error `'by' is either not integer or is length 0`, #4594. Thanks Marco Colombo for the report, and Michael Chirico for the PR. Similarly for `unique()`, `duplicated()` and `anyDuplicated()`. -
`melt()` on a `data.table` with `list` columns for `measure.vars` would silently ignore `na.rm=TRUE`, #5044. Now the same logic as `is.na()` from base R is used; i.e. if a list element is scalar NA then it is considered missing and removed. Thanks to Toby Dylan Hocking for the PRs. -
`fread(fill=TRUE)` could segfault if the input contained an improperly quoted character field, #4774 #5041. Thanks to @AndeolEvain and @e-nascimento for reporting and to Václav Tlapák for the PR. -
`fread(fill=TRUE, verbose=TRUE)` would segfault on the out-of-sample type bump verbose output if the input did not contain column names, #5046. Thanks to Václav Tlapák for the PR. -
`.SDcols=-V2:-V1` and `.SDcols=(-1)` could error with `xcolAns does not pass checks` and `argument specifying columns specify non existing column(s)`, #4231. Thanks to Jan Gorecki for reporting and the PR. Thanks Toby Dylan Hocking for tracking down an error caused by the initial fix and Michael Chirico for fixing it. -
`.SDcols=<logical vector>` is now documented in `?data.table` and it is now an error if the logical vector's length is not equal to the number of columns (consistent with `data.table`'s no-recycling policy; see new feature 1 in v1.12.2 Apr 2019), #4115. Thanks to @Henrik-P for reporting and Jan Gorecki for the PR. -
`melt()` now outputs scalar logical `NA` instead of `NULL` in rows corresponding to missing list columns, for consistency with non-list columns when using `na.rm=TRUE`, #5053. Thanks to Toby Dylan Hocking for the PR. -
`as.data.frame(DT)`, `setDF(DT)` and `as.list(DT)` now remove the `"index"` attribute which contains any indices (a.k.a. secondary keys), as they already did for other `data.table`-only attributes such as the primary key stored in the `"sorted"` attribute. When indices were left intact, a subsequent subset, assign, or reorder of the `data.frame` by `data.frame`-code in base R or other packages would not update the indices, causing incorrect results if then converted back to `data.table`, #4889. Thanks @OfekShilon for the report and the PR. -
`dplyr::arrange(DT)` uses `vctrs::vec_slice` which retains `data.table`'s class but uses C to bypass `[` method dispatch and does not adjust `data.table`'s attributes containing the index row numbers, #5042. `data.table`'s long-standing `.internal.selfref` mechanism to detect such operations by other packages was not being checked by `data.table` when using indexes, causing `data.table` filters and joins to use invalid indexes and return incorrect results after a `dplyr::arrange(DT)`. Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use `data.table` is `data.table::setkey(DT, col1, col2, ...)` which reorders `DT` by reference in parallel, sets the primary key for automatic use by subsequent `data.table` queries, and permits rowname-like usage such as `DT["foo",]` which returns the now-contiguous-in-memory block of rows where the first column of `DT`'s key contains `"foo"`. Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using `DT[.("foo",20210728L), ]`. Using `==` in `i` is also optimized to use the key or indices, if you prefer using column names explicitly and `==`. An alternative to `setkey(DT)` is returning a new ordered result using `DT[order(col1, col2, ...), ]`. -
A segfault occurred when `nrow/throttle < nthread`, #5077. With the default throttle of 1024 rows (see `?setDTthreads`), at least 64 threads would be needed to trigger the segfault since there needed to be more than 65,535 rows too. It occurred on a server with 256 logical cores where `data.table` uses 128 threads by default. Thanks to Bennet Becker for reporting, debugging at C level, and fixing. It also occurred when the throttle was increased so as to use fewer threads; e.g. at the limit `setDTthreads(throttle=nrow(DT))`. -
`fread(file=URL)` now works rather than error `does not exist or is non-readable`, #4952. `fread(URL)` and `fread(input=URL)` worked before and continue to work. Thanks to @pnacht for reporting and @ben-schwen for the PR. -
`fwrite(DF, row.names=TRUE)` where `DF` has specific integer rownames (e.g. using `rownames(DF) <- c(10L,20L,30L)`) would ignore the integer rownames and write the row numbers instead, #4957. Thanks to @dgarrimar for reporting and @ColeMiller1 for the PR. Further, when `quote='auto'` (default) and the rownames are integers (either default or specific), they are no longer quoted. -
`test.data.table()` would fail on test 1894 if the variable `z` was defined by the user, #3705. The test suite already ran in its own separate environment. That environment's parent is no longer `.GlobalEnv` to isolate it further. Thanks to Michael Chirico for reporting, and Matt Dowle for the PR. -
`fread(text="a,b,c")` (where input data contains no `\n` but `text=` has been used) now works instead of error `file not found: a,b,c`, #4689. Thanks to @trainormg for reporting, and @ben-schwen for the PR. -
`na.omit(DT)` did not remove `NA` in `nanotime` columns, #4744. Thanks Jean-Mathieu Vermosen for reporting, and Michael Chirico for the PR. -
`DT[, min(intCol, na.rm=TRUE), by=grp]` would return `Inf` for any groups containing all NAs, with a type change from `integer` to `numeric` to hold the `Inf`, and with warning. Similarly `max` would return `-Inf`. Now `NA` is returned for such all-NA groups, without warning or type change. This is almost-surely less surprising, more convenient, consistent, and efficient. There was no user request for this, likely because our desire to be consistent with base R in this regard was known (`base::min(x, na.rm=TRUE)` returns `Inf` with warning for all-NA input). Matt Dowle made this change when reworking internals, #5105. The old behavior seemed so bad, and since there was a warning too, it seemed appropriate to treat it as a bug.

    DT
    #         A     B
    #    <char> <int>
    # 1:      a     1
    # 2:      a    NA
    # 3:      b     2
    # 4:      b    NA

    DT[, min(B,na.rm=TRUE), by=A]  # no change in behavior (no all-NA groups yet)
    #         A    V1
    #    <char> <int>
    # 1:      a     1
    # 2:      b     2

    DT[3, B:=NA]                   # make an all-NA group
    DT
    #         A     B
    #    <char> <int>
    # 1:      a     1
    # 2:      a    NA
    # 3:      b    NA
    # 4:      b    NA

    DT[, min(B,na.rm=TRUE), by=A]  # old result
    #         A    V1
    #    <char> <num>              # V1's type changed to numeric (inconsistent)
    # 1:      a     1
    # 2:      b   Inf              # Inf surprising
    # Warning message:             # warning inconvenient
    # In gmin(B, na.rm = TRUE) :
    #   No non-missing values found in at least one group. Coercing to numeric
    #   type and returning 'Inf' for such groups to be consistent with base

    DT[, min(B,na.rm=TRUE), by=A]  # new result
    #         A    V1
    #    <char> <int>              # V1's type remains integer (consistent)
    # 1:      a     1
    # 2:      b    NA              # NA because there are no non-NA, naturally
    #                              # no inconvenient warning
On the same basis, `min` and `max` methods for empty `IDate` input now return `NA_integer_` of class `IDate`, rather than `NA_double_` of class `IDate` together with base R's warning `no non-missing arguments to min; returning Inf`, #2256. The type change and warning would cause an error in grouping, see example below. Since `NA` was returned before it seems clear that still returning `NA` but of the correct type and with no warning is appropriate, backwards compatible, and a bug fix. Thanks to Frank Narf for reporting, and Matt Dowle for fixing.

    DT
    #             d      g
    #        <IDat> <char>
    # 1: 2020-01-01      a
    # 2: 2020-01-02      a
    # 3: 2019-12-31      b

    DT[, min(d[d>"2020-01-01"]), by=g]

    # was:

    # Error in `[.data.table`(DT, , min(d[d > "2020-01-01"]), by = g) :
    #   Column 1 of result for group 2 is type 'double' but expecting type
    #   'integer'. Column types must be consistent for each group.
    # In addition: Warning message:
    # In min.default(integer(0), na.rm = FALSE) :
    #   no non-missing arguments to min; returning Inf

    # now :

    #         g         V1
    #    <char>     <IDat>
    # 1:      a 2020-01-02
    # 2:      b       <NA>
-
`DT[, min(int64Col), by=grp]` (and `max`) would return incorrect results for `bit64::integer64` columns, #4444. Thanks to @go-see for reporting, and Michael Chirico for the PR. -
`fread(dec=',')` was able to guess `sep=','` and return an incorrect result, #4483. Thanks to Michael Chirico for reporting and fixing. It was already an error to provide both `sep=','` and `dec=','` manually.

    fread('A|B|C\n1|0,4|a\n2|0,5|b\n', dec=',')  # no problem

    #        A     B      C
    #    <int> <num> <char>
    # 1:     1   0.4      a
    # 2:     2   0.5      b

    fread('A|B,C\n1|0,4\n2|0,5\n', dec=',')

    #       A|B     C    # old result guessed sep=',' despite dec=','
    #    <char> <int>
    # 1:    1|0     4
    # 2:    2|0     5

    #        A   B,C     # now detects sep='|' correctly
    #    <int> <num>
    # 1:     1   0.4
    # 2:     2   0.5
-
`IDateTime()` ignored the `tz=` and `format=` arguments because `...` was not passed through to submethods, #2402. Thanks to Frank Narf for reporting, and Jens Peder Meldgaard for the PR.

    IDateTime("20171002095500", format="%Y%m%d%H%M%S")

    # was :
    # Error in charToDate(x) :
    #   character string is not in a standard unambiguous format

    # now :
    #         idate    itime
    #        <IDat>  <ITime>
    # 1: 2017-10-02 09:55:00
-
`DT[i, sum(b), by=grp]` (and other optimized-by-group aggregates: `mean`, `var`, `sd`, `median`, `prod`, `min`, `max`, `first`, `last`, `head` and `tail`) could segfault if `i` contained row numbers and one or more were NA, #1994. Thanks to Arun Srinivasan for reporting, and Benjamin Schwendinger for the PR. -
`identical(fread(text="A\n0.8060667366\n")$A, 0.8060667366)` is now TRUE, #4461. This is one of 13 numbers in the set of 100,000 between 0.80606 and 0.80607 in 0.0000000001 increments that were not already identical. In all 13 cases R's parser (same as `read.table`) and `fread` straddled the true value by a very similar small amount. `fread` now uses `/10^n` rather than `*10^-n` to match R identically in all cases. Thanks to Gabe Becker for requesting consistency, and Michael Chirico for the PR.

    for (i in 0:99999) {
      s = sprintf("0.80606%05d", i)
      r = eval(parse(text=s))
      f = fread(text=paste0("A\n",s,"\n"))$A
      if (!identical(r, f))
        cat(s, sprintf("%1.18f", c(r, f, r)), "\n")
    }
    #        input    eval & read.table         fread before            fread now
    # 0.8060603509 0.806060350899999944 0.806060350900000055 0.806060350899999944
    # 0.8060614740 0.806061473999999945 0.806061474000000056 0.806061473999999945
    # 0.8060623757 0.806062375699999945 0.806062375700000056 0.806062375699999945
    # 0.8060629084 0.806062908399999944 0.806062908400000055 0.806062908399999944
    # 0.8060632774 0.806063277399999945 0.806063277400000056 0.806063277399999945
    # 0.8060638101 0.806063810099999944 0.806063810100000055 0.806063810099999944
    # 0.8060647118 0.806064711799999944 0.806064711800000055 0.806064711799999944
    # 0.8060658349 0.806065834899999945 0.806065834900000056 0.806065834899999945
    # 0.8060667366 0.806066736599999945 0.806066736600000056 0.806066736599999945
    # 0.8060672693 0.806067269299999944 0.806067269300000055 0.806067269299999944
    # 0.8060676383 0.806067638299999945 0.806067638300000056 0.806067638299999945
    # 0.8060681710 0.806068170999999944 0.806068171000000055 0.806068170999999944
    # 0.8060690727 0.806069072699999944 0.806069072700000055 0.806069072699999944
    #
    # remaining 99,987 of these 100,000 were already identical
-
`dcast(empty-DT)` now returns an empty `data.table` rather than error `Cannot cast an empty data.table`, #1215. Thanks to Damian Betebenner for reporting, and Matt Dowle for fixing. -
`DT[factor("id")]` now works rather than error `i has evaluated to type integer. Expecting logical, integer or double`, #1632. `DT["id"]` has worked forever by automatically converting to `DT[.("id")]` for convenience, and joins have worked forever between char/fact, fact/char and fact/fact even when levels mismatch, so it was unfortunate that `DT[factor("id")]` managed to escape the simple automatic conversion to `DT[.(factor("id"))]` which is now in place. Thanks to @aushev for reporting, and Matt Dowle for the fix. -
All-NA character key columns could segfault, #5070. Thanks to @JorisChau for reporting and Benjamin Schwendinger for the fix.
-
In v1.13.2 a version of an old bug was reintroduced where during a grouping operation list columns could retain a pointer to the last group. This affected only attributes of list elements and only if those were updated during the grouping operation, #4963. Thanks to @fujiaxiang for reporting and @avimallu and Václav Tlapák for investigating and the PR.
-
`shift(xInt64, fill=0)` and `shift(xInt64, fill=as.integer64(0))` (but not `shift(xInt64, fill=0L)`) would error with `INTEGER() can only be applied to a 'integer', not a 'double'` where `xInt64` conveys `bit64::integer64`, `0` is type `double` and `0L` is type integer, #4865. Thanks to @peterlittlejohn for reporting and Benjamin Schwendinger for the PR. -
`DT[i, strCol:=classVal]` did not coerce using the `as.character` method for the class, resulting in either an unexpected string value or an error such as `To assign integer64 to a target of type character, please use as.character() for clarity`. Discovered during work on the previous issue, #5189.

    DT
    #         A
    #    <char>
    # 1:      a
    # 2:      b
    # 3:      c
    DT[2, A:=as.IDate("2021-02-03")]
    DT[3, A:=bit64::as.integer64("4611686018427387906")]
    DT
    #                      A
    #                 <char>
    # 1:                   a
    # 2:          2021-02-03  # was 18661
    # 3: 4611686018427387906  # was error 'please use as.character'
-
`tables()` failed with `argument "..." is missing` when called from within a function taking `...`; e.g. `function(...) { tables() }`, #5197. Thanks @greg-minshall for the report and @michaelchirico for the fix. -
`DT[, prod(int64Col), by=grp]` produced wrong results for `bit64::integer64` due to incorrect optimization, #5225. Thanks to Benjamin Schwendinger for reporting and fixing. -
`fintersect(..., all=TRUE)` and `fsetdiff(..., all=TRUE)` could return incorrect results when the inputs had columns named `x` and `y`, #5255. Thanks @Fpadt for the report, and @ben-schwen for the fix. -
`fwrite()` could produce not-ISO-compliant timestamps such as `2023-03-08T17:22:32.:00Z` when under a whole second by less than numerical tolerance of one microsecond, #5238. Thanks to @avraam-inside for the report and Václav Tlapák for the fix. -
`merge.data.table()` silently ignored the `incomparables` argument, #2587. It is now implemented and any other ignored arguments (e.g. misspellings) are now warned about. Thanks to @GBsuperman for the report and @ben-schwen for the fix. -
`DT[, c('z','x') := {x=NULL; list(2,NULL)}]` now removes column `x` as expected rather than incorrectly assigning `2` to `x` as well as `z`, #5284. The `x=NULL` is superfluous while the `list(2,NULL)` is the final value of `{}` whose items correspond to `c('z','x')`. Thanks @eutwt for the report, and @ben-schwen for the fix. -
`as.data.frame(DT, row.names=)` no longer silently ignores `row.names`, #5319. Thanks to @dereckdemezquita for the fix and PR, and @ben-schwen for guidance. -
`data.table(...)` unnamed arguments are deparsed in an attempt to name the columns but when called from `do.call()` the input data itself was deparsed taking a very long time, #5501. Many thanks to @OfekShilon for the report and fix, and @michaelchirico for guidance. Unnamed arguments to `data.table(...)` may now be faster in other cases not involving `do.call()` too; e.g. expressions spanning a lot of lines or other function call constructions that led to the data itself being deparsed.

    DF = data.frame(a=runif(1e6), b=runif(1e6))
    DT1 = data.table(DF)                 # 0.02s before and after
    DT2 = do.call(data.table, list(DF))  # 3.07s before, 0.02s after
    identical(DT1, DT2)                  # TRUE
-
`fread(URL)` with `https:` and `ftps:` could timeout if proxy settings were not guessed right by `curl::curl_download`, #1686. `fread(URL)` now uses `download.file()` as default for downloading files from urls. Thanks to @cderv for the report and Benjamin Schwendinger for the fix. -
`split.data.table()` works for downstream methods that don't implement `DT[i]` form (i.e., requiring `DT[i, j]` form, like plain `data.frame`s), for example `sf`'s `[.sf`, #5365. Thanks @barryrowlingson for the report and @michaelchirico for the fix.
-
New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example:

    x = c(2L,NA,4L,5L)
    nafill(x, fill=3)                 # no warning; requiring 3L too inconvenient
    nafill(x, fill="3")               # warns in case either x or "3" was a mistake
    nafill(x, fill=3.14)              # warns that precision has been lost
    nafill(x, fill=as.integer(3.14))  # no warning; the as.<type> conveys intent
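The same policy can be sketched with `set()` on a column, which is the loop scenario the note mentions. This is only an illustration: the table `DT` and the values are invented, and the character assignment is wrapped in `tryCatch()` on the assumption that it raises a warning as described above.

```r
library(data.table)

# Hypothetical integer column with one missing value
DT = data.table(x = c(2L, NA, 4L, 5L))

# Double 0 assigned into an integer column: coerced silently, column stays integer
for (i in which(is.na(DT$x))) set(DT, i, "x", 0)
class(DT$x)

# Character "7" assigned into an integer column: expected to warn, so capture the
# condition rather than print it (tryCatch abandons the call at the warning)
msg = tryCatch({ set(DT, 1L, "x", "7"); NA_character_ },
               warning = function(w) conditionMessage(w))

# Stating the intent with as.<type>() avoids the warning
set(DT, 1L, "x", as.integer("7"))
DT$x
```

Wrapping the right-hand side in `as.integer()` is the form the note recommends when the coercion from character is intended.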
-
`CsubsetDT` exported C function has been renamed to `DT_subsetDT`. This requires `R_GetCCallable("data.table", "CsubsetDT")` to be updated to `R_GetCCallable("data.table", "DT_subsetDT")`. Additionally there is now a dedicated header file for data.table C exports `include/datatableAPI.h`, #4643, thanks to @eddelbuettel, which makes it easier to import data.table C functions. -
In v1.12.4, fractional `fread(..., stringsAsFactors=)` was added. For example if `stringsAsFactors=0.2`, any character column with fewer than 20% unique strings would be cast as `factor`. This is now documented in `?fread` as well, #4706. Thanks to @markderry for the PR. -
`cube(DT, by="a")` now gives a more helpful error that `j` is missing, #4282. -
v1.13.0 (July 2020) fixed a segfault/corruption/error (depending on version of R and circumstances) in `dcast()` when `fun.aggregate` returned `NA` (type `logical`) in an otherwise `character` result, #2394. This fix was the result of other internal rework and there was no news item at the time. A new test to cover this case has now been added. Thanks Vadim Khotilovich for reporting, and Michael Chirico for investigating, pinpointing when the fix occurred and adding the test. -
`DT[subset]` where `DT[(subset)]` or `DT[subset==TRUE]` was intended; i.e., subsetting by a logical column whose name conflicts with an existing function, now gives a friendlier error message, #5014. Thanks @michaelchirico for the suggestion and PR, and @ColeMiller1 for helping with the fix. -
Grouping by a `list` column has its error message improved stating this is unsupported, #4308. Thanks @sindribaldur for filing, and @michaelchirico for the PR. Please add your vote and especially use cases to the #1597 feature request. -
OpenBSD 6.9 released May 2021 uses a 16 year old version of zlib (v1.2.3 from 2005) plus cherry-picked bug fixes (i.e. a semi-fork of zlib) which induces `Compress gzip error: -9` from `fwrite()`, #5048. Thanks to Philippe Chataignon for investigating and fixing. Matt asked on OpenBSD's mailing list if zlib could be upgraded to 4 year old zlib 1.2.11 but forgot his tin hat: https://marc.info/?l=openbsd-misc&m=162455479311886&w=1. -
`?"."`, `?".."`, `?".("`, and `?".()"` now point to `?data.table`, #4385 #4407, to help users find the documentation for these convenience features available inside `DT[...]`. Recall that `.` is an alias for `list`, and `..var` tells `data.table` to look for `var` in the calling environment as opposed to a column of the table. -
`DT[, lhs:=rhs]` and `set(DT, , lhs, rhs)` no longer raise a warning on zero length `lhs`, #4086. Thanks to Jan Gorecki for the suggestion and PR. For example, `DT[, grep("foo", names(DT)) := NULL]` no longer warns if there are no column names containing `"foo"`. -
`melt()`'s internal C code is now more memory efficient, #5054. Thanks to Toby Dylan Hocking for the PR. -
`?merge` and `?setkey` have been updated to clarify that the row order is retained when `sort=FALSE`, and why `NA`s are always first when `sort=TRUE`, #2574 #2594. Thanks to Davor Josipovic and Markus Bonsch for the reports, and Jan Gorecki for the PR. -
For nearly two years, since v1.12.4 (Oct 2019) (note 11 below in this NEWS file), using `options(datatable.nomatch=0)` has produced the following message:
The option 'datatable.nomatch' is being used and is not set to the default NA. This option is still honored for now but will be deprecated in future. Please see NEWS for 1.12.4 for detailed information and motivation. To specify inner join, please specify `nomatch=NULL` explicitly in your calls rather than changing the default using this option.
The message is now upgraded to a warning stating that the option is ignored.
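As a minimal sketch of the recommended replacement, the snippet below requests an inner join per call with `nomatch=NULL` instead of relying on the global option. The tables `X` and `Y` and their columns are hypothetical, made up purely for illustration.

```r
library(data.table)

# Hypothetical example tables (not from the NEWS item itself)
X <- data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c"))
Y <- data.table(id = c(2L, 3L, 4L), y = c(20, 30, 40))

# Default behaviour (nomatch = NA): one row per row of Y,
# with NA in X's columns where Y's id has no match in X.
outer <- X[Y, on = "id"]

# Inner join requested explicitly in the call itself,
# rather than via options(datatable.nomatch = 0).
inner <- X[Y, on = "id", nomatch = NULL]
```

Setting `nomatch` in the call keeps the join type visible at the point of use, so the result does not depend on session-wide options.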
-
The options `datatable.print.class` and `datatable.print.keys` are now `TRUE` by default. They have been available since v1.9.8 (Nov 2016) and v1.11.0 (May 2018) respectively. -
Thanks to @ssh352, Václav Tlapák, Cole Miller, András Svraka and Toby Dylan Hocking for reporting and bisecting a significant performance regression in dev. This was fixed before release thanks to a PR by Jan Gorecki, #5463.
-
`key(x) <- value` is now fully deprecated (from warning to error). Use `setkey()` to set a table's key. We started warning not to use this approach in 2012, with a stronger warning starting in 2019 (1.12.2). This function will be removed in the next release. -
Argument `logicalAsInt` to `fwrite()` now warns. Use `logical01` instead. We stated the intention to begin removing this option in 2018 (v1.11.0). It will be upgraded to an error in the next release before being removed in the subsequent release. -
Option `datatable.CJ.names` no longer has any effect, after becoming `TRUE` by default in v1.12.2 (2019). Setting it now gives a warning, which will be dropped in the next release. -
In the NEWS for v1.11.0 (May 2018), section "NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES" item 2, we stated the intention to eventually change `logical01` to be `TRUE` by default. After some consideration, reflection, and community input, we have decided not to move forward with this plan, i.e., `logical01` will remain `FALSE` by default in both `fread()` and `fwrite()`. See discussion in #5856; most importantly, we think changing the default would be a major disruption to reading "sharded" CSVs where data with the same schema is split into many files, some of which could be converted to `logical` while others remain `integer`. We will retain the option `datatable.logical01` for users who wish to use a different default -- for example, if you are doing input/output on tables with a large number of logical columns, where writing '0'/'1' to the CSV many millions of times is preferable to writing 'TRUE'/'FALSE'. -
Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` always uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, #5331. The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the ad-hoc column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation.