Vignettes #944

arunsrinivasan · 2014-11-11T12:49:31Z

HTML vignette series:

Planned for v1.9.8

Quick tour of data.table
Keys and fast binary search based subset
Secondary indices and auto indexing
Joins vignette. a) Joins vs subsets: extending binary search based subsets to joins, plus conditional / non-equi joins, rolling joins and interval joins. b) by=.EACHI and the join + update feature. c) Document i.col usage, as filed in "Docs: explain and document the i.col notation for joins" #1038. d) Also cover the performance trade-offs raised in "on performing slower than double setkey" #1232.
~~[ ] Cover get() and mget(). E.g., http://stackoverflow.com/q/33785747/559784~~ covered in programming on data.table #4304
Add the rationale for the on= argument to the FAQ ([Documentation] Use of the on= argument for joins #1623).
FAQ 5.3 needs to mention that a shallow copy is made in order to restore over-allocation. Thanks to Jan for linking it in ":= changes address of a data table" #1729.
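
As a rough illustration of what the planned joins vignette would cover, here is a minimal sketch of an ad hoc join with on=, the i. prefix, by=.EACHI and a join + update. The tables X, Y and U and their columns are made up for this example; they are not taken from any existing vignette.

```r
library(data.table)

# Hypothetical toy tables, invented for illustration only.
X <- data.table(id = c("a", "b", "c"), v = 1:3)
Y <- data.table(id = c("a", "a", "c"), v = c(10L, 20L, 30L))

# Ad hoc join with on=; no setkey() is needed beforehand.
# Both tables have a column v, so Y's copy appears as i.v in the result.
joined <- X[Y, on = "id"]

# by=.EACHI evaluates j once for each row of i (here, each row of X).
# .N counts the matching rows of Y, and is 0 where nothing matches.
counts <- Y[X, .N, on = "id", by = .EACHI]

# Join + update: overwrite X$v by reference wherever U has a matching id.
# Inside j, i.v refers to the v column of the table passed in i (U).
U <- data.table(id = c("a", "c"), v = c(100L, 300L))
X[U, on = "id", v := i.v]
```

After the last line, only the rows of X with a match in U are changed; the row with id "b" keeps its original value.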

Future releases

Finished:

Introduction to data.table - data.table syntax, general form, subset rows in i, select / do in j and aggregations using by.
Reference Semantics (add/update/delete columns by reference, and see that we can combine with i and by in the same way as before)
Efficient reshaping using data.tables
Link to this answer on SO on by=.EACHI until the vignette is done.

Minor:

Operations using integer64, and promoting it for large integers.

Notes (to update current vignettes based on feedback). Please let me know if I missed anything.

Introduction to data.table:

order in i.
Explain how to name columns in j while selecting/computing.
Emphasise that keyby sorts and keys the computed result after grouping; it does not key the original data.table.
Mention new updates to .SDcols and cols in with=FALSE being able to select columns as colA:colB.
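
A small sketch of the four points above, using a made-up table DT (the column names are placeholders, not from the vignette):

```r
library(data.table)

# Hypothetical toy data, invented for illustration only.
DT <- data.table(g = c("b", "a", "b", "a"), x = 4:1,
                 y = c(1.5, 2.5, 3.5, 4.5), z = 1:4)

# order() in i: sort by g ascending, then x descending.
sorted <- DT[order(g, -x)]

# Naming columns in j: wrap the expression in .() and give it a name.
means <- DT[, .(mean_y = mean(y)), by = g]

# keyby groups like by, then sorts the *result* and sets its key.
# The original DT is left unkeyed.
sums <- DT[, .(sum_x = sum(x)), keyby = g]

# Column ranges: colA:colB in .SDcols and with with = FALSE.
totals <- DT[, lapply(.SD, sum), by = g, .SDcols = x:z]
picked <- DT[, x:z, with = FALSE]
```

Here key(sums) is "g" while key(DT) stays NULL, which is the distinction the keyby note asks the vignette to emphasise.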

Reference semantics:

Also explain all other relevant set* functions here (setnames, setcolorder, etc.).
Mainly set.
Explain that section 1b), on the := operator, only describes the ways to use it; the example there is not meant to run as-is, because it just shows the two different forms (following this comment).
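
For reference, a minimal sketch of the two := forms alongside a few set* functions. The table and column names are invented for this example.

```r
library(data.table)

# Hypothetical toy data, invented for illustration only.
DT <- data.table(a = 1:3, b = 4:6)

# Form 1: LHS := RHS
DT[, c := a + b]

# Form 2: functional form, several columns at once
DT[, `:=`(d = a * 2L, e = b * 2L)]

# set* functions also modify DT by reference, without a copy.
setnames(DT, "c", "total")                        # rename a column
setcolorder(DT, c("total", "a", "b", "d", "e"))   # reorder columns
set(DT, i = 1L, j = "a", value = 100L)            # assign to row 1 of column a

# Delete a column by assigning NULL.
DT[, e := NULL]
```

None of these calls needs the result to be assigned back to DT, since each one changes the table in place.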

Keys and fast binary search based subsets:

Add an example of subset using integer/double keys.
Difference in "nomatch" default in binary search based subsets.
Is replacing NAs possible with binary search based subsets?
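
A sketch of what the first two notes might look like in practice, with made-up tables. It shows keyed subsets on an integer and a double key, and how the nomatch default differs from an ordinary logical subset.

```r
library(data.table)

# Hypothetical toy data, invented for illustration only.
DT <- data.table(id = c(3L, 1L, 2L, 1L), v = c("c", "a", "b", "d"))
setkey(DT, id)                       # sorts DT by id and marks it as keyed

one <- DT[.(1L)]                     # binary search on the integer key

# Default nomatch = NA: the missing key 5 still yields a row, with v = NA.
with_na <- DT[.(c(1L, 5L))]

# nomatch = NULL drops keys that have no match.
no_na <- DT[.(c(1L, 5L)), nomatch = NULL]

# An ordinary logical subset never returns rows for missing values.
scan <- DT[id %in% c(1L, 5L)]

# The same idea with a double key.
DD <- data.table(x = c(0.5, 1.5, 2.5), v = 1:3)
setkey(DD, x)
hit <- DD[.(1.5)]
```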

FAQ (most appropriate here, I think).

Update the FAQ with the issue of the external pointer being NULL after reading an R object from a file, for example using readRDS(). Update this SO post.
Explain, with an example, how to over-allocate a data.table using alloc.col(), when to use it (when you need to create multiple columns), and why. Update this SO post.
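
A possible shape for these two FAQ entries, as a sketch only. It assumes that setDT() is an acceptable way to restore the internal pointer after readRDS(), and it uses alloc.col() as named above (newer releases also provide it as setalloccol()). The file is a throwaway temp file.

```r
library(data.table)

DT <- data.table(a = 1:3)

# Round-trip through a file: the internal self-reference pointer
# does not survive serialisation.
f <- tempfile(fileext = ".rds")
saveRDS(DT, f)
DT2 <- readRDS(f)

# Restore over-allocation so that := can add columns by reference again.
setDT(DT2)
DT2[, b := a * 2L]

# Reserve spare column slots up front when many columns will be added.
# 2048 is an arbitrary number chosen for this illustration.
DT2 <- alloc.col(DT2, 2048L)
slots <- truelength(DT2)   # allocated column slots, including spare ones

unlink(f)
```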

matthieugomez · 2014-11-14T14:25:25Z

I'm curious about what makes a cold by faster than, say, tapply. One part of the answer is gforce, but what about user-written functions? I could not find anything about this. There's a nice post about pandas: http://wesmckinney.com/blog/?p=489
One could even compare it with sapply. For instance, suppose I start from a list of vectors. Is it ever worth it to append all the vectors into one column of a data.table and use by instead of sapply?
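
To make the comparison concrete, here is a sketch of the two approaches on a tiny made-up list. It only shows that they give the same answer; it makes no claim about which is faster, which would need benchmarking on realistic data.

```r
library(data.table)

# Hypothetical list of vectors, invented for illustration only.
lst <- list(a = c(1, 2, 3), b = c(4, 5), c = 6)

# Approach 1: sapply over the list.
via_sapply <- sapply(lst, mean)

# Approach 2: stack everything into one column plus a group id, then use by.
DT <- data.table(grp = rep(names(lst), lengths(lst)),
                 val = unlist(lst, use.names = FALSE))
via_by <- DT[, .(m = mean(val)), by = grp]
```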

markdanese · 2014-11-30T02:44:50Z

Being new to R and data.table (since March), I would say that there needs to be a basic outcome-oriented introduction as opposed to the current function-oriented one. In other words, it is one thing to read what each parameter in data.table does, but they often make little sense without having a use-case in mind. While there are examples of output, many people need to go the other direction. That is, they know what output they need, but they don't know what function/parameter/setting is most appropriate to use. It would be helpful to have a simple recipe approach to get them started.

How do I create subsets of my data?
How do I do an operation on subsets of my data to create a new or updated data set?
How do I add a new column?
How do I delete a column?
How do I create a single variable?
How do I create multiple variables?
How do I do different operations on different subsets of my data? (.BY)
How do I use data.table in a function and pass in data.table names and columns on which to operate?
How do I do multiple sequential operations on the same data.table?
Can I select a subset of data and do an operation on it at the same time?
When do I need to be careful about creating/updating variables by reference?
How do I select one observation per group (first, last)?
How do I set a key and how is it different from setting an index?
Under what conditions does my key get deleted when I do an operation on my data.table?
Can I just use the regular "merge" syntax or do I need to use data.table syntax (Y[X])?
How do I collapse a list of lists into one big data.table? What if the columns are in different order?

There are probably a ton of other items all on SO that could be edited into a simple compilation of questions and answers.
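
A few of the questions above, answered recipe-style as a minimal sketch. The table and column names are made up for the example.

```r
library(data.table)

# Hypothetical toy data, invented for illustration only.
DT <- data.table(g = c("x", "x", "y", "y"), v = 1:4)

# How do I create subsets of my data?
sub <- DT[v > 1]

# How do I add a new column? (by reference, no copy)
DT[, v2 := v * 2L]

# How do I create multiple variables?
DT[, c("lo", "hi") := .(v - 1L, v + 1L)]

# How do I delete a column?
DT[, lo := NULL]

# How do I select one observation per group (first, last)?
first_rows <- DT[, .SD[1L], by = g]
last_rows  <- DT[, .SD[.N], by = g]

# How do I do multiple sequential operations? Chain the brackets.
chained <- DT[v > 1][, .(total = sum(v)), by = g]

# How do I collapse a list of tables whose columns are in different order?
parts <- list(data.table(a = 1, b = "p"), data.table(b = "q", a = 2))
stacked <- rbindlist(parts, use.names = TRUE)
```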

smartinsightsfromdata · 2015-01-30T23:22:06Z

Great work on these vignettes!
My comments may be late or already covered:

I would like to see a variety of ways / examples of using dynamic rows and columns.
More extensive comparison on merge and joins.
Different / richer ways to use set. Also, it would be nice to see an explanation of why the following gives an error (see here):

for (j in valCols)
  set(dt_,
      i = which(is.na(dt_[[j]])),
      j = j,
      value = as.numeric(originTable[[j]]))
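
One plausible cause, assuming dt_ and originTable have the same number of rows: value is the full column, while i selects only the NA rows, so the lengths do not match. The sketch below, with invented data, subsets value by the same index.

```r
library(data.table)

# Hypothetical stand-ins for the tables in the snippet above.
dt_ <- data.table(id = 1:4, a = c(1, NA, 3, NA), b = c(NA, 2, NA, 4))
originTable <- data.table(id = 1:4, a = c(10, 20, 30, 40),
                          b = c(50, 60, 70, 80))
valCols <- c("a", "b")

for (j in valCols) {
  idx <- which(is.na(dt_[[j]]))   # rows to fill in this column
  # value must have length 1 or length(idx), so take the same rows
  set(dt_, i = idx, j = j, value = as.numeric(originTable[[j]][idx]))
}
```

This relies on the rows of the two tables lining up one to one; if they do not, a join on id would be the safer route.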

pakom · 2016-11-30T15:06:39Z

Thank you for the updated vignettes with the release of v1.9.8.
The "Reference semantics" refers to the copy() function and its new capabilities to make shallow copies (especially inside functions, something that I am really interested in):

"However we could improve this functionality further by shallow copying instead of deep copying. In fact, we would very much like to provide this functionality for v1.9.8. We will touch up on this again in the data.table design vignette."

But the design vignette is missing and the link points to an old issue. The reference manual does not provide more information on copy() than the one provided in the vignette. The rest of the vignettes do not provide any information on copy.

Will this vignette become available soon?
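
For context, a minimal sketch of what copy() does today inside a function, i.e. a deep copy. The function names are hypothetical, and the shallow-copy behaviour quoted above is not shown.

```r
library(data.table)

# Hypothetical helper: modifies the caller's table by reference.
add_flag_by_ref <- function(dt) {
  dt[, flag := TRUE]
  invisible(dt)
}

# Hypothetical helper: deep-copies first, so the caller's table is untouched.
add_flag_on_copy <- function(dt) {
  dt <- copy(dt)
  dt[, flag := TRUE]
  dt
}

DT <- data.table(a = 1:3)

res <- add_flag_on_copy(DT)
unchanged <- names(DT)      # still just "a"

add_flag_by_ref(DT)
changed <- names(DT)        # now "a" and "flag"
```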

MichaelChirico · 2017-08-11T05:27:45Z

+1 for internals vignette. I (and I guess a few others) am quite interested in contributing a bit on the C side of things, but am a bit intimidated by the (as it stands) 35k lines of C code... quite the learning curve to 'go it alone' -- an intro to internals could do wonders!

zeomal · 2020-04-24T11:38:46Z

Wanted to chime in and ask if contributions to the vignette are accepted from non-code contributors (like me). I am particularly interested in contributing to the joins vignette as I had quite a bit of trouble with it initially and was guided to solutions from Arun's answers on Stackoverflow, and I'd like some guidance on how to do so, if allowed.

Henrik-P · 2020-04-24T12:00:34Z

@arunsrinivasan I see that you have a point for an IDateTime vignette. Perhaps it could be included in the more general vignette suggested by @jangorecki: vignettes: timeseries - ordered observations?

In addition, I am preparing a first draft on some of the topics suggested by jan. Perhaps parts of it may be relevant for a join vignette as well? I'm happy to share if anyone may find it useful.

MichaelChirico · 2020-04-24T15:23:51Z

@zeomal such a contribution would be highly valuable and much appreciated!

zeomal · 2020-04-24T16:03:29Z

@MichaelChirico, thank you. @Henrik-P, will your brief on normal joins be comprehensive - i.e. will your focus be more on timeseries? If not, I can start work on it - I haven't used rolling joins yet, so no knowledge there. :)

Henrik-P · 2020-04-24T16:24:25Z

@zeomal Hopefully I will be able to upload the first draft soon, so you can have a look at it. In my draft, I provide a simple example of a "normal" join on a single variable, time, where there are non-matching rows. I use nomatch = NA. (maaaybe also a quick example with nomatch = NULL)

My idea was that this simple join could provide a context and a feeling for the problem, which I then treat more thoroughly in the following sections on rolling and non-equi joins et al.

Thanks a lot for your willingness to contribute!
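
A sketch of the kind of example described here, with invented data rather than anything from the draft: a plain join on a single time variable with non-matching rows, under both nomatch settings, followed by a rolling join.

```r
library(data.table)

# Hypothetical observations and query times, invented for illustration.
obs   <- data.table(time = c(1, 5, 10), value = c("a", "b", "c"))
query <- data.table(time = c(1, 4, 7, 12))

# Plain equi join: query times without an exact match get NA.
plain <- obs[query, on = "time"]

# nomatch = NULL drops the non-matching rows instead.
inner <- obs[query, on = "time", nomatch = NULL]

# Rolling join: carry the last observation forward to each query time.
rolled <- obs[query, on = "time", roll = TRUE]
```

With roll = TRUE, each query time picks up the most recent observation at or before it, so times 4, 7 and 12 receive the values observed at 1, 5 and 10.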

Henrik-P · 2020-04-25T18:46:22Z

@zeomal If you wish to check how brief my treatment on normal (equi) joins is, I just want to let you know that I posted a PR on a timeseries vignette.

arunsrinivasan added internals documentation High labels Nov 11, 2014

arunsrinivasan added this to the v1.9.8 milestone Nov 16, 2014

arunsrinivasan self-assigned this Nov 26, 2014

arunsrinivasan mentioned this issue Apr 4, 2016

[Documentation] Use of the on= argument for joins #1623

Closed

arunsrinivasan mentioned this issue Jul 22, 2016

Docs: explain and document the i.col notation for joins #1038

Closed

mattdowle modified the milestones: v2.0.0, v1.9.8 Sep 15, 2016

franknarf1 mentioned this issue May 27, 2017

join vignette #2181

Closed

franknarf1 mentioned this issue Dec 11, 2017

add examples of named list elements in i #1945

Open

franknarf1 mentioned this issue May 9, 2018

fread (and fwrite) vignette #2855

Open

mattdowle removed this from the Candidate milestone May 10, 2018

franknarf1 mentioned this issue Dec 4, 2018

Rolling join does not retrieve duplicated date in x #3180

Open

MichaelChirico mentioned this issue Dec 12, 2018

when will sep2 in fread be implemented? #1162

Open

asardaes mentioned this issue Jun 10, 2019

Supporting joining verbs asardaes/table.express#1

Closed

MichaelChirico mentioned this issue Aug 15, 2019

unexpected subsetting results #2396

Closed

jangorecki removed the High label Jun 3, 2020

lucasmation mentioned this issue Mar 1, 2025

Joins in data.table vignette. Simplify and extend the "update by reference section" #6846

Open
