Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An empty DT() doesn't print at the end of a pipe chain #5106

Open
markfairbanks opened this issue Aug 20, 2021 · 10 comments · Fixed by #5113
Open

An empty DT() doesn't print at the end of a pipe chain #5106

markfairbanks opened this issue Aug 20, 2021 · 10 comments · Fixed by #5113

Comments

@markfairbanks
Copy link

markfairbanks commented Aug 20, 2021

I was testing out the new DT() function and was expecting it to cause printing much like the equivalent [.data.table syntax.

library(data.table)

test_df = data.table(x = 1:3, y = 1:3)

# `[` syntax
copy(test_df)[, double_x := x * 2][]
#>    x y double_x
#> 1: 1 1        2
#> 2: 2 2        4
#> 3: 3 3        6

# Using `DT()` (no printing)
copy(test_df) |>
  DT(, double_x := x * 2) |>
  DT()

Edit: One option would be to define DT() as:

DT = function(x, ...) {
  x[...]
}
@mattdowle
Copy link
Member

Thanks for testing! Yes agree on DT().

I guess you mean

DT = function(x, ...) {
  `[.data.table`(x, ...)
}

otherwise method dispatch would mean x[...] wouldn't work on a data.frame. I don't know why that would solve printing, and I'm having heebie jeebies remembering the hoops I jumped through to get DT[] to print, but I'll give it a go ...

@mattdowle mattdowle added this to the 1.14.1 milestone Aug 25, 2021
@mattdowle
Copy link
Member

mattdowle commented Aug 25, 2021

I've changed DT() so that it always prints when it's returned to the REPL. Since it's functional form, that's what sprung to mind to do. This way you don't need to tack on a |> DT().

@Kamgang-B
Copy link
Contributor

@mattdowle thank you for all your effort!
I wonder if it is really good to always print the output when piping (using DT function).
While this helps to avoid the extra |> DT(), it also has some drawbacks:

  1. When running chunks of codes (several blocks of codes where each is based on DT function), the output of all chunks gets printed while only the output of the last block is likely needed (to be printed).
  2. When we are creating a variable or performing some data.table query, printing is useful to inspect the value of the output; and once we deem that the value created is correct, we don't usually want to print the output of that piece/chunk of code/s anymore because it is no longer useful to do so. If the DT always print its output when piping, the only ways (I think) to avoid this behavior is either to reassign the (whole) output (like in X = X |> DT(, names(X) := lapply(.SD, min))) or to pipe the last output to the invisible function (like in X |> DT(, names(X) := lapply(.SD, min)) |> invisible()). In the first case (reassigning), I think it is not convenient to make an assignment every time we want to avoid to print the output just to get rid of printing. The latter case (calling the invisible function) is a bit weird: it is an additional layer that (in my opinion) does not have any significant importance and is distractive, and therefore can be avoided.
  3. Always printing creates an inconsistency with the classical data.table chaining style based on square brackets; and I wonder what could be the real benefit of this difference (given that DT is (I think) supposed to be the functional form of square brackets chaining style and so should behave similarly.)

If always printing the output when returning to REPL is really needed, a good compromise could be to make it optional
(via something like options(datatable.print.end.pipe.always=TRUE/FALSE) or options(datatable.print.REPL=TRUE/FALSE): TRUE if the output after piping should always be printed) and set the default behavior to FALSE to be consistent with the square brackets chaining style.

@mattdowle
Copy link
Member

mattdowle commented Sep 2, 2021

@Kamgang-B it doesn't always print. By "always prints when it's returned to the REPL" that means when the last value in a chain is returned to the REPL then it always prints. As opposed to the last value not printing when that last expression is a :=. DT(, :=) in the middle of a chain would never print. Does that answer point 1, and I think 2 and 3 as well? Was it just a misunderstanding? Could you try latest dev and see if it works as you hope?

Also, in my mind currently, consistency with the classical data.table chaining, is not a goal of DT() per se. What is a goal is that DT() should behave however users want it to behave; the way that is most useful in most cases. Over the years, data.table has suffered in some areas because consistency-with-whatever has been seen as top priority, even when nobody agreed with the behavior that we were being consistent with. If there is no reason to be inconsistent, then sure, let's be consistent. For example, changing fread recently to be consistent with base with those 13 out of 100,000 numbers was no reason not to. But why we were consistent with min(empty) returning Inf beats me (we are now inconsistent with base on that, a good thing). The benefit of DT(, :=) printing is that I thought that would be least surprising to most users coming from dplyr would expect. If that's not the case then I misjudged that and we can change it. Absolutely. But not consistency for consistencies sake. Your 33 examples I added in #5113 were all based on expecting consistency with DT[...] which was correct, I fixed them all and they're all now consistent. There was no reason for inconsistency for any of those. For printing DT(,:=) however, I figured there is a good reason to make it print, because my guess was that users would expect it to print who know dplyr, and because needing to tack on |> DT() looked inconvenient to me. Although it would be consistent with tacking on a [] it's worse because it's 7 characters rather than just the two of a tacked on []. Surely dplyr users would think it's strange to need to do that?

@Kamgang-B
Copy link
Contributor

Hi @mattdowle.
I think I have a better understanding of the goal of the functional pipe now.

About my point 1:
Sorry I was not clear enough in my previous comment but every time I said always print, I actually meant printing when returning to the RELP; So excluding intermediary calls in the chain and considering only the end of the chain. That's why in the solution I proposed (options(datatable.print.end.pipe.always=TRUE/FALSE)) I added the word end. Also, I should have added some examples to make clear my explanation. Note that my opinion was somehow biased by the fact that I tended to think data.table way.
Here is an example of what I meant (assuming that modification/assignment happens by reference).
Consider the following chunks of codes (as part of an R script):

# chunk 1
Data |> 
  ...
  DT(...)

# chunk 2
Data |> 
  ...
  DT(...)

# chunk 3
Data |> 
  ...
  DT(...)

The idea was that when these chunks are sequentially dependent, then we would run all the chunks at the same time but would very likely be interested only in the output of the last chunk. And printing the output of the first two chunks would likely be less desired than adding an extra |> DT() to the last chunk.

About how users would like DT() to behave:
I think that when the dataset being used is a data.table, then DT should behave like when using the square brackets even for printing (not necessarilly for consistency but because I think it is better); that is, it should not print if the end of the chain is an assignment (:=). And if you think that printing when returning to the RELP is really needed even in this case, then why don't we make it optional (let the user decide through an option)? I personally find it a bit weird to print the output even when it is not needed. Note that this is what happens actually even if the dataset is a data.table (example: setDT(Data) |> DT(, count := .N, by=cyl) assigns the new variable and prints the output. So it would be nice to be able to turn this behavior off if there is need). If the data was a data.frame and assignment did not happen by reference, then reassigning the dataset would not print anything. So, always printing sounds reasonable if the dataset needs to be reassigned (that is if assignment by reference is not allowed).
I don't know if my explanation is clear but here is some example:

dt = setDT(copy(mtcars))
df = copy(mtcars)
# data.table
dt |>
  DT(, .(count = .N))         #(a) output is printed + no assignment to the original data
dt |>
  DT(, count := .N)           #(b) output is printed + assignment to the original data
# data.frame 
df |>
  DT(, count := .N)           #(c) output is printed + no assignment to the original data
df = df |>
  DT(, count := .N)           #(d) output is not printed + new data. What would be the dt equivalent of this?

I know that in (b), reassigning would also prevent printing but it would not be useful since the assignment is done by reference.
I wonder what would be the equivalent of (d) for a data.table?

Thinking in terms of new users:

a goal is that DT() should behave however users want it to behave; the way that is most useful in most cases.
The benefit of DT(, :=) printing is that I thought that would be least surprising to most users coming from dplyr would expect.

Altought I can't find a link to some examples now, I remember that I read several times some people comments about the fact that they find the reassignment as it is done in the (d) when working with dplyr annoying (although they still preferred dplyr's piping style over square brackets). For this category of people, assigning by reference when using a data.frame would likely be preferred (and me too I don't like the reassignment but think that it is safe for users that are not used to the data.table approach).
In general, users will likely expect DT() to be close to how piping works in dplyr but I wonder if it is not possible to satisfy different types of users by leaving them the possibility to choose whether to assign by reference or copy-on-modify when working with a data.frame by doing something like options(data.table.copy.on.modify=TRUE/FALSE) (TRUE by default). Is it a bad idea? In this case, both dplyr and data.table users could freely choose what they find better, and the default behavior would be less surprising to dplyr's (and base R) users. Further, users who turn this option on would/should be aware it does and its implication.

@mattdowle
Copy link
Member

Thanks. Still mulling everything over and rereading.
I don't see any problem with adding an option to control printing, good idea. But a global option like that to control copy-on-modify is unlikely to fly, iiuc. Because R's options are global. So a user could change the option for their own use, and inadvertently change the behavior inside packages that are using DT() too. We want the intention of the code-writer to be explicitly clear in the line of code they write, rather than depending on a global option that might change. I've written about that in the news item for removing datatable.nomatch which was the last option like that. The remaining options are display and format options which don't affect code logic or data differences.

@mattdowle mattdowle reopened this Sep 2, 2021
@Kamgang-B
Copy link
Contributor

Thanks @mattdowle
When I suggested the global option on copy-on-modify vs assignment by reference, that was my biggest fear; that is, changing a global option could really have undesired consequences on packages that are based on data.table. But at the same time, I thought that a user who does that is aware of the implications. But now I even realize that, even if a user is fully aware of what it means, it is still not safe because a package based on data.table could be tempted to change this option internally.

Another option could be to add an additional argument (inplace to DT()) to control how the assignment should be done when working with a data.frame. The advantage, in this case, is that it cannot affect other packages and the original data cannot be modified by reference unless this argument is explicitly set to TRUE.

I feel like if this part of the discussion about the assignment/modification is more related to #5129 than this issue.

@mattdowle
Copy link
Member

Yes good points. Just a quick thought to add is that since data.frame are not over-allocated, they have to be shallow copied anyway to be able to add a new column. There is just no way around that. So at best there'd be a difference in DT(,:=) between whether it is updating existing columns (that could be by reference as happens currently in dev) and adding new columns which couldn't be by reference. Whereas a data.table can be updated by reference consistently w.r.t. updating existing columns vs adding new columns. Caveat when the over allocation is used up.

@Kamgang-B
Copy link
Contributor

Hi @mattdowle

Hope you are fine.

This is more a feature request related to the use of DT and not an issue. I also know that this is not strictly related to printing but have placed it here because it is not really an issue and is relatively simple (I think) but if you think I should open an issue, I would happily do it.

While discussing with some colleagues about the new function DT for chaining, I noticed that some struggled to remember the arguments that are used in it; And if they were not previous users of data.table library, the level of difficulty is even more pronounced. I think that it is true (specially because data.table:::[.data.table in the dev version currently has 17 arguments and DT argument are x and ...). Using ... somehow supposes that the user is familiar with the data.table and already know what should be passed to ... and the corresponding order.
My suggestion is to make the data.table:::[.data.table arguments explicit in the new DT rather than using ....

  • It makes it easy to remember the arguments that are allowed and their default values (without looking at the help page)
  • One big benefit of it is the autocompletion of arguments names (that I sometimes missed in data.table:::[.data.table, particularly when typing .SDcols but also other arguments)

This feature can be created by adding all the formal arguments (with their default values) from data.table:::[.data.table function to DT at once by doing formals(DT) = formals(data.table:::[.data.table) after creating the function DT (See below).

library(data.table)

# current DT (hard to know the argument that are allowed, the order, ... specially for a new user)
DT  # formal arguments: x and ...
function (x, ...) 
{
  old = getOption("datatable.optimize")
  if (!is.data.table(x) && old > 2L) {
	options(datatable.optimize = 2L)
  }
  fun = match.call()
  fun[[1L]] = as.name("[.data.table")
  ans = eval(fun, envir = parent.frame(), enclos = parent.frame())
  options(datatable.optimize = old)
  .global$print = ""
  ans
}


# DT after my suggestion. formal arguments: x,i,j,by, etc.
formals(DT) = formals(data.table:::`[.data.table`)
DT
function (x, i, j, by, keyby, with = TRUE, nomatch = NA, mult = "all", roll = FALSE, rollends = if (roll == "nearest") c(TRUE, TRUE) else if (roll >= 0) c(FALSE, TRUE) else c(TRUE, FALSE), which = FALSE, .SDcols, verbose = getOption("datatable.verbose"), allow.cartesian = getOption("datatable.allow.cartesian"), drop = NULL, on = NULL, env = NULL) 
{
  old = getOption("datatable.optimize")
  if (!is.data.table(x) && old > 2L) {
	options(datatable.optimize = 2L)
  }
  fun = match.call()
  fun[[1L]] = as.name("[.data.table")
  ans = eval(fun, envir = parent.frame(), enclos = parent.frame())
  options(datatable.optimize = old)
  .global$print = ""
  ans
}

@jangorecki
Copy link
Member

jangorecki commented Sep 14, 2022

Code from first post is now printing as expected, but it also gives a warning, likely unexpected

copy(test_df) |>
  DT(, double_x := x * 2) |>
  DT()
#       x     y double_x
#   <int> <int>    <num>
#1:     1     1        2
#2:     2     2        4
#3:     3     3        6
#Warning message:
#i and j are both missing so ignoring the other arguments. This warning will be upgraded to error in future. 

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki jangorecki removed this from the 1.15.0 milestone Nov 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants