first on empty DT should return empty DT #3858

Closed
jangorecki opened this issue Sep 11, 2019 · 6 comments · Fixed by #3859

@jangorecki
Member

dt = data.table(a=1,b=2)[0,]
first(dt)
#       a     b
#1:    NA    NA
head(dt, 1)
#Empty data.table (0 rows and 2 cols): a,b
@st-pasha
Contributor

I was recently struggling with a similar question in datatable, regarding applying reduce operators to a 0-row Frame. Conceptually there could be 2 approaches for grouping such a frame:

  • it creates 1 group of 0 rows;
  • it creates 0 groups of any rows.

Curious to hear your reasoning as to which of them is better.
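
The two interpretations can be contrasted in base R. This is only a sketch, assuming that split() can stand in for a grouping step; it does not show how data.table or datatable group internally, and the variable names are illustrative.

```r
# Zero-row input: an empty value column and an empty grouping column.
val <- numeric(0)
grp <- character(0)

# Interpretation 1: one group of 0 rows (a grand total without grouping).
# The reducer is applied once and returns a single value.
one_group <- sum(val)

# Interpretation 2: 0 groups. split() finds no distinct keys, so the
# reducer is never called and the result is empty.
groups <- split(val, grp)
zero_groups <- vapply(groups, sum, numeric(1))

length(one_group)    # 1
length(zero_groups)  # 0
```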

@jangorecki
Member Author

jangorecki commented Sep 11, 2019

Definitely 0 groups of any rows. 1 group of 0 rows makes sense for a grand-total summary, where we apply a reduce function without any actual grouping. A related subject is grouping sets: rollup, cube.

d = data.table(grp=character(), val=numeric())
groupingsets(d, by="grp", sets=list(character()), j=.(sum=sum(val), mean=mean(val), len=length(val)))
#      grp   sum  mean   len
#   <char> <num> <num> <int>
#1:   <NA>     0   NaN     0
d = data.table(grp="a", val=1)
groupingsets(d, by="grp", sets=list(character()), j=.(sum=sum(val), mean=mean(val), len=length(val)))
#      grp   sum  mean   len
#   <char> <num> <num> <int>
#1:   <NA>     1     1     1

sets=list(character()) denotes grand total aggregation only

@st-pasha
Contributor

So, is there a difference between grouping by an empty vector (such as in your example with grouping sets), and having no by= clause at all in DT[i,j]? For example, if DT=data.table(A=numeric()), then what should be returned from DT[, .(first(A), sum(A), min(A))]?

mattdowle pushed a commit that referenced this issue Sep 12, 2019
@jangorecki
Member Author

jangorecki commented Sep 12, 2019

@st-pasha Generally the same as if we ran it outside of data.table:

> A=numeric()
> list(head(A,1L), sum(A), min(A))
[[1]]
numeric(0)

[[2]]
[1] 0

[[3]]
[1] Inf

Warning message:
In min(A) : no non-missing arguments to min; returning Inf

An extra warning occurs inside data.table because the results have different lengths.
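
The length mismatch can be checked directly in base R. This sketch only inspects the lengths of the three results; how data.table then reconciles them is not shown here.

```r
A <- numeric(0)

# min() on an empty vector warns and returns Inf; the warning is suppressed
# here so that only the result lengths are inspected.
res <- suppressWarnings(list(head(A, 1L), sum(A), min(A)))

# head() keeps the vector empty, while sum() and min() each reduce to one value.
lengths(res)  # 0 1 1
```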

@st-pasha
Contributor

@jangorecki Then why was DT[,first(A)] mapped to head(A,1), and not to first(A) (which seems most straightforward), or A[1] (as alluded to in the documentation of first)?

> A = numeric()
> head(A,1L)
numeric(0)
> first(A)
[1] NA
> A[1]
[1] NA

Or is it because first() is not really considered a reduce operation? In Python datatable we classify first as a reducer, and this may be where the discrepancy comes from.

@jangorecki
Member Author

jangorecki commented Sep 13, 2019

There is no first/last in base R, so first/last needs to wrap either

  • head(x, n=1) and tail(x, n=1) or
  • x[1] and x[max(length(x), 1)] (yes, extra complexity needed here)

The latter will always expand to a 1-element vector. We decided to wrap head/tail.
xts, which implemented first/last a long time ago, seems to be affected by the same inconsistency: joshuaulrich/xts#309
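
The two wrapping options can be compared with a small base R sketch. The helper names first_head, last_tail, first_index and last_index are hypothetical and are not data.table functions; they only mirror the two bullet points above.

```r
# Option 1: wrap head()/tail(); an empty input stays empty.
first_head <- function(x) head(x, n = 1L)
last_tail  <- function(x) tail(x, n = 1L)

# Option 2: wrap indexing; an empty input expands to a single NA.
# max(length(x), 1L) is needed because x[0] would return an empty vector.
first_index <- function(x) x[1L]
last_index  <- function(x) x[max(length(x), 1L)]

A <- numeric(0)
length(first_head(A))   # 0
length(last_tail(A))    # 0
first_index(A)          # NA
last_index(A)           # NA

B <- c(10, 20, 30)
first_head(B)           # 10
last_index(B)           # 30
```

On non-empty input both options agree; they differ only in whether an empty vector stays empty or becomes a single NA.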
