Remove seemingly equivalent ways of producing subsets of DataFrames #323

johnmyleswhite · 2013-07-14T12:46:32Z

We have too many ways of producing things that seem like subsets of a DataFrame. After I added filter, we now have four ways that seem superficially equivalent, but reflect two different algorithms:

julia> df[:(A % 2 .== 0 && B % 4 .== 0), :]
ERROR: type: non-boolean (DataArray{Bool,1}) used in boolean context
 in anonymous at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1457
 in with at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1458
 in getindex at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:428

julia> sub(df, :(A % 2 .== 0 && B % 4 .== 0))
ERROR: type: non-boolean (DataArray{Bool,1}) used in boolean context
 in anonymous at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1457
 in with at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1458
 in sub at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1014

julia> subset(df, :(A % 2 .== 0 && B % 4 .== 0))
ERROR: type: non-boolean (DataArray{Bool,1}) used in boolean context
 in anonymous at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1457
 in with at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1458
 in sub at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1014

julia> filter(:(A % 2 .== 0 && B % 4 .== 0), df)
ERROR: type: non-boolean (DataArray{Bool,1}) used in boolean context
 in anonymous at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1457
 in with at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1458
 in filter at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1023

I think we should:

Return SubDataFrame objects everywhere when accessing all columns of a DataFrame, which is what sub does, but direct indexing does not do.
Rename sub to filter after flipping the argument order. We will then remove sub and subset completely.

Having multiple ways of doing the same thing troubles me in general, but seems particularly problematic when the different ways are not strictly aliases of one another. Over time, end-users discover that, even though all subset operations are equal in outcome, some are more equal than others.

The text was updated successfully, but these errors were encountered:

ViralBShah · 2013-07-14T13:14:45Z

+1 on both proposals.

tshort · 2013-07-14T19:38:34Z

I like the idea. I don't really like the name filter because I think of
that as a signal processing function, but I'll get used to it.

On Sun, Jul 14, 2013 at 9:14 AM, Viral B. Shah notifications@github.comwrote:

+1 on both proposals.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/323#issuecomment-20936263
.

StefanKarpinski · 2013-07-14T19:46:01Z

One possibility is renaming filter in Base to select and doing the same here. I fairly frequently find myself accidentally thinking that filter(f,v) will filter out elements of v as in reject the ones for which the function is true. With select(f,v) there's no danger of such confusion. We could potentially also have a reject function that removes elements for which the predicate is true. Although that seems redundant, it's handy when you already have a named predicate that checks exactly the condition that you want to reject elements based on – writing reject(isunwanted,v) is much simpler and easier to read than select(x->!isunwanted(x),v).

johnmyleswhite · 2013-07-14T20:42:24Z

I do like the name select lot for its SQL parallelism. Instead of having select and reject, I'd prefer having a negate = true keyword argument to select like the rev argument to sort.

tshort · 2013-07-14T21:40:49Z

+1 for select.

johnmyleswhite · 2013-07-14T22:13:17Z

Here's the semantics I'm thinking to implement.

Indexing behaves as:

df[SingleRowIndex, SingleColumIndex] => Scalar
df[SingleRowIndex, MultiColumIndex] => SubDataFrame
df[MultiRowIndex, SingleColumIndex] => DataVector (or SubDataVector)
df[MultiRowIndex, MultiColumIndex] => SubDataFrame

Select always returns a SubDataFrame because it is equivalent to df[RowIndex, All Columns].

tshort · 2013-07-14T22:41:29Z

I like your approach. It'll certainly stress test SubDataFrame support with things like cbind(df1[idx,3:4], df2[:,1], df3[:,1:3]).

If I want a copy, is the best way with deepcopy?

johnmyleswhite · 2013-07-14T22:42:38Z

I think (hope) copy should be sufficient. We were recently debating whether deepcopy should even exist.

ViralBShah · 2013-07-15T03:24:10Z

select is already taken in base to pick the k-th largest element.

johnmyleswhite · 2013-07-15T03:41:09Z

I think it's alright if we generalize select as select(collection, predicate), where the current select uses a shorthand for the maximum and minimum predicates.

StefanKarpinski · 2013-07-15T04:25:28Z

Ah, I forgot about that. I'm not super comfortable with mixing those meanings. It's not clear to me that they're really the same operation or that they won't have signatures that clash.

johnmyleswhite · 2013-07-15T04:27:18Z

I personally think it would make more sense to rename the k-max function to kmax and use select for the SQL-like function.

StefanKarpinski · 2013-07-15T04:34:25Z

kmax doesn't describe what it does – it finds a contiguous range of elements at the indices in a collection if the collection were fully sorted without fully sorting it. That's far more general and useful and it's commonly known as select, unfortunately.

ViralBShah · 2013-07-15T04:36:23Z

select is something that only computer scientists will know about. To everyone else, I believe it is a non-obvious name, where the name does not describe what it actually does. Perhaps there are other better names we can think of.

I would also prefer renaming filter in base and leave the name for signal processing, but that is a different issue altogether.

johnmyleswhite · 2013-07-15T13:37:57Z

Let's keep debating this. For me, the gains from adopting standard SQL terminology are very large: we just switched to join and using select would be a major win. It would be easy to add delete! and update! as well, which take expressions and modify rows that satisfy the expression.

StefanKarpinski · 2013-07-15T15:42:12Z

I'm sympathetic to that. Let's see if we can come up with a better name for the algorithmic select operation.

ViralBShah · 2013-07-15T17:05:09Z

@johnmyleswhite I meant to say that select as in the select(k) terminology is something only computer scientists would know. I thus support the choice of using select in the database sense.

johnmyleswhite · 2013-07-15T17:05:42Z

That's what I understood, Viral.

johnmyleswhite · 2013-07-15T17:06:31Z

I still think the two uses are similar: select(k) is really equivalent to select(source, sortedindex in indices).

ViralBShah · 2013-07-15T17:09:32Z

Ok. I missed that Base.select currently can also take a range as the second index.

StefanKarpinski · 2013-07-15T17:10:01Z

This is going to make deprecation a nightmare since if we rename filter(f,v) to select(f,v) it's going to clash with the current deprecation that redirects to select(v,lt=f). So this might just have to be a straight-up breaking change. Of course, those usages expect f to have different arity, so at least the code would break instead of being silently wrong.

randyzwitch · 2013-07-16T00:42:08Z

@johnmyleswhite If the decision is moving towards SQL naming conventions, is there really a need for a SubDataFrame object? From what it sounds like, wouldn't that be similar to a SQL view...

My confusion/idea stems from your idea above where you are saying 'some are more equal than others'...I feel like a beginning user (especially an R user) might feel uncomfortable and keep trying to cast a SubDataFrame to a DataFrame (though they seem like they are functionally equivalent by the description above).

johnmyleswhite · 2013-07-16T00:47:27Z

SubDataFrame is literally a view on data, not a copy. It's very important for performance. So I think we need it (and we already have it and use it quite a lot).

I'm sure that some R users will be confused. There's really no way to avoid confusion as people learn about pass-by-reference semantics. But I hope that we can design semantics that are simple, if unfamiliar.

My "some are more equal than others" comment refers to the fact that operations that produce equivalent results in R often have very different performance characteristics: the use of with, for example, is basically always a mistake in R. Similarly, you should never use column names in any R code that needs to run efficiently, because it's an O(n) lookup, whereas numeric indexing is O(1). You're told that column names and column indices are equivalent, but in fact one is strictly inferior to the other.

randyzwitch · 2013-07-16T01:02:51Z

Yeah, that makes sense now. At first I was thinking that it was just the output that made a table different from a view in SQL, but from an underlying database structure, a table is a 'data object' and a view is a 'code object'.

At the risk of being annoying, does SubDataFrame make sense as DataView? So DataFrame and DataView analogous to table/view concepts in SQL.

johnmyleswhite · 2013-07-16T01:05:25Z

I have to think a bit more about how views are implemented in SQL to decide whether they're the same. I'm inclined to think they're not: SQL views update when the underlying database changes, whereas a SubDataFrame will continue to reuse the same indices (which may be wrong) if the underlying database changes.

johnmyleswhite · 2013-07-16T15:35:14Z

Reading about views in SQL more, I believe I was right: a view is just a stored SQL query, whereas a SubDataFrame is effectively a stored query result.

@StefanKarpinski: changing to select is a big change, so I'm happy to take it slow.

quinnj · 2017-09-07T03:54:34Z

I don't think this is an issue anymore; DataFrames doesn't have subset or filter anymore and view or bit/bool array indexing seem to be fairly standard and non-controversial.

nalimilan · 2017-09-09T18:37:32Z

Funny timing, as we just deprecated select in Base (JuliaLang/julia#23051).

quinnj closed this as completed Sep 7, 2017

nalimilan mentioned this issue Dec 23, 2017

Add filter() and filter!() methods #1330

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove seemingly equivalent ways of producing subsets of DataFrames #323

Remove seemingly equivalent ways of producing subsets of DataFrames #323

johnmyleswhite commented Jul 14, 2013

ViralBShah commented Jul 14, 2013

tshort commented Jul 14, 2013

StefanKarpinski commented Jul 14, 2013

johnmyleswhite commented Jul 14, 2013

tshort commented Jul 14, 2013

johnmyleswhite commented Jul 14, 2013

tshort commented Jul 14, 2013

johnmyleswhite commented Jul 14, 2013

ViralBShah commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

ViralBShah commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

ViralBShah commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

ViralBShah commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

randyzwitch commented Jul 16, 2013

johnmyleswhite commented Jul 16, 2013

randyzwitch commented Jul 16, 2013

johnmyleswhite commented Jul 16, 2013

johnmyleswhite commented Jul 16, 2013

quinnj commented Sep 7, 2017

nalimilan commented Sep 9, 2017

Remove seemingly equivalent ways of producing subsets of DataFrames #323

Remove seemingly equivalent ways of producing subsets of DataFrames #323

Comments

johnmyleswhite commented Jul 14, 2013

ViralBShah commented Jul 14, 2013

tshort commented Jul 14, 2013

StefanKarpinski commented Jul 14, 2013

johnmyleswhite commented Jul 14, 2013

tshort commented Jul 14, 2013

johnmyleswhite commented Jul 14, 2013

tshort commented Jul 14, 2013

johnmyleswhite commented Jul 14, 2013

ViralBShah commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

ViralBShah commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

ViralBShah commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

ViralBShah commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

randyzwitch commented Jul 16, 2013

johnmyleswhite commented Jul 16, 2013

randyzwitch commented Jul 16, 2013

johnmyleswhite commented Jul 16, 2013

johnmyleswhite commented Jul 16, 2013

quinnj commented Sep 7, 2017

nalimilan commented Sep 9, 2017