Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove seemingly equivalent ways of producing subsets of DataFrames #323

Closed
johnmyleswhite opened this issue Jul 14, 2013 · 28 comments
Closed

Comments

@johnmyleswhite
Copy link
Contributor

We have too many ways of producing things that seem like subsets of a DataFrame. After I added filter, we now have four ways that seem superficially equivalent, but reflect two different algorithms:

julia> df[:(A % 2 .== 0 && B % 4 .== 0), :]
ERROR: type: non-boolean (DataArray{Bool,1}) used in boolean context
 in anonymous at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1457
 in with at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1458
 in getindex at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:428

julia> sub(df, :(A % 2 .== 0 && B % 4 .== 0))
ERROR: type: non-boolean (DataArray{Bool,1}) used in boolean context
 in anonymous at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1457
 in with at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1458
 in sub at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1014

julia> subset(df, :(A % 2 .== 0 && B % 4 .== 0))
ERROR: type: non-boolean (DataArray{Bool,1}) used in boolean context
 in anonymous at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1457
 in with at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1458
 in sub at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1014

julia> filter(:(A % 2 .== 0 && B % 4 .== 0), df)
ERROR: type: non-boolean (DataArray{Bool,1}) used in boolean context
 in anonymous at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1457
 in with at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1458
 in filter at /Users/johnmyleswhite/.julia/DataFrames/src/dataframe.jl:1023

I think we should:

  • Return SubDataFrame objects everywhere when accessing all columns of a DataFrame, which is what sub does, but direct indexing does not do.
  • Rename sub to filter after flipping the argument order. We will then remove sub and subset completely.

Having multiple ways of doing the same thing troubles me in general, but seems particularly problematic when the different ways are not strictly aliases of one another. Over time, end-users discover that, even though all subset operations are equal in outcome, some are more equal than others.

@ViralBShah
Copy link
Contributor

+1 on both proposals.

@tshort
Copy link
Contributor

tshort commented Jul 14, 2013

I like the idea. I don't really like the name filter because I think of
that as a signal processing function, but I'll get used to it.

On Sun, Jul 14, 2013 at 9:14 AM, Viral B. Shah notifications@github.comwrote:

+1 on both proposals.


Reply to this email directly or view it on GitHubhttps://github.com//issues/323#issuecomment-20936263
.

@StefanKarpinski
Copy link
Member

One possibility is renaming filter in Base to select and doing the same here. I fairly frequently find myself accidentally thinking that filter(f,v) will filter out elements of v as in reject the ones for which the function is true. With select(f,v) there's no danger of such confusion. We could potentially also have a reject function that removes elements for which the predicate is true. Although that seems redundant, it's handy when you already have a named predicate that checks exactly the condition that you want to reject elements based on – writing reject(isunwanted,v) is much simpler and easier to read than select(x->!isunwanted(x),v).

@johnmyleswhite
Copy link
Contributor Author

I do like the name select lot for its SQL parallelism. Instead of having select and reject, I'd prefer having a negate = true keyword argument to select like the rev argument to sort.

@tshort
Copy link
Contributor

tshort commented Jul 14, 2013

+1 for select.

@johnmyleswhite
Copy link
Contributor Author

Here's the semantics I'm thinking to implement.

Indexing behaves as:

  • df[SingleRowIndex, SingleColumIndex] => Scalar
  • df[SingleRowIndex, MultiColumIndex] => SubDataFrame
  • df[MultiRowIndex, SingleColumIndex] => DataVector (or SubDataVector)
  • df[MultiRowIndex, MultiColumIndex] => SubDataFrame

Select always returns a SubDataFrame because it is equivalent to df[RowIndex, All Columns].

@tshort
Copy link
Contributor

tshort commented Jul 14, 2013

I like your approach. It'll certainly stress test SubDataFrame support with things like cbind(df1[idx,3:4], df2[:,1], df3[:,1:3]).

If I want a copy, is the best way with deepcopy?

@johnmyleswhite
Copy link
Contributor Author

I think (hope) copy should be sufficient. We were recently debating whether deepcopy should even exist.

@ViralBShah
Copy link
Contributor

select is already taken in base to pick the k-th largest element.

@johnmyleswhite
Copy link
Contributor Author

I think it's alright if we generalize select as select(collection, predicate), where the current select uses a shorthand for the maximum and minimum predicates.

@StefanKarpinski
Copy link
Member

Ah, I forgot about that. I'm not super comfortable with mixing those meanings. It's not clear to me that they're really the same operation or that they won't have signatures that clash.

@johnmyleswhite
Copy link
Contributor Author

I personally think it would make more sense to rename the k-max function to kmax and use select for the SQL-like function.

@StefanKarpinski
Copy link
Member

kmax doesn't describe what it does – it finds a contiguous range of elements at the indices in a collection if the collection were fully sorted without fully sorting it. That's far more general and useful and it's commonly known as select, unfortunately.

@ViralBShah
Copy link
Contributor

select is something that only computer scientists will know about. To everyone else, I believe it is a non-obvious name, where the name does not describe what it actually does. Perhaps there are other better names we can think of.

I would also prefer renaming filter in base and leave the name for signal processing, but that is a different issue altogether.

@johnmyleswhite
Copy link
Contributor Author

Let's keep debating this. For me, the gains from adopting standard SQL terminology are very large: we just switched to join and using select would be a major win. It would be easy to add delete! and update! as well, which take expressions and modify rows that satisfy the expression.

@StefanKarpinski
Copy link
Member

I'm sympathetic to that. Let's see if we can come up with a better name for the algorithmic select operation.

@ViralBShah
Copy link
Contributor

@johnmyleswhite I meant to say that select as in the select(k) terminology is something only computer scientists would know. I thus support the choice of using select in the database sense.

@johnmyleswhite
Copy link
Contributor Author

That's what I understood, Viral.

@johnmyleswhite
Copy link
Contributor Author

I still think the two uses are similar: select(k) is really equivalent to select(source, sortedindex in indices).

@ViralBShah
Copy link
Contributor

Ok. I missed that Base.select currently can also take a range as the second index.

@StefanKarpinski
Copy link
Member

This is going to make deprecation a nightmare since if we rename filter(f,v) to select(f,v) it's going to clash with the current deprecation that redirects to select(v,lt=f). So this might just have to be a straight-up breaking change. Of course, those usages expect f to have different arity, so at least the code would break instead of being silently wrong.

@randyzwitch
Copy link
Contributor

@johnmyleswhite If the decision is moving towards SQL naming conventions, is there really a need for a SubDataFrame object? From what it sounds like, wouldn't that be similar to a SQL view...

My confusion/idea stems from your idea above where you are saying 'some are more equal than others'...I feel like a beginning user (especially an R user) might feel uncomfortable and keep trying to cast a SubDataFrame to a DataFrame (though they seem like they are functionally equivalent by the description above).

@johnmyleswhite
Copy link
Contributor Author

SubDataFrame is literally a view on data, not a copy. It's very important for performance. So I think we need it (and we already have it and use it quite a lot).

I'm sure that some R users will be confused. There's really no way to avoid confusion as people learn about pass-by-reference semantics. But I hope that we can design semantics that are simple, if unfamiliar.

My "some are more equal than others" comment refers to the fact that operations that produce equivalent results in R often have very different performance characteristics: the use of with, for example, is basically always a mistake in R. Similarly, you should never use column names in any R code that needs to run efficiently, because it's an O(n) lookup, whereas numeric indexing is O(1). You're told that column names and column indices are equivalent, but in fact one is strictly inferior to the other.

@randyzwitch
Copy link
Contributor

Yeah, that makes sense now. At first I was thinking that it was just the output that made a table different from a view in SQL, but from an underlying database structure, a table is a 'data object' and a view is a 'code object'.

At the risk of being annoying, does SubDataFrame make sense as DataView? So DataFrame and DataView analogous to table/view concepts in SQL.

@johnmyleswhite
Copy link
Contributor Author

I have to think a bit more about how views are implemented in SQL to decide whether they're the same. I'm inclined to think they're not: SQL views update when the underlying database changes, whereas a SubDataFrame will continue to reuse the same indices (which may be wrong) if the underlying database changes.

@johnmyleswhite
Copy link
Contributor Author

Reading about views in SQL more, I believe I was right: a view is just a stored SQL query, whereas a SubDataFrame is effectively a stored query result.

@StefanKarpinski: changing to select is a big change, so I'm happy to take it slow.

@quinnj
Copy link
Member

quinnj commented Sep 7, 2017

I don't think this is an issue anymore; DataFrames doesn't have subset or filter anymore and view or bit/bool array indexing seem to be fairly standard and non-controversial.

@quinnj quinnj closed this as completed Sep 7, 2017
@nalimilan
Copy link
Member

Funny timing, as we just deprecated select in Base (JuliaLang/julia#23051).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants