-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: clean up getindex: part 2 #1534
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Yes, this sounds like a step in the right direction. My main interrogation is about the nature of SubDataFrame
(see comment inline).
src/dataframe/dataframe.jl
Outdated
new_columns = Any[dv[row_inds] for dv in columns(df)] | ||
return DataFrame(new_columns, copy(index(df))) | ||
end | ||
|
||
# df[:, :] => DataFrame | ||
function Base.getindex(df::DataFrame, ::Colon, ::Colon) | ||
Base.depwarn("indexing with colon as row will create a copy of rows in the future", :getindex) | ||
Base.depwarn("indexing with colon as row will create a copy in the future" * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you mean a copy of column vectors?
src/subdataframe/subdataframe.jl
Outdated
function SubDataFrame(parent::DataFrame, row::Integer) | ||
Base.depwarn("Selecting a single row from a `DataFrame` will return `DataFrameRow` in future.", :getindex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SubDataFrame
src/subdataframe/subdataframe.jl
Outdated
return SubDataFrame(parent, [Int(row)]) | ||
end | ||
|
||
function SubDataFrame(parent::DataFrame, rows::AbstractVector{<:Integer}) | ||
if any(isa.(rows, Bool)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any(x -> x isa Bool, rows)
would avoid an allocation.
Below and elsewhere, better use backticks consistently around types.
src/subdataframe/subdataframe.jl
Outdated
return SubDataFrame(parent, convert(Vector{Int}, rows)) | ||
end | ||
|
||
function SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}) | ||
if length(rows) != nrow(parent) | ||
throw(ArgumentError("invalid length of Vector{Bool} row index")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to print "(got $m, expected $n)"
. Also, the object is not necessarily a Vector{Bool}
.
src/subdataframe/subdataframe.jl
Outdated
any(ismissing, rowinds) && throw(MissingException("missing values are not allowed in indices")) | ||
return SubDataFrame(adf, convert(Vector{Missings.T(T)}, rowinds)) | ||
function Base.view(adf::AbstractDataFrame, rowinds) | ||
Base.depwarn("view(adf, x) will create a `SubDataFrame` containing columns `x` and all rows in the future.", :view) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better give the replacement syntax. Same for other deprecations.
function Base.view(adf::AbstractDataFrame, rowinds::Any) | ||
return SubDataFrame(adf, rowinds) | ||
function Base.view(adf::AbstractDataFrame, rowinds, colinds) | ||
return SubDataFrame(adf[colinds], rowinds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I realize looking at this is that SubDataFrame
is a view on rows of a data frame, but not on columns. So while view(adf, 1, :)
will reflect changes due to e.g. rename!(adf, col => newcol)
, view(df, 1, cols)
won't (since adf[colinds]
creates a new parent data frame). The difference is subtle: we get a view of column vectors, but not a view of the parent data frame.
I wonder whether it would be more consistent to add a cols
fields to SubDataFrame
so that it's a real view on both rows and columns. That also depends on whether it would have a significant performance hit or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about it when I asked the same about DataFrameRow
earlier. I came to the conclusion that it is impossible to consistently handle columns. The reason is that we allow dual column indexing: by their numbers and by their names. Later we can rename columns or move them breaking one type of indexing or the other and if we break number-name mapping in the parent it is not clear what we should do in the view.
Finally SubDataFrame
in general is supposed to be only used when original DataFrame
is not mutated in terms of rows/columns (e.g. removing rows in source DataFrame
can break SubDataFrame
) - this is something we should clearly indicate in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about it when I asked the same about
DataFrameRow
earlier. I came to the conclusion that it is impossible to consistently handle columns. The reason is that we allow dual column indexing: by their numbers and by their names. Later we can rename columns or move them breaking one type of indexing or the other and if we break number-name mapping in the parent it is not clear what we should do in the view.
I'm not sure that's really an issue. Anyway as you note one isn't supposed to change the structure of the parent. What bothers me more is the inconsistency between passing :
for columns and passing a vector. Though in practice it shouldn't really matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To sum up: we should leave not creating a copy when passing :
for performance reasons. I will add a note in the docs when parent
is copied and when not.
In fact - this is consistent with the thinking that DataFrame
is a collection of rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To sum up: we should leave not creating a copy when passing
:
for performance reasons.
Yes, I agree, what I meant is that if SubDataFrame
also stored the indices of a subset of columns we wouldn't need to make a copy when not passing :
either. I'm not saying we should definitely do this, even less than it should be done in this PR, but maybe something to keep in mind.
src/subdataframe/subdataframe.jl
Outdated
return SubDataFrame(adf[[colind]], rowinds) | ||
end | ||
|
||
function Base.view(adf::AbstractDataFrame, rowinds, colind::Bool) | ||
throw(ArgumentError("invalid column index of type Bool")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also print the value?
src/subdataframe/subdataframe.jl
Outdated
@@ -132,11 +148,13 @@ nrow(sdf::SubDataFrame) = ncol(sdf) > 0 ? length(rows(sdf))::Int : 0 | |||
ncol(sdf::SubDataFrame) = length(index(sdf)) | |||
|
|||
function Base.getindex(sdf::SubDataFrame, colinds) | |||
return parent(sdf)[rows(sdf), colinds] | |||
return view(parent(sdf), rows(sdf), colinds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does't this need a deprecation too?
685a4ba
to
fc5b9f3
Compare
Before implementing, based on #1533 and trying to be consistent with Base behavior I have added the documentation of target This is complex and subtly breaking in several places (I would recommend that during a review each proposal is verified against current behavior - at least this is what I had to do and sometimes were surprised by the result). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's really an impressive list!
It should probably go to a separate page as it's really distinct from docstrings. It would also be great if you could summarize in a few sentences the main rules that we follow (in particular when does copying happens).
docs/src/lib/functions.md
Outdated
* `df[rows, cols]` -> a `DataFrame` containing columns `cols` and `df[rows, col]` as a vector in each `col` in `cols`. | ||
* `@view df[col]` -> an alias of a vector contained in column `col`; | ||
* `@view df[cols]` -> a `SubDataFrame` with parent `df` if `cols` is a colon and `df[cols]` otherwise; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the special treatment of colon could be problematic, and we should also use df[:]
as the parent in that case? Same in similar cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retaining df
will be faster and I guess it will be important for groupby
when you have a lot of small groups.
This is a case that currently was not on the table as @view df[x]
treats x
as rows, but it is going to be changed to ensure consistency.
As long as df
is not mutated this should not matter in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"As long as df
is not mutated" is what bothers me. :-)
Why do you expect view(df, :)
to be used in groupby
? Presumably if the overhead of calling df[cols]
is OK when cols != :
, it's also OK when cols == :
? Or is the latter more common than the former?
docs/src/lib/functions.md
Outdated
`SubDataFrame`: | ||
* `sdf[col]` -> a view of vector contained in column `col` of `parent(sdf)` with `DataFrames.rows(sdf)` as selector; | ||
* `sdf[cols]` -> a freshly allocated `DataFrame` containing copies of vectors contained in columns `cols` of `sdf`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems inconsistent with sdf[col]
and df[cols]
, which don't make copies. Better return a SubDataFrame
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not sure what is best here myself.
In Base such an operation destroys a view. Also some code in DataFrames.jl relies on the fact that sdf[cols]
returns a DataFrame
if I remember correctly.
I will leave as is in the next iteration, but let us discuss it further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in Base x[inds]
always makes a copy, so it's consistent that it's also the case for views. For DataFrame
, df[cols]
doesn't copy the column vectors, so it would be consistent to do that for SubDataFrame
. The goal is that for consistency if you mutate the result, the original AbstractDataFrame
is changed in both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have rewritten it to:
a
SubDataFrame
, with parentparent(sdf)
ifcols
is a colon andparent(sdf)[cols]
otherwise
docs/src/lib/functions.md
Outdated
* `@view df[cols]` -> a `SubDataFrame` with parent `df` if `cols` is a colon and `df[cols]` otherwise; | ||
* `@view df[row, col]` -> translates to `view(df[col], row)` (a `0`-dimensional view into `df[col]`); | ||
* `@view df[row, cols]` -> a `DataFrameRow` with parent `df` if `cols` is a colon and `df[cols]` otherwise; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is going to require creating a new DataFrame
for each row? That sounds really bad. Maybe we could change DataFrameRow
to also take a subset of rows? That's really a corner case though, so it could be fixed later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the same situation with SubDataFrame
as current SubDataFrame
and DaraFrameRow
were designed with the assumption that only row-indexing is performed.
I would leave it for a separate PR (we will have to handle colon and subset of columns separately and support a mapping from column numbers of a view to column numbers of a parent in the latter). In the worst case user should first create a parent that has correct columns and then run a view
on all of them. I will open an issue for this to keep track of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed the comments. The only problematic thing is what |
docs/src/lib/indexing.md
Outdated
* `df[cols]` -> a freshly allocated `DataFrame` containing the vectors contained in columns `cols`; | ||
* `df[row, col]` -> the value contained in row `row` of column `col`, the same as `df[col][row]`; | ||
* `df[row, cols]` -> a `NamedTuple` containing data from row `row` in columns `cols`; | ||
* `df[rows, col]` -> a copy of the vector `df[col]` with only the entries corresponding to `rows` selected, the same as df[col][rows]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing backticks.
docs/src/lib/functions.md
Outdated
* `df[rows, cols]` -> a `DataFrame` containing columns `cols` and `df[rows, col]` as a vector in each `col` in `cols`. | ||
* `@view df[col]` -> an alias of a vector contained in column `col`; | ||
* `@view df[cols]` -> a `SubDataFrame` with parent `df` if `cols` is a colon and `df[cols]` otherwise; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"As long as df
is not mutated" is what bothers me. :-)
Why do you expect view(df, :)
to be used in groupby
? Presumably if the overhead of calling df[cols]
is OK when cols != :
, it's also OK when cols == :
? Or is the latter more common than the former?
Currently we have:
which essentially will have to be rewritten as So, in short, because we rewrite Also note that currently |
OK, I see. The best fix is really to make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I had forgotten about my other comments regarding the implementation.
The good thing is that we will go through a deprecation period when this would not matter much 😄, and in the mean time I will try to make #1557 efficient. |
Actually there is a lot to improve in the implementation, apart from the earlier comments, once we have decided on the functionality. Anyway this PR will be mostly deprecations given what we have discussed. |
dbcab70
to
7f25eb9
Compare
src/dataframe/dataframe.jl
Outdated
# df[SingleColumnIndex] => AbstractDataVector | ||
const ColumnIndex = Union{Integer, Symbol} | ||
|
||
# df[SingleColumnIndex] => AbstractVector, alias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be even more precise than "alias": the vector itself is returned, not a view of it.
src/dataframe/dataframe.jl
Outdated
|
||
# df[SingleRowIndex, :] => DataFrame | ||
function Base.getindex(df::DataFrame, row_ind::Integer, ::Colon) | ||
Base.depwarn("Selecting a single row from a `DataFrame` will return a `NamedTupe` in future.", :getindex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Tuple". Also "in the future".
src/dataframe/dataframe.jl
Outdated
return DataFrame(new_columns, Index(_names(df)[selected_columns])) | ||
end | ||
|
||
# df[MultiRowIndex, SingleColumnIndex] => AbstractVector | ||
# df[SingleRowIndex, :] => DataFrame | ||
function Base.getindex(df::DataFrame, row_ind::Bool, ::Colon) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, wouldn't it be better to do row_ind isa Bool
inside the method below? Same below in a similar case.
src/dataframe/dataframe.jl
Outdated
# df[SingleRowIndex, :] => DataFrame | ||
function Base.getindex(df::DataFrame, row_ind::Integer, ::Colon) | ||
Base.depwarn("Selecting a single row from a `DataFrame` will return a `NamedTupe` in future.", :getindex) | ||
new_columns = AbstractVector[[dv[[row_ind]]] for dv in columns(df)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't there too many brackets here?
src/other/index.jl
Outdated
if length(idxs) != length(idx) | ||
throw(ArgumentError("missing values are not allowed for column indexing")) | ||
end | ||
idxs = disallowmissing(idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this call really useful now?
src/subdataframe/subdataframe.jl
Outdated
function SubDataFrame(sdf::SubDataFrame, rowinds::Colon) | ||
return sdf | ||
SubDataFrame(parent::DataFrame, rows) = | ||
SubDataFrame(parent, (1:nrow(parent))[rows]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this trick?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that:
- we are sure that we get
Vector{Int}
as a result; (1:nrow(parent))[rows]
will use machinery from Base to catch any problematic values inrows
, OTOH anything that is valid in Base will be valid here.
If not for performance and error clarity reasons this could be the only definition we use. I leave it because it is future proof - if anything outside from the definitions in other methods gets allowed in Base for indexing we will allow it.
But I can drop it as it is not strictly needed for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use that approach elsewhere, do we? I'd rather be consistent. Also the error thrown will be probably be unclear for users.
src/subdataframe/subdataframe.jl
Outdated
|
||
function Base.view(adf::AbstractDataFrame, rowinds::Any) | ||
function Base.view(adf::AbstractDataFrame, rowinds) | ||
Base.depwarn("`view(adf, x)` will be translated to `view(adf, x, :)` in the future.", :view) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better tell people that it's the new syntax they should use, that's clearer.
src/subdataframe/subdataframe.jl
Outdated
end | ||
|
||
function Base.view(adf::AbstractDataFrame, rowind::Integer, colinds) | ||
Base.depwarn("`view(adf, x)` will create a `DataFrameRow` in the future." * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deprecation doesn't match the called method.
OK. I tried to clean up everything (thank you for the patience - this PR is complex unfortunately). We are left with a But I guess in #1533 we investigated all the options and this one had least downsides. How well this goes also hinges on JuliaLang/julia#29417 decision, but I guess it did not get a lot of attention there. |
Just to add (is it is better to be sure what we are doing) - I would be also OK, if |
fb0d604
to
17a6543
Compare
83dafd9
to
8ff49d4
Compare
OK - I dug through getting this PR green on CI. Especially as I came to the conclusion that:
|
Hopefully I have cleaned up all depwarn messages. |
src/dataframe/dataframe.jl
Outdated
return DataFrame(new_columns, Index(_names(df)[selected_columns])) | ||
end | ||
|
||
# df[:, SingleColumnIndex] => AbstractVector | ||
# df[:, MultiColumnIndex] => DataFrame | ||
function Base.getindex(df::DataFrame, row_ind::Colon, col_inds) | ||
Base.depwarn("indexing with colon as row will create a copy in the future" * | ||
" use df[col_inds] to get the columns without copying", :getindex) | ||
Base.depwarn("Sndexing with colon as row will create a copy in the future. " * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love Sndexing into data frames
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new operation I am working on. A combination of "Selecting" and "Indexing" in one go with a bit more of the latter than the former.
I've played a bit with creating tuples for each row and it appears that even when passing a tuple of In the end @quinnj I guess you've considered these issues a lot when designing Tables.jl? |
Why worse? Can it actually be worse than that anyway? :-) |
@nalimilan Is there a conclusion on the |
I'd say use |
Good. One last thing before I clean it up. As I am OK with both (given the discussion we had before both alternatives have pros and cons) - but prefer to double check before moving forward. |
I'd go with |
Is it too late to just make our own two types |
@pdeffebach Can you explain what you mean in a bit more detail please because I am not sure what you mean (in general I guess it is not too late for changes - we try to clean up DataFrames.jl as much as possible before 1.0 release). |
Something like
Then there is
I like this plan because it means we don't have to worry about losing behavior when a On the other hand, I don't like this plan because its one more type we have to export and maintain. |
The problem is that creating a new |
Thanks for the feedback. If that's the case, then relatively large change in API ( I must admit I don't really know the use-case of copying it. Seems like it will be pretty rare. However EDIT: I think i misinterpreted the above conversation. returning NamedTuple on copy seems great. |
I have improved the documentation to explain design considerations. |
196f322
to
353c59a
Compare
Thanks. It this good to go then? |
I would say is is good to go. Also then I think that #1571 should be rebased against it and try to be consistent with target functionality. |
Great work! |
There are lots of deprecation warnings when running tests. Some seem to be potentially problematic for users. For example:
This seems to happen when calling |
That was the issue I was afraid of when asking about cleaning up deprecations process. |
This PR was hard for me. Therefore I split it into implementation (pushing now) and tests (which I will add when we decide this is the functionality we want).
It implements discussion in #1190, #142 and #1533 and some minor clean-ups.
A particular difficulty was what to deprecate and what to change. @nalimilan - an initial review is appreciated (I will work on it further if we decide this is a right direction).