Single-row indexing inconsistent with Base and other indexing operations #2665
Comments
We've considered returning a copy, but unfortunately that's really inefficient when the number of columns is large, so we decided that it was better to be inconsistent with Base. Have you found it to be a problem in practice, or just difficult to explain/justify?
In particular when you recommend:
The problem is: a copy of what would you like to return, the whole data frame (with a view into it) or only the portion that is referenced? However, as @nalimilan noted, whatever decision you make there would be expensive. Here is an example:
(the behavior in The decision was that when one does |
Just difficult to explain/justify. I can't think of a case where I wanted to get a copy of a row and modify it.
@bkamins I was thinking of a new container independent from the data frame, for example:
function f3(df)
    s = 0.0
    for i in axes(df, 1)
        s += sum(NamedTuple(df[i, :]))
    end
    return s
end
This has performance closer to views:
Also it includes the cost of creating the view with
function f4(df)
    s = 0.0
    cols = 1:10
    for i in axes(df, 1)
        s += sum(NamedTuple{Tuple(DataFrames._names(df)[cols])}(ntuple(j->df[i, cols[j]], length(cols))))
    end
    return s
end
julia> @time f4(df);
  2.204085 seconds (39.38 M allocations: 1.067 GiB, 4.41% gc time, 6.59% compilation time)
julia> @time f4(df);
  2.069275 seconds (38.99 M allocations: 1.043 GiB, 4.86% gc time)
Here it's quite close to the performance of views (though I'm not sure it's a fair comparison: maybe the compiler can optimize more with the explicit
Of course there's the little problem that this solution is also inconsistent, returning an immutable "copy" while other operations return a mutable one 😊. We could just as well return an "ImmutableDataFrameRow" then. But I guess the immutable restriction would feel like a gratuitous limitation when there's no equivalent lightweight syntax for a mutable view on
I don't have a better idea... I think the low performance and utility of mutable copies for single rows are a good enough justification for the current behavior, so feel free to close this issue.
OK - thank you for thinking about this. We really need external validation of our decisions. But your conclusion is the one we had: in normal use cases making a view is indistinguishable from making a copy, and it is faster. It also keeps down the number of types the user needs to learn. Thank you!
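For readers who do need a row that is independent of its parent, here is a minimal sketch of one way to get it, assuming a current DataFrames.jl release where `copy` on a `DataFrameRow` returns a `NamedTuple` (the toy data frame and variable names are made up for illustration):

```julia
using DataFrames

# Hypothetical toy data, for illustration only.
df = DataFrame(A = [1, 2], B = [3.0, 4.0])

row_view = df[1, :]         # DataFrameRow: a view into df
row_copy = copy(row_view)   # NamedTuple: immutable and independent of df

df[1, :A] = 10              # modify the parent data frame

# The view follows the parent; the copy keeps the value it was created with.
println(row_view.A)
println(row_copy.A)
```

The copy is immutable, which is the trade-off discussed above: it is cheap and detached from the data frame, but it is not a mutable row container.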
While writing a DataFrames.jl introduction for a data engineering class, there's one part I found difficult to present in a convincing way: the copy-or-view behavior of indexing operations.
I think it comes down to a single weird case: df[1, :] or df[1, [:A, :B]], which returns a view. This seems inconsistent in several ways:
A[1, :] returns a copy)
df[1:2, :]
df[:, 1]
! for views
It seems the rules would be much simpler if single-row indexing returned a copy, and df[1, !] was introduced as shorthand to get a view (then @view would be required to get a view of df[1, [:A, :B]]).
I think the new rule would be:
I realize I don't have the full picture and there is a long trail of design decisions leading to the current state (the most relevant I found is #1533), and I think it was only recently decided to keep df[1, !]. But now that ! is here, why not use it for rows too?