Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single-row indexing inconsistent with Base and other indexing operations #2665

Closed
knuesel opened this issue Mar 20, 2021 · 4 comments
Closed
Labels

Comments

@knuesel
Copy link
Contributor

knuesel commented Mar 20, 2021

While writing a DataFrames.jl introduction for a data engineering class, there's one part I found difficult to present in a convincing way: the copy-or-view behavior of indexing operations.

I think it comes down to a single weird case: df[1, :] or df[1, [:A, :B]], which returns a view. This seems inconsistent in several ways:

  • Different from the behavior of Base arrays (A[1, :] returns a copy)
  • Inconsistent with df[1:2, :]
  • Inconsistent with df[:, 1]
  • Hard to make sense of, since DataFrames.jl has a special index ! for views

It seems the rules would be much simpler if single-row indexing returned a copy, and df[1, !] was introduced as shorthand to get a view (then @view would be required to get a view of df[1, [:A, :B]]).

I think the new rule would be:

Indexing returns a copy, just like arrays, except when using the special index ! which works like : but returns a view.

I realize I don't have the full picture and there is a long trail of design decisions leading to the current state (the most relevant I found is #1533), and I think it was only recently decided to keep df[1, !]. But now that ! is here, why not use it for rows too?

@nalimilan
Copy link
Member

We've considered returning a copy, but unfortunately that's really inefficient when the number of columns is large, so we decided that it was better to be inconsistent with Base. Have you found it to be a problem in practice, or just difficult to explain/justify?

@bkamins
Copy link
Member

bkamins commented Mar 20, 2021

In particular when you recommend:

Indexing returns a copy

The problem is a copy of what you would like to return - whole data frame (and make a view into it) or only a portion of it that is referenced to?

However, as @nalimilan noted whatever decision you made there it would be expensive. Here is an example:

julia> function f1(df)
       s = 0.0
       for i in axes(df, 1)
           s += sum(df[i, :])
       end
       return s
       end
f1 (generic function with 1 method)

julia> 

julia> function f2(df)
       s = 0.0
       for i in axes(df, 1)
           s += sum(DataFrame(df[i, :])[1, :])
       end
       return s
       end
f2 (generic function with 1 method)

julia> df = DataFrame(rand(10^6, 10), :auto);

julia> @time f1(df);
  1.846600 seconds (60.99 M allocations: 1.058 GiB, 13.24% gc time)

julia> @time f1(df);
  1.466934 seconds (60.99 M allocations: 1.058 GiB, 2.71% gc time)

julia> @time f2(df);
  4.292129 seconds (75.00 M allocations: 3.651 GiB, 9.69% gc time)

julia> @time f2(df);
  4.400600 seconds (75.00 M allocations: 3.651 GiB, 9.51% gc time)

(the behavior in f2 is what you propose to implement)

The decision was that when one does df[1, cols] most likely it is not a problem that the returned object is a view (+ actually sometimes it is desired).

@knuesel
Copy link
Contributor Author

knuesel commented Mar 21, 2021

Have you found it to be a problem in practice, or just difficult to explain/justify?

Just difficult to explain/justifiy. I can't think of a case where I wanted to get a copy of a row and modify it.

@bkamins I was thinking a new container independent from the data frame, for example:

function f3(df)
    s = 0.0
    for i in axes(df, 1)
        s += sum(NamedTuple(df[i, :]))
    end
    return s
end

This has performance closer to views:

julia> @time f1(df);
  2.104450 seconds (61.07 M allocations: 1.063 GiB, 8.63% gc time, 3.21% compilation time)

julia> @time f1(df);
  1.848150 seconds (60.99 M allocations: 1.058 GiB, 1.98% gc time)

julia> @time f2(df);
  5.607756 seconds (75.29 M allocations: 3.668 GiB, 8.05% gc time, 1.82% compilation time)

julia> @time f2(df);
  5.524628 seconds (75.00 M allocations: 3.651 GiB, 8.25% gc time)

julia> @time f3(df);
  2.868462 seconds (38.46 M allocations: 1.026 GiB, 3.19% gc time, 10.09% compilation time)

julia> @time f3(df);
  2.586762 seconds (37.99 M allocations: 1022.261 MiB, 3.44% gc time)

Also it includes the cost of creating the view with df[i, :]. It looks better when creating the named tuple directly:

function f4(df)
    s = 0.0
    cols = 1:10
    for i in axes(df, 1)
        s += sum(NamedTuple{Tuple(DataFrames._names(df)[cols])}(ntuple(j->df[i, cols[j]], length(cols))))
    end
    return s
end

julia> @time f4(df);
  2.204085 seconds (39.38 M allocations: 1.067 GiB, 4.41% gc time, 6.59% compilation time)

julia> @time f4(df);
  2.069275 seconds (38.99 M allocations: 1.043 GiB, 4.86% gc time)

Here it's quite close to the performance of views (though I'm not sure it's a fair comparison: maybe the compiler can optimize more with the explicit 1:10. On the other hand, a serious implementation could disable some bounds checking...)

Of course there's the little problem that this solution is also inconsistent, returning an immutable "copy" while other operations return a mutable one 😊. We could just as well return an "ImmutableDataFrameRow" then. But I guess the immutable restriction would feel like a gratuitous limitation when there's no equivalent lightweight syntax for a mutable view on df[1, cols].

I don't have a better idea... I think the low performance and utility of mutable copies for single rows are a good enough justification for the current behavior so feel free to close this issue.

@bkamins
Copy link
Member

bkamins commented Mar 21, 2021

OK - thank you for thinking about this. We really need an external validation of our decisions. But your conclusion is the one we had - in normal use cases making a view is indistinguishable from making a copy and it is faster. It also allows to keep the number of types the user needs to learn down. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants