-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support adding columns to views #2794
Conversation
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I will leave Milan to comment more on the technical details.
On the API, I wonder if transform!(sdf, ...)
should return parent(sdf)
after the transformation. This is more in-line with what we do for transfrorm(gd::GroupedDataFrames, ...)
. We decided to automatically combine because the ungroup
stickiness can lead to bugs in R. Maybe we should add a keyword argument to transform!
about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some comments. I haven't looked at tests yet.
On the API, I wonder if
transform!(sdf, ...)
should returnparent(sdf)
after the transformation. This is more in-line with what we do fortransfrorm(gd::GroupedDataFrames, ...)
. We decided to automatically combine because theungroup
stickiness can lead to bugs in R. Maybe we should add a keyword argument totransform!
about this?
Given that SubDataFrame
is an AbstractDataFrame
I think it should continue behaving as much as possible like a DataFrame
. But we could add a keyword argument later if that turns out to be useful (notably for piping).
It is allowed. What is not allowed is |
In
is longer than just
In general I would prefer to keep |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan I have applied all the discussions we had. I have resolved all conversations that I thought I clear and left only three open. I will update the tests when we settle the design. |
src/subdataframe/subdataframe.jl
Outdated
# This has an additional effect that for CategoricalVector levels | ||
# and ordering will be retained or not depending on which code patch is taken. | ||
|
||
# TODO: add tests when promote_type vs Base.promote_typejoin decision is made |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - this is the performance optimization we have discussed for select!
and transform!
. As commented - it has a side effect for CategoricalArrays.jl but I think it is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but this isn't just an optimization, as the difference can be observed from the outside, right? I think we should only apply tricks that are completely invisible for users, or the behavior will be to complex. Apparently I was wrong in thinking that select!
and transform
are amenable to such optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - it is visible from outside if the column has metadata. I will restrict the optimization to Vector
, as Vector
has no metadata (as opposed to PooledVector
or CategoricalVector
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it visible also for Vector
? That is, if one holds a reference to the column, mutating it rather than replacing it can have consequences. (I'm not overly concerned about PooledVector
and CategoricalVector
if that only makes a difference in weird cases.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can have consequences if there are aliases to this vector. But OK - let us just copy always and stick to "safety first" rule (if someone wants it fast it is always easy-enough to achieve with indexing).
src/dataframe/dataframe.jl
Outdated
newcol = Tables.allocatecolumn(Union{T, Missing}, nrow(dfp)) | ||
fill!(newcol, missing) | ||
newcol[rows(df)] = item_new | ||
item_new = newcol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but item_new_df
sounds like a weird name. How about something like item_new_orig
?
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
It is mentioned at the top in the following part:
|
@test df.a == [1.5, 2, 3, 4] | ||
@test eltype(df.a) === Float64 | ||
|
||
# note that CategoricalVector is dropped as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - this part of tests probably deserves your special attention. Thank you!
@nalimilan - this should be good to have a look at again. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive set of tests!
test/subdataframe_mutation.jl
Outdated
# we first copy old data and then add new data so "1" is in levels | ||
# although it is not present in df.a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK "1" would be in the levels even if we only copied the new values, as all values carry the whole set of levels. The only difference can be in the ordering of levels, since there's no well-defined merged order here.
df = DataFrame(a=1:4) | ||
a = df.a | ||
sdf = @view df[1:1, :] | ||
select!(sdf, :a => (x -> x) => :a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test x -> [1]
(here and below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/broadcasting.jl
Outdated
df.x1 -= [1, 1, 1] | ||
df.x2 -= [100, 100, 0] | ||
@test df == refdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not simply compare df
with a literal DataFrame
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok - changed
test/broadcasting.jl
Outdated
@@ -220,7 +224,8 @@ end | |||
@test df[:, Not("x1")] == refdf[:, 2:end] | |||
|
|||
dfv = @view df[1:2, 2:end] | |||
@test_throws ArgumentError dfv[!, 1] .+= [0, 1] .+ 1 | |||
dfv[!, 1] .+= [0, 1] .+ 1 | |||
@test df.x2 == [5.5, 7.5, 6.5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check the contents of the whole df
while you're at it (here and elsewhere)? For example, the column could be re-added in the wrong place...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@test_throws BoundsError sdf[:, 4] = ["a", "b", "c"] | ||
@test_throws DimensionMismatch sdf[:, 1] = [1] | ||
@test_throws MethodError sdf[:, 1] = 1 | ||
if DataFrames.is_column_insertion_allowed(sdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any simple way to check this without relying on the internal function? That would allow checking that it's correct too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
test/subdataframe_mutation.jl
Outdated
tmpa = df.a | ||
sdf[:, [:c, :b, :a]] = DataFrame(c=[5, 6], b=[1.0, 2.0], a=[13, 12]) | ||
@test df == DataFrame(a=[1, 12, 13, 4, 5], | ||
b=[11.0, 2.0, 1.0, 14.0, 15.0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect indentation here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Frankly - we are at the level of complexity of the logic and number of corner cases where I need the tests to convince myself that I can make a PR public. |
All comments should be now resolved. Thank you! |
Thank you! @pdeffebach - now it is probably the time to experiment with this functionality in combination with what DataFramesMeta.jl could provide. |
This partially implements #2785 for:
setindex!
with:
as row selector:
as row selectorinsertcols!
as only these operations, as discussed, ensure that we ADD columns (and not replace them).
Before I add tests let us discuss what we think about this functionality.
@matthieugomez - in particular could you please comment if you would find it useful? (or your fundamental use case is
transform!
withgroupby
run onSubDataFrame
and having only these methods is of not much value?)The problematic point is that supporting
select!
andtransfrorm!
, as commented earlier creates a slight ambiguity (of course possible to resolve) when if we write:a => identity => :a
as a transformation it is not clear if:a
should be filled withmissing
in filtered-out rows or it should be left as is.