Support adding columns to views #2794

Merged: 32 commits, Sep 1, 2021
Changes from 31 commits

Commits
289fadf
add setindex! rules
bkamins Jun 11, 2021
a253236
implement setindex! and broadcasting assignment
bkamins Jun 20, 2021
e33c605
implement insertcols!
bkamins Jun 21, 2021
7f1814a
add NEWS.md entry
bkamins Jun 21, 2021
4a22a1d
Apply suggestions from code review
bkamins Jun 27, 2021
b721a46
changes after code review part 2
bkamins Jun 27, 2021
412b89c
docs update
bkamins Jun 27, 2021
b95f07a
setindex! for ! and setproperty
bkamins Jun 27, 2021
af54c29
Merge branch 'main' into bk/view_add_column
bkamins Aug 6, 2021
dc0d241
fix NEWS.md
bkamins Aug 6, 2021
9f02571
another NEWS.md fix
bkamins Aug 6, 2021
121bb54
another small NEWS.md change
bkamins Aug 6, 2021
1a83b61
finished tests for df[!, col] assignment and broadcasted assignment
bkamins Aug 7, 2021
7d5a65b
some more tests
bkamins Aug 7, 2021
f614e58
done tests of ! assignment and broadcasting assignment
bkamins Aug 7, 2021
e886dc1
finished assignment, broadcasted assignment and insertcols!
bkamins Aug 7, 2021
59afd61
fix tests
bkamins Aug 8, 2021
700e65d
fix tests on Julia 1.7
bkamins Aug 8, 2021
50d9f8b
one more test fix
bkamins Aug 8, 2021
6045034
finalize all required changes
bkamins Aug 8, 2021
1ae0534
fix 1.7 broadcasting
bkamins Aug 8, 2021
971c282
Apply suggestions from code review
bkamins Aug 25, 2021
c4cb1ae
apply suggestions after code review
bkamins Aug 25, 2021
cadc128
fix fast path is select!/transform!
bkamins Aug 25, 2021
bdbf09a
Merge branch 'main' into bk/view_add_column
bkamins Aug 28, 2021
4ad940c
Apply suggestions from code review
bkamins Aug 29, 2021
8f134ef
changes after code review and promote_type decision
bkamins Aug 29, 2021
1f4aa78
fix tests
bkamins Aug 29, 2021
55a6d75
Apply suggestions from code review
bkamins Aug 29, 2021
93de064
apply changes after code review
bkamins Aug 29, 2021
3e5d8d8
Merge branch 'main' into bk/view_add_column
bkamins Sep 1, 2021
f25d333
Update NEWS.md
bkamins Sep 1, 2021
30 changes: 29 additions & 1 deletion NEWS.md
@@ -2,12 +2,16 @@

## New functionalities

* in the `groupby` function the `sort` keyword argument now allows three values
* Improve `sort` keyword argument in `groupby`
([#2812](https://github.com/JuliaData/DataFrames.jl/pull/2812)).

In the `groupby` function the `sort` keyword argument now allows three values:
- `nothing` (the default) leaves the order of groups undefined and allows
`groupby` to pick the fastest available grouping algorithm;
- `true` sorts groups by key columns;
- `false` creates groups in the order of their appearance in the parent data
frame.

In previous versions, the `sort` keyword argument allowed only `Bool` values
and `false` (which was the default) corresponded to the new
behavior when `nothing` is passed. Therefore only the user visible change
@@ -18,6 +22,30 @@
(notably `PooledArray` and `CategoricalArray`) or when they contained only
integers in a small range.
([#2812](https://github.com/JuliaData/DataFrames.jl/pull/2812))

* Allow adding new columns to a `SubDataFrame` created with `:` as column selector
([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).

If `sdf` is a `SubDataFrame` created with `:` as a column selector then
`insertcols!`, `setindex!`, and broadcasted assignment allow for creation
of new columns, automatically filling filtered-out rows with `missing` values.

* Allow replacing existing columns in a `SubDataFrame` with `!` as row selector in assignment and broadcasted assignment
([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).

Assignment to existing columns allocates a new column.
Values already stored in filtered-out rows are copied.

* Allow `SubDataFrame` to be passed as an argument to `select!` and `transform!`
(also on `GroupedDataFrame` created from a `SubDataFrame`)
([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).

Assignment to existing columns allocates a new column.
Values already stored in filtered-out rows are copied.
In case of creation of new columns, filtered-out rows are automatically
filled with `missing` values.
If `SubDataFrame` was not created with `:` as column selector the resulting operation
must produce the same column names as stored in the source `SubDataFrame` or an error is thrown.
* `Tables.materializer` when passed the following types or their subtypes:
`AbstractDataFrame`, `DataFrameRows`, `DataFrameColumns` returns `DataFrame`.
([#2839](https://github.com/JuliaData/DataFrames.jl/pull/2839))
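
The three `SubDataFrame` entries above can be illustrated with a short sketch. It assumes a DataFrames.jl version that includes this PR; the data frame and column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(a=1:5)
sdf = @view df[2:3, :]               # view created with `:` as column selector

sdf[!, :b] = [10, 20]                # new column; filtered-out rows become missing
insertcols!(sdf, :c => ["x", "y"])   # insertcols! follows the same rule
sdf[!, :a] = [2.5, 3.5]              # existing column is replaced by a new vector

@assert isequal(df.b, [missing, 10, 20, missing, missing])
@assert isequal(df.c, [missing, "x", "y", missing, missing])
@assert df.a == [1.0, 2.5, 3.5, 4.0, 5.0]   # old values copied, eltype promoted
```

Because `sdf` was created with `:`, the new columns `b` and `c` become visible through the view as well as in the parent `df`.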
43 changes: 38 additions & 5 deletions docs/src/lib/indexing.md
@@ -144,9 +144,30 @@ so it is unsafe to use it afterwards (the column length correctness will be pres
* `sdf[row, cols] = v` -> the same as `dfr = df[row, cols]; dfr[:] = v` in-place;
* `sdf[rows, col] = v` -> set rows `rows` of column `col`, in-place; `v` must be an abstract vector;
* `sdf[rows, cols] = v` -> set rows `rows` of columns `cols` in-place;
`v` can be an `AbstractMatrix` or `v` can be `AbstractDataFrame` when column names must match;
`v` can be an `AbstractMatrix` or `v` can be `AbstractDataFrame`
in which case column names must match;
* `sdf[!, col] = v` -> replaces `col` with `v` with copying; if `col` is present in `sdf`
then filtered-out rows in newly created vector are filled with
values already present in that column and `promote_type` is used
to determine the `eltype` of the new column;
if `col` is not present in `sdf` then the operation is only allowed
if `sdf` was created with `:` as column selector, in which case
filtered-out rows are filled with `missing`;
equivalent to `sdf.col = v` if `col` is a valid identifier;
operation is allowed if `length(v) == nrow(sdf)`;
* `sdf[!, cols] = v` -> replaces existing columns `cols` in data frame `sdf` with copying;
`v` must be an `AbstractMatrix` or an `AbstractDataFrame`
(in the latter case column names must match);
filtered-out rows in newly created vectors are filled with
values already present in respective columns
and `promote_type` is used to determine the `eltype` of the new columns;

!!! note

The rules above mean that `sdf[:, col] = v` is an in-place operation if `col` is present in `sdf`,
therefore it will be fast in general. On the other hand using `sdf[!, col] = v`
or `sdf.col = v` will always allocate a new vector, which is more expensive computationally.
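
The difference described in the note can be checked by comparing vector identity before and after each assignment. This is a minimal sketch assuming a DataFrames.jl version that includes this PR; the variable `old` is introduced only for illustration.

```julia
using DataFrames

df = DataFrame(a=1:5)
sdf = @view df[2:3, :]
old = df.a                  # the vector currently stored in df (no copy)

sdf[:, :a] = [20, 30]       # in-place: writes into the existing vector
@assert df.a === old
@assert old == [1, 20, 30, 4, 5]

sdf[!, :a] = [200, 300]     # allocates a new vector and stores it in df
@assert df.a !== old
@assert df.a == [1, 200, 300, 4, 5]
@assert old == [1, 20, 30, 4, 5]   # the old vector is left untouched
```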

Note that `sdf[!, col] = v`, `sdf[!, cols] = v` and `sdf.col = v` are not allowed as `sdf` can be only modified in-place.

`setindex!` on `DataFrameRow`:
* `dfr[col] = v` -> set value of `col` in row `row` to `v` in-place;
@@ -171,7 +192,6 @@ The following broadcasting rules apply to `AbstractDataFrame` objects:
Note that if broadcasting assignment operation throws an error the target data frame may be partially changed
so it is unsafe to use it afterwards (the column length correctness will be preserved).


Broadcasting `DataFrameRow` is currently not allowed (which is consistent with `NamedTuple`).

It is possible to assign a value to `AbstractDataFrame` and `DataFrameRow` objects using the `.=` operator.
@@ -199,8 +219,21 @@ Additional rules:
Starting from Julia 1.7 if `:col` is not present in `df` then a new column will be created in `df`.
* in the `sdf[CartesianIndex(row, col)] .= v`, `sdf[row, col] .= v` and `sdf[row, cols] .= v` syntaxes the assignment to `sdf` is performed in-place;
* in the `sdf[rows, col] .= v` and `sdf[rows, cols] .= v` syntaxes the assignment to `sdf` is performed in-place;
* `sdf.col .= v` syntax is performs an in-place assignment to an existing vector `sdf.col` and is deprecated;
in the future this operation will not be allowed.
if `rows` is `:` and `col` is a `Symbol` or `AbstractString`
referring to a column missing from `sdf` and `sdf` was created with `:` as column selector
then a new column is allocated and added;
the filtered-out rows are filled with `missing`;
* in the `sdf[!, col] .= v` syntax column `col` is replaced by a freshly allocated vector;
the filtered-out rows are filled with values already present in `col`;
if `col` is a `Symbol` or `AbstractString` referring to a column missing from `sdf`
and `sdf` was created with `:` as column selector then a new column is allocated and added;
in this case the filtered-out rows are filled with `missing`;
* the `sdf[!, cols] .= v` syntax replaces existing columns `cols` in data frame `sdf` with freshly allocated vectors;
the filtered-out rows are filled with values already present in `cols`;
* `sdf.col .= v` syntax currently performs in-place assignment to an existing vector `sdf.col`;
this behavior is deprecated and a new column will be allocated in the future.
Starting from Julia 1.7 if `:col` is not present in `sdf` then a new column will be created in `sdf`
if `sdf` was created with `:` as a column selector.
* `dfr.col .= v` syntax is allowed and performs in-place assignment to a value extracted by `dfr.col`.

Note that `sdf[!, col] .= v` and `sdf[!, cols] .= v` syntaxes are not allowed as `sdf` can be only modified in-place.
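
The new broadcasting rules for `sdf[!, col] .= v` and `sdf[:, col] .= v` can be sketched as follows, assuming a DataFrames.jl version that includes this PR. The column name `flag` and the variable `old` are hypothetical.

```julia
using DataFrames

df = DataFrame(a=1:5)
sdf = @view df[2:3, :]      # created with `:` as column selector
old = df.a

sdf[!, :a] .= 0             # df.a is replaced by a freshly allocated vector
sdf[:, :flag] .= true       # `flag` does not exist yet, so it is created in df

@assert df.a == [1, 0, 0, 4, 5]    # filtered-out rows keep their old values
@assert df.a !== old
@assert isequal(df.flag, [missing, true, true, missing, missing])
```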
168 changes: 166 additions & 2 deletions docs/src/man/split_apply_combine.md
@@ -113,9 +113,10 @@ In all of these cases, `function` can return either a single row or multiple
rows. As a particular rule, values wrapped in a `Ref` or a `0`-dimensional
`AbstractArray` are unwrapped and then treated as a single row.

`select`/`select!` and `transform`/`transform!` always return a `DataFrame`
`select`/`select!` and `transform`/`transform!` always return a data frame
with the same number and order of rows as the source (even if `GroupedDataFrame`
had its groups reordered).
had its groups reordered), except when selection results in zero columns
in the resulting data frame (in which case the result has zero rows).
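
A small sketch of the zero-column exception, using a made-up one-column data frame:

```julia
using DataFrames

df = DataFrame(a=1:3)

# A regular selection keeps the number of rows of the source.
@assert nrow(select(df, :a => ByRow(abs))) == nrow(df)

# Dropping every column leaves nothing to carry the rows, so the result is 0×0.
@assert size(select(df, Not(:a))) == (0, 0)
```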

For `combine`, rows in the returned object appear in the order of groups in the
`GroupedDataFrame`. The functions can return an arbitrary number of rows for
@@ -617,3 +618,166 @@ julia> gd[1]
─────┼───────
1 │ 1
```

# Simulating the SQL `where` clause

You can conveniently work on subsets of a data frame by using `SubDataFrame`s.
Operations performed on such objects can either create a new data frame or be
performed in-place. Here are some examples:

```jldoctest sac
julia> df = DataFrame(a=1:5)
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5

julia> sdf = @view df[2:3, :]
2×1 SubDataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     2
   2 │     3

julia> transform(sdf, :a => ByRow(string)) # create a new data frame
2×2 DataFrame
 Row │ a      a_string
     │ Int64  String
─────┼─────────────────
   1 │     2  2
   2 │     3  3

julia> transform!(sdf, :a => ByRow(string)) # update the source df in-place
2×2 SubDataFrame
 Row │ a      a_string
     │ Int64  String?
─────┼─────────────────
   1 │     2  2
   2 │     3  3

julia> df # new column was created filled with missing in filtered-out rows
5×2 DataFrame
 Row │ a      a_string
     │ Int64  String?
─────┼─────────────────
   1 │     1  missing
   2 │     2  2
   3 │     3  3
   4 │     4  missing
   5 │     5  missing

julia> select!(sdf, :a => -, renamecols=false) # update the source df in-place
2×1 SubDataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │    -2
   2 │    -3

julia> df # the column replaced an existing column; previously stored values are re-used in filtered-out rows
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │    -2
   3 │    -3
   4 │     4
   5 │     5
```

Similar operations can be performed on `GroupedDataFrame` as well:
```jldoctest sac
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 3], b=1:6)
6×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     2      4
   5 │     2      5
   6 │     3      6

julia> sdf = @view df[2:4, :]
3×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     1      3
   3 │     2      4

julia> gsdf = groupby(sdf, :a)
GroupedDataFrame with 2 groups based on key: a
First Group (2 rows): a = 1
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     1      3
Last Group (1 row): a = 2
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2      4

julia> transform(gsdf, nrow) # create a new data frame
3×3 DataFrame
 Row │ a      b      nrow
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      2
   2 │     1      3      2
   3 │     2      4      1

julia> transform!(gsdf, nrow, :b => :b_copy)
3×4 SubDataFrame
 Row │ a      b      nrow    b_copy
     │ Int64  Int64  Int64?  Int64?
─────┼──────────────────────────────
   1 │     1      2       2       2
   2 │     1      3       2       3
   3 │     2      4       1       4

julia> df
6×4 DataFrame
 Row │ a      b      nrow     b_copy
     │ Int64  Int64  Int64?   Int64?
─────┼────────────────────────────────
   1 │     1      1  missing  missing
   2 │     1      2        2        2
   3 │     1      3        2        3
   4 │     2      4        1        4
   5 │     2      5  missing  missing
   6 │     3      6  missing  missing

julia> select!(gsdf, :b_copy, :b => sum, renamecols=false)
3×3 SubDataFrame
 Row │ a      b_copy  b
     │ Int64  Int64?  Int64
─────┼──────────────────────
   1 │     1       2      5
   2 │     1       3      5
   3 │     2       4      4

julia> df
6×3 DataFrame
 Row │ a      b_copy   b
     │ Int64  Int64?   Int64
─────┼───────────────────────
   1 │     1  missing      1
   2 │     1        2      5
   3 │     1        3      5
   4 │     2        4      4
   5 │     2  missing      5
   6 │     3  missing      6
```
35 changes: 25 additions & 10 deletions src/abstractdataframe/selection.jl
@@ -107,9 +107,10 @@ const TRANSFORMATION_COMMON_RULES =
rows. As a particular rule, values wrapped in a `Ref` or a `0`-dimensional
`AbstractArray` are unwrapped and then treated as a single row.

`select`/`select!` and `transform`/`transform!` always return a `DataFrame`
`select`/`select!` and `transform`/`transform!` always return a data frame
with the same number and order of rows as the source (even if `GroupedDataFrame`
had its groups reordered).
had its groups reordered), except when selection results in zero columns
in the resulting data frame (in which case the result has zero rows).

For `combine`, rows in the returned object appear in the order of groups in the
`GroupedDataFrame`. The functions can return an arbitrary number of rows for
@@ -623,17 +624,27 @@ function select_transform!((nc,)::Ref{Any}, df::AbstractDataFrame, newdf::DataFr
end

"""
select!(df::DataFrame, args...; renamecols::Bool=true)
select!(df::AbstractDataFrame, args...; renamecols::Bool=true)
select!(args::Base.Callable, df::DataFrame; renamecols::Bool=true)
select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
select!(gd::GroupedDataFrame, args...; ungroup::Bool=true, renamecols::Bool=true)
select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)

Mutate `df` or `gd` in place to retain only columns or transformations specified by `args...` and
return it. The result is guaranteed to have the same number of rows as `df` or
parent of `gd`, except when no columns are selected (in which case the result
has zero rows).

If `gd` is passed then it is updated to reflect the new rows of its updated
If a `SubDataFrame` or `GroupedDataFrame{SubDataFrame}` is passed, the parent data frame
is updated using columns generated by `args...`, following the same rules as indexing:
- for existing columns filtered-out rows are filled with values present in the
old columns
- for new columns (which is only allowed if `SubDataFrame` was created with `:`
as column selector) filtered-out rows are filled with `missing`
- if `SubDataFrame` was not created with `:` as column selector then `select!`
is only allowed if the transformations keep exactly the same sequence of column
names as in the passed `df`

If a `GroupedDataFrame` is passed then it is updated to reflect the new rows of its updated
parent. If there are independent `GroupedDataFrame` objects constructed using
the same parent data frame they might get corrupted.
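
The third rule in the list above (a `SubDataFrame` not created with `:` as column selector) can be sketched as follows. The data and column names are hypothetical, and the sketch only checks that some exception is raised without asserting its exact type.

```julia
using DataFrames

df = DataFrame(a=1:5, b=11:15)
sdf = @view df[2:3, [:a]]            # view that subsets columns of its parent

select!(sdf, :a => ByRow(-) => :a)   # allowed: column names stay exactly [:a]
@assert df.a == [1, -2, -3, 4, 5]    # filtered-out rows keep their old values
@assert df.b == 11:15                # column outside the view is untouched

# Adding a column through this view is expected to fail,
# because `sdf` was not created with `:` as column selector.
threw = try
    transform!(sdf, :a => ByRow(string) => :a_string)
    false
catch
    true
end
@assert threw
```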

@@ -650,6 +661,9 @@ See [`select`](@ref) for examples.
select!(df::DataFrame, @nospecialize(args...); renamecols::Bool=true) =
    _replace_columns!(df, select(df, args..., copycols=false, renamecols=renamecols))

select!(df::SubDataFrame, @nospecialize(args...); renamecols::Bool=true) =
    _replace_columns!(df, select(df, args..., copycols=true, renamecols=renamecols))

function select!(@nospecialize(arg::Base.Callable), df::AbstractDataFrame; renamecols::Bool=true)
    if arg isa Colon
        throw(ArgumentError("First argument must be a transformation if the second argument is a data frame"))
@@ -658,14 +672,15 @@ function select!(@nospecialize(arg::Base.Callable), df::AbstractDataFrame; renam
end

"""
transform!(df::DataFrame, args...; renamecols::Bool=true)
transform!(args::Callable, df::DataFrame; renamecols::Bool=true)
transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
transform!(df::AbstractDataFrame, args...; renamecols::Bool=true)
transform!(args::Callable, df::AbstractDataFrame; renamecols::Bool=true)
transform!(gd::GroupedDataFrame, args...; ungroup::Bool=true, renamecols::Bool=true)
transform!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)

Mutate `df` or `gd` in place to add columns specified by `args...` and return it.
The result is guaranteed to have the same number of rows as `df`.
Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`.
Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`,
except that column renaming performs a copy.

$TRANSFORMATION_COMMON_RULES

@@ -677,7 +692,7 @@ $TRANSFORMATION_COMMON_RULES

See [`select`](@ref) for examples.
"""
function transform!(df::DataFrame, @nospecialize(args...); renamecols::Bool=true)
function transform!(df::AbstractDataFrame, @nospecialize(args...); renamecols::Bool=true)
    idx = index(df)
    newargs = Any[if sel isa Pair{<:ColumnIndex, Symbol}
                      idx[first(sel)] => copy => last(sel)