JuliaData · bkamins · Sep 1, 2021 · Jun 11, 2021 · Jun 20, 2021 · Jun 21, 2021
diff --git a/NEWS.md b/NEWS.md
@@ -2,12 +2,16 @@
 
 ## New functionalities
 
-* in the `groupby` function the `sort` keyword argument now allows three values
+* Improve `sort` keyword argument in `groupby`
+  ([#2812](https://github.com/JuliaData/DataFrames.jl/pull/2812)).
+
+  In the `groupby` function the `sort` keyword argument now allows three values:
   - `nothing` (the default) leaves the order of groups undefined and allows
     `groupby` to pick the fastest available grouping algorithm;
   - `true` sorts groups by key columns;
   - `false` creates groups in the order of their appearance in the parent data
     frame;
+
   In previous versions, the `sort` keyword argument allowed only `Bool` values
   and `false` (which was the default) corresponded to the new
   behavior when `nothing` is passed. Therefore only the user visible change
@@ -18,6 +22,30 @@
   (notably `PooledArray` and `CategoricalArray`) or when they contained only
   integers in a small range.
   ([#2812](https://github.com/JuliaData/DataFrames.jl/pull/2812))
+
+* Allow adding new columns to a `SubDataFrame` created with `:` as column selector
+  ([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).
+
+  If `sdf` is a `SubDataFrame` created with `:` as a column selector then
+  `insertcols!`, `setindex!`, and broadcasted assignment allow for creation
+  of new columns, automatically filling filtered-out rows with `missing` values.
+
+* Allow replacing existing columns in a `SubDataFrame` with `!` as row selector in assignment and broadcasted assignment
+  ([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).
+
+  Assignment to existing columns allocates a new column.
+  Values already stored in filtered-out rows are copied.
+
+* Allow `SubDataFrame` to be passed as an argument to `select!` and `transform!`
+  (also on `GroupedDataFrame` created from a `SubDataFrame`)
+  ([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).
+
+  Assignment to existing columns allocates a new column.
+  Values already stored in filtered-out rows are copied.
+  In case of creation of new columns, filtered-out rows are automatically
+  filled with `missing` values.
+  If `SubDataFrame` was not created with `:` as column selector, the resulting operation
+  must produce the same column names as stored in the source `SubDataFrame`; otherwise an error is thrown.
 * `Tables.materializer` when passed the following types or their subtypes:
   `AbstractDataFrame`, `DataFrameRows`, `DataFrameColumns` returns `DataFrame`.
   ([#2839](https://github.com/JuliaData/DataFrames.jl/pull/2839))
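
Reviewer sketch for the `SubDataFrame` entries above: a minimal example of creating new columns through a view, assuming a DataFrames.jl release that includes #2794. The data and the column names `b` and `c` are invented for illustration.

```julia
using DataFrames

df = DataFrame(a=1:5)
sdf = view(df, 2:3, :)               # `:` as column selector, so new columns are allowed

sdf[!, :b] = [20, 30]                # creates column `b` in the parent `df`
insertcols!(sdf, :c => ["x", "y"])   # `insertcols!` follows the same rule

# Rows 1, 4 and 5 of `df` are filtered out of `sdf`,
# so they hold `missing` in the new columns `b` and `c`.
df
```

Because the filtered-out rows must hold `missing`, the new columns in `df` have an element type that allows `missing` even though the assigned vectors do not.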

diff --git a/docs/src/lib/indexing.md b/docs/src/lib/indexing.md
@@ -144,9 +144,30 @@ so it is unsafe to use it afterwards (the column length correctness will be pres
 * `sdf[row, cols] = v` -> the same as `dfr = df[row, cols]; dfr[:] = v` in-place;
 * `sdf[rows, col] = v` -> set rows `rows` of column `col`, in-place; `v` must be an abstract vector;
 * `sdf[rows, cols] = v` -> set rows `rows` of columns `cols` in-place;
-                           `v` can be an `AbstractMatrix` or `v` can be `AbstractDataFrame` when column names must match;
+                           `v` can be an `AbstractMatrix` or `v` can be `AbstractDataFrame`
+                           in which case column names must match;
+* `sdf[!, col] = v` -> replaces `col` with `v` with copying; if `col` is present in `sdf`
+                       then filtered-out rows in the newly created vector are filled with
+                       values already present in that column and `promote_type` is used
+                       to determine the `eltype` of the new column;
+                       if `col` is not present in `sdf` then the operation is only allowed
+                       if `sdf` was created with `:` as column selector, in which case
+                       filtered-out rows are filled with `missing`;
+                       equivalent to `sdf.col = v` if `col` is a valid identifier;
+                       operation is allowed if `length(v) == nrow(sdf)`;
+* `sdf[!, cols] = v` -> replaces existing columns `cols` in data frame `sdf` with copying;
+                       `v` must be an `AbstractMatrix` or an `AbstractDataFrame`
+                       (in the latter case column names must match);
+                       filtered-out rows in the newly created vectors are filled with
+                       values already present in respective columns
+                       and `promote_type` is used to determine the `eltype` of the new columns;
+
+!!! note
+
+    The rules above mean that `sdf[:, col] = v` is an in-place operation if `col` is present in `sdf`,
+    and is therefore fast in general. On the other hand, using `sdf[!, col] = v`
+    or `sdf.col = v` will always allocate a new vector, which is more computationally expensive.
 
-Note that `sdf[!, col] = v`, `sdf[!, cols] = v` and `sdf.col = v` are not allowed as `sdf` can be only modified in-place.
 
 `setindex!` on `DataFrameRow`:
 * `dfr[col] = v` -> set value of `col` in row `row` to `v` in-place;
@@ -171,7 +192,6 @@ The following broadcasting rules apply to `AbstractDataFrame` objects:
 Note that if broadcasting assignment operation throws an error the target data frame may be partially changed
 so it is unsafe to use it afterwards (the column length correctness will be preserved).
 
-
 Broadcasting `DataFrameRow` is currently not allowed (which is consistent with `NamedTuple`).
 
 It is possible to assign a value to `AbstractDataFrame` and `DataFrameRow` objects using the `.=` operator.
@@ -199,8 +219,21 @@ Additional rules:
   Starting from Julia 1.7 if `:col` is not present in `df` then a new column will be created in `df`.
 * in the `sdf[CartesianIndex(row, col)] .= v`, `sdf[row, col] .= v` and `sdf[row, cols] .= v` syntaxes the assignment to `sdf` is performed in-place;
 * in the `sdf[rows, col] .= v` and `sdf[rows, cols] .= v` syntaxes the assignment to `sdf` is performed in-place;
-* `sdf.col .= v` syntax is performs an in-place assignment to an existing vector `sdf.col` and is deprecated;
-  in the future this operation will not be allowed.
+  if `rows` is `:` and `col` is a `Symbol` or `AbstractString`
+  referring to a column missing from `sdf` and `sdf` was created with `:` as column selector
+  then a new column is allocated and added;
+  the filtered-out rows are filled with `missing`;
+* in the `sdf[!, col] .= v` syntax column `col` is replaced by a freshly allocated vector;
+  the filtered-out rows are filled with values already present in `col`;
+  if `col` is a `Symbol` or `AbstractString` referring to a column missing from `sdf`
+  and `sdf` was created with `:` as column selector then a new column is allocated and added;
+  in this case the filtered-out rows are filled with `missing`;
+* the `sdf[!, cols] .= v` syntax replaces existing columns `cols` in data frame `sdf` with freshly allocated vectors;
+  the filtered-out rows are filled with values already present in `cols`;
+* `sdf.col .= v` syntax currently performs in-place assignment to an existing vector `sdf.col`;
+  this behavior is deprecated and a new column will be allocated in the future.
+  Starting from Julia 1.7 if `:col` is not present in `sdf` then a new column will be created in `sdf`
+  if `sdf` was created with `:` as a column selector.
 * `dfr.col .= v` syntax is allowed and performs in-place assignment to a value extracted by `dfr.col`.
 
 Note that `sdf[!, col] .= v` and `sdf[!, cols] .= v` syntaxes are not allowed as `sdf` can be only modified in-place.
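
Reviewer sketch for the `setindex!` rules in this file: a small comparison of `sdf[:, col] = v` (in-place) and `sdf[!, col] = v` (allocating, with `promote_type` deciding the element type), assuming a DataFrames.jl release that includes #2794. The variable `old` and the sample values are illustrative only.

```julia
using DataFrames

df = DataFrame(a=1:5)
sdf = view(df, 2:3, :)

old = df.a                    # the vector currently stored in `df` (no copy)

sdf[:, :a] = [20, 30]         # in-place: writes into the existing vector
inplace_same = df.a === old   # the stored vector is still the same object

sdf[!, :a] = [2.5, 3.5]       # allocates a new column in the parent
replaced = df.a !== old       # `df` now stores a different vector

# Filtered-out rows keep their previous values, converted to the promoted
# element type (here `promote_type(Int, Float64)`, i.e. `Float64`).
df
```

The second assignment is the more expensive one because it copies all rows of the parent column, not only the rows visible through `sdf`.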

diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
@@ -113,9 +113,10 @@ In all of these cases, `function` can return either a single row or multiple
 rows. As a particular rule, values wrapped in a `Ref` or a `0`-dimensional
 `AbstractArray` are unwrapped and then treated as a single row.
 
-`select`/`select!` and `transform`/`transform!` always return a `DataFrame`
+`select`/`select!` and `transform`/`transform!` always return a data frame
 with the same number and order of rows as the source (even if `GroupedDataFrame`
-had its groups reordered).
+had its groups reordered), except when selection results in zero columns
+in the resulting data frame (in which case the result has zero rows).
 
 For `combine`, rows in the returned object appear in the order of groups in the
 `GroupedDataFrame`. The functions can return an arbitrary number of rows for
@@ -617,3 +618,166 @@ julia> gd[1]
 ─────┼───────
    1 │     1
 ```
+
+# Simulating the SQL `where` clause
+
+You can conveniently work on subsets of a data frame by using `SubDataFrame`s.
+Operations performed on such objects can either create a new data frame or be
+performed in-place. Here are some examples:
+
+```jldoctest sac
+julia> df = DataFrame(a=1:5)
+5×1 DataFrame
+ Row │ a
+     │ Int64
+─────┼───────
+   1 │     1
+   2 │     2
+   3 │     3
+   4 │     4
+   5 │     5
+
+julia> sdf = @view df[2:3, :]
+2×1 SubDataFrame
+ Row │ a
+     │ Int64
+─────┼───────
+   1 │     2
+   2 │     3
+
+julia> transform(sdf, :a => ByRow(string)) # create a new data frame
+2×2 DataFrame
+ Row │ a      a_string
+     │ Int64  String
+─────┼─────────────────
+   1 │     2  2
+   2 │     3  3
+
+julia> transform!(sdf, :a => ByRow(string)) # update the source df in-place
+2×2 SubDataFrame
+ Row │ a      a_string
+     │ Int64  String?
+─────┼─────────────────
+   1 │     2  2
+   2 │     3  3
+
+julia> df # new column was created filled with missing in filtered-out rows
+5×2 DataFrame
+ Row │ a      a_string
+     │ Int64  String?
+─────┼─────────────────
+   1 │     1  missing
+   2 │     2  2
+   3 │     3  3
+   4 │     4  missing
+   5 │     5  missing
+
+julia> select!(sdf, :a => -, renamecols=false) # update the source df in-place
+2×1 SubDataFrame
+ Row │ a
+     │ Int64
+─────┼───────
+   1 │    -2
+   2 │    -3
+
+julia> df # the column replaced an existing column; previously stored values are re-used in filtered-out rows
+5×1 DataFrame
+ Row │ a
+     │ Int64
+─────┼───────
+   1 │     1
+   2 │    -2
+   3 │    -3
+   4 │     4
+   5 │     5
+```
+
+Similar operations can be performed on `GroupedDataFrame` as well:
+```jldoctest sac
+julia> df = DataFrame(a=[1, 1, 1, 2, 2, 3], b=1:6)
+6×2 DataFrame
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      1
+   2 │     1      2
+   3 │     1      3
+   4 │     2      4
+   5 │     2      5
+   6 │     3      6
+
+julia> sdf = @view df[2:4, :]
+3×2 SubDataFrame
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      2
+   2 │     1      3
+   3 │     2      4
+
+julia> gsdf = groupby(sdf, :a)
+GroupedDataFrame with 2 groups based on key: a
+First Group (2 rows): a = 1
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      2
+   2 │     1      3
+⋮
+Last Group (1 row): a = 2
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │     2      4
+
+julia> transform(gsdf, nrow) # create a new data frame
+3×3 DataFrame
+ Row │ a      b      nrow
+     │ Int64  Int64  Int64
+─────┼─────────────────────
+   1 │     1      2      2
+   2 │     1      3      2
+   3 │     2      4      1
+
+julia> transform!(gsdf, nrow, :b => :b_copy)
+3×4 SubDataFrame
+ Row │ a      b      nrow    b_copy
+     │ Int64  Int64  Int64?  Int64?
+─────┼──────────────────────────────
+   1 │     1      2       2       2
+   2 │     1      3       2       3
+   3 │     2      4       1       4
+
+julia> df
+6×4 DataFrame
+ Row │ a      b      nrow     b_copy
+     │ Int64  Int64  Int64?   Int64?
+─────┼────────────────────────────────
+   1 │     1      1  missing  missing
+   2 │     1      2        2        2
+   3 │     1      3        2        3
+   4 │     2      4        1        4
+   5 │     2      5  missing  missing
+   6 │     3      6  missing  missing
+
+julia> select!(gsdf, :b_copy, :b => sum, renamecols=false)
+3×3 SubDataFrame
+ Row │ a      b_copy  b
+     │ Int64  Int64?  Int64
+─────┼──────────────────────
+   1 │     1       2      5
+   2 │     1       3      5
+   3 │     2       4      4
+
+julia> df
+6×3 DataFrame
+ Row │ a      b_copy   b
+     │ Int64  Int64?   Int64
+─────┼───────────────────────
+   1 │     1  missing      1
+   2 │     1        2      5
+   3 │     1        3      5
+   4 │     2        4      4
+   5 │     2  missing      5
+   6 │     3  missing      6
+```
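
Reviewer sketch complementing the manual examples above, which all use views created with `:`. It illustrates the restriction for a view created with an explicit column selector, assuming a DataFrames.jl release that includes #2794. The `failed` flag and the sample data are illustrative, and the exact exception type is deliberately not assumed.

```julia
using DataFrames

df = DataFrame(a=1:5, b=11:15)
sdf = view(df, 2:3, [:a])     # column subset, not `:`

# Allowed: the result has exactly the column names of `sdf`.
select!(sdf, :a => ByRow(x -> 10x) => :a)

# Not allowed: `a_string` would be a new column name, so an error is thrown.
failed = try
    transform!(sdf, :a => ByRow(string))
    false
catch
    true
end

df
```

Column `b` of the parent is untouched throughout, because it is not part of `sdf`.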
diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl
@@ -107,9 +107,10 @@ const TRANSFORMATION_COMMON_RULES =
     rows. As a particular rule, values wrapped in a `Ref` or a `0`-dimensional
     `AbstractArray` are unwrapped and then treated as a single row.
 
-    `select`/`select!` and `transform`/`transform!` always return a `DataFrame`
+    `select`/`select!` and `transform`/`transform!` always return a data frame
     with the same number and order of rows as the source (even if `GroupedDataFrame`
-    had its groups reordered).
+    had its groups reordered), except when selection results in zero columns
+    in the resulting data frame (in which case the result has zero rows).
 
     For `combine`, rows in the returned object appear in the order of groups in the
     `GroupedDataFrame`. The functions can return an arbitrary number of rows for
@@ -623,17 +624,27 @@ function select_transform!((nc,)::Ref{Any}, df::AbstractDataFrame, newdf::DataFr
 end
 
 """
-    select!(df::DataFrame, args...; renamecols::Bool=true)
+    select!(df::AbstractDataFrame, args...; renamecols::Bool=true)
     select!(args::Base.Callable, df::DataFrame; renamecols::Bool=true)
-    select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
+    select!(gd::GroupedDataFrame, args...; ungroup::Bool=true, renamecols::Bool=true)
     select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)
 
 Mutate `df` or `gd` in place to retain only columns or transformations specified by `args...` and
 return it. The result is guaranteed to have the same number of rows as `df` or
 parent of `gd`, except when no columns are selected (in which case the result
 has zero rows).
 
-If `gd` is passed then it is updated to reflect the new rows of its updated
+If a `SubDataFrame` or `GroupedDataFrame{SubDataFrame}` is passed, the parent data frame
+is updated using columns generated by `args...`, following the same rules as indexing:
+- for existing columns filtered-out rows are filled with values present in the
+  old columns
+- for new columns (which is only allowed if `SubDataFrame` was created with `:`
+  as column selector) filtered-out rows are filled with `missing`
+- if `SubDataFrame` was not created with `:` as column selector then `select!`
+  is only allowed if the transformations keep exactly the same sequence of column
+  names as in the passed `df`
+
+If a `GroupedDataFrame` is passed then it is updated to reflect the new rows of its updated
 parent. If there are independent `GroupedDataFrame` objects constructed using
 the same parent data frame they might get corrupt.
 
@@ -650,6 +661,9 @@ See [`select`](@ref) for examples.
 select!(df::DataFrame, @nospecialize(args...); renamecols::Bool=true) =
     _replace_columns!(df, select(df, args..., copycols=false, renamecols=renamecols))
 
+select!(df::SubDataFrame, @nospecialize(args...); renamecols::Bool=true) =
+    _replace_columns!(df, select(df, args..., copycols=true, renamecols=renamecols))
+
 function select!(@nospecialize(arg::Base.Callable), df::AbstractDataFrame; renamecols::Bool=true)
     if arg isa Colon
         throw(ArgumentError("First argument must be a transformation if the second argument is a data frame"))
@@ -658,14 +672,15 @@ function select!(@nospecialize(arg::Base.Callable), df::AbstractDataFrame; renam
 end
 
 """
-    transform!(df::DataFrame, args...; renamecols::Bool=true)
-    transform!(args::Callable, df::DataFrame; renamecols::Bool=true)
-    transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
+    transform!(df::AbstractDataFrame, args...; renamecols::Bool=true)
+    transform!(args::Callable, df::AbstractDataFrame; renamecols::Bool=true)
+    transform!(gd::GroupedDataFrame, args...; ungroup::Bool=true, renamecols::Bool=true)
     transform!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)
 
 Mutate `df` or `gd` in place to add columns specified by `args...` and return it.
 The result is guaranteed to have the same number of rows as `df`.
-Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`.
+Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`,
+except that column renaming performs a copy.
 
 $TRANSFORMATION_COMMON_RULES
 
@@ -677,7 +692,7 @@ $TRANSFORMATION_COMMON_RULES
 
 See [`select`](@ref) for examples.
 """
-function transform!(df::DataFrame, @nospecialize(args...); renamecols::Bool=true)
+function transform!(df::AbstractDataFrame, @nospecialize(args...); renamecols::Bool=true)
     idx = index(df)
     newargs = Any[if sel isa Pair{<:ColumnIndex, Symbol}
                       idx[first(sel)] => copy => last(sel)