Add subset #2496

Merged
merged 17 commits into from
Jan 11, 2021

Conversation

bkamins
Member

@bkamins bkamins commented Oct 22, 2020

This follows #2314 discussion.

CC @pdeffebach @matthieugomez

@bkamins bkamins added grouping non-breaking The proposed change is not breaking priority labels Oct 22, 2020
@bkamins bkamins added this to the 1.0 milestone Oct 22, 2020
@bkamins
Member Author

bkamins commented Oct 22, 2020

Let us confirm if the functionality is as we like it before I write the tests. Thank you!

@bkamins bkamins mentioned this pull request Oct 22, 2020
20 tasks
@bkamins bkamins requested a review from nalimilan October 22, 2020 10:58
Contributor

@pdeffebach pdeffebach left a comment

Thanks for implementing this!

I don't think you've gotten all the missing logic worked out yet.

julia> df = DataFrame(a = [1, 1, 1, 2, 2, 2, 2], b = [1, 1, 2, 101, 101, 102, missing]);

julia> where(df, :b => (t -> t .> mean(t)))
ERROR: MethodError: no method matching &(::Missing)

I think this only shows up when we return exactly one column that contains missing values. Probably a missing method for (&)(missing).

@matthieugomez
Contributor

matthieugomez commented Oct 23, 2020

This looks great! I just have two comments about missing values.

I'm not a fan of introducing a new kwarg dropmissing = true. As I said in another thread, I don't like R's proliferation of na.rm, na.omit, etc. I think it'd be better to have no keyword argument at all.

Also, I support dropping rows with missing values because that is what R/SQL/Stata do. But @nalimilan was concerned about it, so maybe it would be worth having a vote somewhere to make sure others agree.

@bkamins
Member Author

bkamins commented Oct 23, 2020

Thank you both for looking into it. Indeed there are some rough edges - thank you for catching them (such things happen if tests are not written yet)

(&)(missing) error

This is exactly the reason. I will fix this.

Also, I support dropping rows with missing values because that is what R/SQL/Stata do.

I think we should drop them (but clearly indicate that we behave here differently than expected).

I'm not a fan of introducing a new kwarg dropmissing = true.

I would assume that it would rarely be used, so hopefully it is not that problematic. On the other hand, having it makes it easier for users to notice that we have non-standard behavior here.


In general - thank you for the feedback. I will fix the mistake, and let us wait for @nalimilan to comment when he has time (I would not put a vote on it - I think us four have enough understanding of the ecosystem to decide).

@pdeffebach
Contributor

w.r.t &. I used reduce(and, eachcol(df)) which I think might be better because it doesn't specialize on number of arguments.

@bkamins
Member Author

bkamins commented Oct 23, 2020

w.r.t &. I used reduce(and, eachcol(df))

This is less efficient as it will not be able to do broadcasting fusion (you will have allocations grow linearly with the number of conditions).
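
To make the fusion argument concrete, here is a minimal sketch in plain Julia. It assumes the conditions have already been evaluated into Boolean vectors; the names `conds`, `pairwise`, and `fused` are made up for this illustration and are not taken from the PR.

```julia
# Hypothetical condition results; in the PR these would come from the
# transformations passed to `where`.
conds = [[true, true, false, true],
         [true, false, false, true],
         [true, true, true, false]]

# Pairwise reduction: every step materializes a temporary vector, so the
# number of allocations grows with the number of conditions.
pairwise = reduce((a, b) -> a .& b, conds)

# Single broadcast: `&` accepts any number of arguments, so all conditions
# are combined in one fused loop with one output allocation.
fused = (&).(conds...)

@assert pairwise == fused
```

Note that with exactly one condition the fused form calls one-argument `&`, which is consistent with the `no method matching &(::Missing)` error reported earlier in this thread.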

@bkamins
Member Author

bkamins commented Oct 23, 2020

see JuliaLang/julia#38157

@bkamins
Member Author

bkamins commented Oct 23, 2020

I have thought of pros and cons of dropping row vs erroring on missing.

The question is - if we error on missing how do we cleanly handle something like:

where(df, :x => ByRow(>=(1)), :y => ByRow(>=(2)))

assuming both :x and :y may contain missings.

You would have to write something like:

where(df, :x => x -> coalesce.(x .>= 1, false), :y => y -> coalesce.(y .>= 2, false))

which seems verbose (but indeed it is not the end of the world and at least it is explicit).
If we chose to error on missing by default then I think we do not need dropmissing kwarg, as then using coalesce seems fine for me.

By the way, can you please comment on what would be the recommended style for writing simple filters:

where(df, :x => ByRow(>=(1)), :y => ByRow(>=(2)))

or

where(df, :x => x -> x .>= 1, :y => y -> y .>= 2)

?

The benefit of:

where(df, :x => ByRow(>=(1)), :y => ByRow(>=(2)))

is that it is compiled once, while both:

where(df, :x => x -> x .>= 1, :y => y -> y .>= 2)
where(df, :x => x -> coalesce.(x .>= 1, false), :y => y -> coalesce.(y .>= 2, false))

would be compiled anew on every call.
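
The coalesce pattern discussed above can be tried without any data frame. This is a minimal sketch in plain Julia with made-up vectors `x` and `y` standing in for the columns; it only illustrates how `missing` propagates through the comparisons and how `coalesce` turns the result into a plain Boolean mask.

```julia
# Made-up columns; `missing` marks unknown values.
x = [0, 1, missing, 3, 2]
y = [2, missing, 5, 1, 4]

# Comparisons and `&` follow three-valued logic, so `missing` propagates
# unless the other operand is `false`.
raw = (x .>= 1) .& (y .>= 2)

# coalesce replaces `missing` with `false`, giving a mask that can be
# used directly for row indexing.
mask = coalesce.(x .>= 1, false) .& coalesce.(y .>= 2, false)

@assert isequal(raw, [false, missing, missing, false, true])
@assert mask == [false, false, false, false, true]
```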

@bkamins
Member Author

bkamins commented Oct 24, 2020

@pdeffebach + @matthieugomez: if you could spare some time, it would be great if you could comment with examples, similar to mine above, of what you consider typical use cases, so that we have more hands-on material to make an informed decision about what to do. Thank you!

@bkamins
Member Author

bkamins commented Oct 27, 2020

@nalimilan - do you have an opinion what we should do with handling of missing by where given the discussion?

@nalimilan
Member

Sorry, I've given priority to breaking issues. I think this one could wait for post-0.22 so that we have time to find the best design.

I think having a dropmissing/skipmissing keyword argument is nice as it's simpler to use than coalesce. I'm still not completely sure whether the default should be true or false (as I said I think we need to discuss a general plan for missing values first).

Regarding the name of that argument, dropmissing already exists in our API so it's not completely new, but I agree with @matthieugomez that since other functions take a skipmissing keyword argument it would be easier for users to be consistent.

Comment on lines 109 to 113
function where(df::AbstractDataFrame, @nospecialize(args...);
               dropmissing::Bool=true, view::Bool=false)
    row_selector = _get_where_conditions(df, args, dropmissing)
    return view ? Base.view(df, row_selector, :) : df[row_selector, :]
end
Member

Probably worth using @inline to avoid the type instability, with @noinline in _get_where_conditions (with inference tests to prevent regressions). Same for the GroupedDataFrame method.

Member Author

@bkamins bkamins Oct 30, 2020

OK - I have added @inline. I keep this comment open to remember testing against it when we finalize the design.

Member Author

The code is type unstable even with @inline and @noinline in _get_where_conditions. Do you think it can be resolved somehow? (and do you think it is so problematic given DataFrame is type unstable anyway?)

Member

That's weird. We use this kind of pattern elsewhere and it works, right? Is row_selector correctly inferred?

Member Author

@nalimilan - given that we add the ungroup kwarg, I am not sure we can get type stability here, can we? (as now we have yet another option for the return value, related to the second kwarg)

Member

ungroup shouldn't really make things worse as long as the function is inlined and the values of the arguments are known at compile time. But if that doesn't work in the first place...

Member Author

This is what I have now:

julia> @code_warntype subset(DataFrame(a=1:10), :a => ByRow(<(4)), view=false, skipmissing=false)
Variables
  #unused#::Core.Compiler.Const(DataFrames.var"#subset##kw"(), false)
  @_2::NamedTuple{(:view, :skipmissing),Tuple{Bool,Bool}}
  @_3::Core.Compiler.Const(DataFrames.subset, false)
  df::DataFrame
  args::Tuple{Pair{Symbol,ByRow{Base.Fix2{typeof(<),Int64}}}}
  skipmissing::Bool
  view::Bool
  @_8::Bool
  @_9::Bool

Body::Any
1 ── %1  = Base.haskey(@_2, :skipmissing)::Core.Compiler.Const(true, false)
│          %1
│    %3  = Base.getindex(@_2, :skipmissing)::Bool
│    %4  = (%3 isa DataFrames.Bool)::Core.Compiler.Const(true, false)
│          %4
└───       goto #3
2 ──       Core.Compiler.Const(:(%new(Core.TypeError, Symbol("keyword argument"), :skipmissing, DataFrames.Bool, %3)), false)
└───       Core.Compiler.Const(:(Core.throw(%7)), false)
3 ┄─       (@_8 = %3)
└───       goto #5
4 ──       Core.Compiler.Const(:(@_8 = false), false)
5 ┄─       (skipmissing = @_8)
│    %13 = Base.haskey(@_2, :view)::Core.Compiler.Const(true, false)
│          %13
│    %15 = Base.getindex(@_2, :view)::Bool
│    %16 = (%15 isa DataFrames.Bool)::Core.Compiler.Const(true, false)
│          %16
└───       goto #7
6 ──       Core.Compiler.Const(:(%new(Core.TypeError, Symbol("keyword argument"), :view, DataFrames.Bool, %15)), false)
└───       Core.Compiler.Const(:(Core.throw(%19)), false)
7 ┄─       (@_9 = %15)
└───       goto #9
8 ──       Core.Compiler.Const(:(@_9 = false), false)
9 ┄─       (view = @_9)
│    %25 = (:skipmissing, :view)::Core.Compiler.Const((:skipmissing, :view), false)
│    %26 = Core.apply_type(Core.NamedTuple, %25)::Core.Compiler.Const(NamedTuple{(:skipmissing, :view),T} where T<:Tuple, false)
│    %27 = Base.structdiff(@_2, %26)::Core.Compiler.Const(NamedTuple(), false)
│    %28 = Base.pairs(%27)::Core.Compiler.Const(Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}(), false)
│    %29 = Base.isempty(%28)::Core.Compiler.Const(true, false)
│          %29
└───       goto #11
10 ─       Core.Compiler.Const(:(Core.tuple(@_2, @_3, df)), false)
└───       Core.Compiler.Const(:(Core._apply_iterate(Base.iterate, Base.kwerr, %32, args)), false)
11 ┄ %34 = Core.tuple(skipmissing, view, @_3, df)::Core.Compiler.PartialStruct(Tuple{Bool,Bool,typeof(subset),DataFrame}, Any[Bool, Bool, Core.Compiler.Const(DataFrames.subset, false), DataFrame])
│    %35 = Core._apply_iterate(Base.iterate, DataFrames.:(var"#subset#406"), %34, args)::Any
└───       return %35

julia> f() = subset(DataFrame(a=1:10), :a => ByRow(<(4)), view=false, skipmissing=false)
f (generic function with 1 method)

julia> @code_warntype f()
Variables
  #self#::Core.Compiler.Const(f, false)

Body::Any
1 ─ %1  = (1:10)::Core.Compiler.Const(1:10, false)
│   %2  = (:a,)::Core.Compiler.Const((:a,), false)
│   %3  = Core.apply_type(Core.NamedTuple, %2)::Core.Compiler.Const(NamedTuple{(:a,),T} where T<:Tuple, false)
│   %4  = Core.tuple(%1)::Core.Compiler.Const((1:10,), false)
│   %5  = (%3)(%4)::Core.Compiler.Const((a = 1:10,), false)
│   %6  = Core.kwfunc(Main.DataFrame)::Core.Compiler.Const(Core.var"#Type##kw"(), false)
│   %7  = (%6)(%5, Main.DataFrame)::DataFrame
│   %8  = (<)(4)::Core.Compiler.Const(Base.Fix2{typeof(<),Int64}(<, 4), false)
│   %9  = Main.ByRow(%8)::Core.Compiler.Const(ByRow{Base.Fix2{typeof(<),Int64}}(Base.Fix2{typeof(<),Int64}(<, 4)), false)
│   %10 = (:a => %9)::Core.Compiler.Const(:a => ByRow{Base.Fix2{typeof(<),Int64}}(Base.Fix2{typeof(<),Int64}(<, 4)), false)
│   %11 = (:view, :skipmissing)::Core.Compiler.Const((:view, :skipmissing), false)
│   %12 = Core.apply_type(Core.NamedTuple, %11)::Core.Compiler.Const(NamedTuple{(:view, :skipmissing),T} where T<:Tuple, false)
│   %13 = Core.tuple(false, false)::Core.Compiler.Const((false, false), false)
│   %14 = (%12)(%13)::Core.Compiler.Const((view = false, skipmissing = false), false)
│   %15 = Core.kwfunc(Main.subset)::Core.Compiler.Const(DataFrames.var"#subset##kw"(), false)
│   %16 = (%15)(%14, Main.subset, %7, %10)::Any
└──       return %16

(not passing kwargs does not change this)

Maybe the reason is that we have a vararg function with @nospecialize?

Member Author

removing @nospecialize did not change anything.

@bkamins
Member Author

bkamins commented Oct 30, 2020

I think this one could wait for post-0.22 so that we have time to find the best design.

Sure - we can skip it for now to gather more feedback on what would be best (especially concrete examples, so that we can judge between both designs).

@bkamins
Member Author

bkamins commented Nov 4, 2020

Given the discussion in #2417 I have changed my mind and think that dropmissing=false should be the default in DataFrames.jl.

However, I think we should discuss @pdeffebach's view that DataFramesMeta.jl should use dropmissing=true by default (and that missings would then also be propagated in other operations). That way DataFrames.jl and DataFramesMeta.jl would each have a different but internally consistent default, and users could choose what they prefer. Would that work?

@pdeffebach
Contributor

pdeffebach commented Nov 4, 2020

I have to admit that #2417 is a bit of a niche and tangential issue to base this decision on.

I am okay with a bifurcation in DataFramesMeta. I suppose I will probably start out with mimicking DataFrames and then change it if enough people complain?

This really underscores how we need to

  1. Use :x instead of x for lhs assignment, which opens the door for
  2. Allowing keyword arguments to be passed.

Then it matters less what the default behavior in @where is. The user can always control it.

@bkamins
Member Author

bkamins commented Dec 8, 2020

I would like to revive this PR and make a decision:

  1. do we need where (I vote yes); here the only discussion could be that where is also a keyword in Julia Base, so e.g. some users might be confused and it is very likely that some IDEs might be confused with highlighting
  2. if we add where what should be the default for dropmissing kwarg (I think it should be false)

@pdeffebach
Contributor

I'm fine with a different name for where, not sure what it would be though.

I'm okay with dropmissing = true by default, I guess. One reason is that if users want to use && and ||, which I think is pretty common inside those expressions, they will need to handle missings anyway. Hopefully something like spreadmissings can help with this.

@matthieugomez
Contributor

The only thing I feel strongly about is that dropmissing should be called skipmissing if it defaults to false.

@nalimilan
Member

As I said somewhere before I'm not a fan of the name where since all other functions in the API are verbs, and there's a (small) confusion with the where T syntax. I'd be inclined to use something else, like subset (R term) instead. Anyway our terminology is quite different from SQL since the only term in common is select.

Regarding missings, I think we should be consistent everywhere, and since we are strict in select and transform by default (no propagation of missing values) I think we shouldn't drop missing values automatically here either. But DataFramesMeta should probably choose a more permissive approach in all functions.

@bkamins
Member Author

bkamins commented Dec 11, 2020

OK, so I vote for:

  1. subset as a verb (this is OK as we treat data frame as a collection of rows)
  2. skipmissing=false as a kwarg name and default value
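
For readers following along, the behavior voted for above can be sketched with the function as it was eventually merged. This assumes a DataFrames.jl version that ships `subset` (1.0 or later); the data frame `df` and the names `kept` and `strict_errors` are made up for the illustration.

```julia
using DataFrames

# Made-up data with missing values in both columns.
df = DataFrame(x = [0, 1, missing, 3], y = [5, 2, 7, missing])

# skipmissing=true: rows whose condition evaluates to `missing` are dropped.
kept = subset(df, :x => ByRow(>=(1)), :y => ByRow(>=(2)); skipmissing=true)

# Default (skipmissing=false): a `missing` condition result is an error,
# so the call below is expected to throw.
strict_errors = try
    subset(df, :x => ByRow(>=(1)), :y => ByRow(>=(2)))
    false
catch
    true
end

@assert nrow(kept) == 1
@assert strict_errors
```

Only the row with `x = 1, y = 2` satisfies both conditions without involving a missing value, so it is the only row kept.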

@bkamins
Member Author

bkamins commented Jan 1, 2021

I would even avoid mentioning it.

OK - I will add a comment in the code though.

bkamins and others added 3 commits January 1, 2021 22:45
Comment on lines 159 to 160
@inline function subset(gdf::GroupedDataFrame, @nospecialize(args...);
                        skipmissing::Bool=false, view::Bool=false)
Member

Doesn't have to be implemented right now, but I guess at some point it will make sense to support ungroup=false to get a GroupedDataFrame like with select/transform?

Member Author

This is what I had in an initial implementation. Then I thought that the problem is that ungroup=false does not give us any benefit (as in select/transform as we have to run groupby anyway) and complicates the API. But given you raised it I think it is OK to have it for consistency. I will add it back.

Member

We could apply the grouping code directly to group indices, using the same path as refarray/refpool. It could often be faster than using the original columns. Maybe leave a TODO?

bkamins and others added 2 commits January 3, 2021 12:50
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
bkamins and others added 2 commits January 3, 2021 17:31
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Jan 3, 2021

I have applied the same fix in the other docstring.

@bkamins
Member Author

bkamins commented Jan 3, 2021

We could apply the grouping code directly to group indices

You mean to use gdf.groups[row_selector] and based on this create groups faster?
The code would be something like:

res.newuniquetempcol = gdf.groups[row_selector]
gdf2 = groupby(res, :newuniquetempcol)
copy!(gdf2.cols, gdf.cols)
select!(res, 1:ncol(res)-1)
return res

Do you think it would be worth it? (I expect this code path will not be used much.)

@nalimilan
Member

Yes, something like this. That's not super important, but it doesn't hurt (at least as a TODO).

@bkamins
Member Author

bkamins commented Jan 3, 2021

I will add it as TODO, as I do not want to overcomplicate the code now.

@bkamins
Member Author

bkamins commented Jan 3, 2021

I left a TODO, as:

  1. in some cases it will be slower than doing groupby on the original column (e.g. when grouping by CategoricalArray it is faster)
  2. in some cases it will be the same
  3. in some cases of course it will be faster

In order to do a fully fast implementation we should treat gdf.groups[row_selector] as categorical (which it is in fact), but this would complicate code more than what I proposed above (we would have to use a special grouping path as you proposed).

@bkamins
Member Author

bkamins commented Jan 6, 2021

OK - so regarding this PR, is there anything we should discuss, or is it good to be merged?

Member

@nalimilan nalimilan left a comment

It would be nice to fix the return type stability issue at some point, but this doesn't have to block the PR.

@bkamins
Member Author

bkamins commented Jan 6, 2021

@pdeffebach, @matthieugomez - I will merge this PR later this week. Please comment if you want to add something to the discussion before merging.

@pdeffebach
Contributor

This PR looks good to me. I have a few suggestions.

  1. Is it too late to rename skipmissing to missingasfalse? The second is clearer and doesn't conflict with Base.skipmissing.
  2. There is a section in the docs called "Taking a Subset" which makes no mention of this. Perhaps we should add a line in the docs? We also don't mention filter there. (Could be a separate PR)
  3. Compile time is not great. I'm getting about .25 seconds for every call with an anonymous function, compared to about .15 seconds with an equivalent transform call. This can also be fixed in another PR, or more realistically a massive 1.X PR.
  4. Do we have an opinionated way to do "OR" operations? Particularly with broadcasting and avoiding || as that doesn't play well with missing.

@bkamins
Member Author

bkamins commented Jan 8, 2021

These are good points. Thank you for raising them.

Is it too late to rename skipmissing to missingasfalse?

I thought @matthieugomez wanted skipmissing exactly to reduce the number of words to learn. Also, groupby uses skipmissing to mean the same thing we mean here (i.e. drop missings). I am not sure what is best here (naming is always super hard).

There is a section in the docs called "Taking a Subset" which makes no mention of this.

I will add filter and subset in this PR to this section.

Compile time is not great.

This is strange. I would think that it should be the same, as most of the compilation work should be in select. I will try to investigate (but as you know these things are hard; cf. #2566, which I will probably want to follow up on before the 1.0 release).

Do we have an opinionated way to do "OR" operations?

Probably one would have to do it manually (which means it would be not very convenient). My thinking was that "OR" is needed much less frequently than "AND" (and if you need "OR" you probably want it on the same column, which can be expressed using an anonymous function).
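
As a rough illustration of the manual route, here is a minimal sketch in plain Julia. The vectors `a` and `b` are made-up condition results, and the whole snippet is an assumption about how one might combine them by hand, not something the PR provides.

```julia
# Made-up results of two conditions evaluated row by row.
a = [true, false, missing, missing]
b = [missing, missing, true, false]

# Broadcasted `|` follows three-valued logic: `true | missing` is `true`,
# while `false | missing` stays `missing`.
either = a .| b

# `||` needs a Bool on its left-hand side, so a `missing` there throws.
short_circuit_throws = try
    missing || true
    false
catch err
    err isa TypeError
end

# Treat "unknown" as "does not match" to obtain a plain Boolean mask.
mask = coalesce.(either, false)

@assert isequal(either, [true, missing, true, missing])
@assert mask == [true, false, true, false]
```

The resulting mask could then be returned from a single multi-column transformation, which keeps the "OR" logic explicit at the cost of some verbosity.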

@bkamins
Member Author

bkamins commented Jan 8, 2021

I started updating the manual, but given the discussion we had today on Slack I gave up - it was too ad hoc. The manual needs to be updated. I have opened a separate PR for this: #2595.

@nalimilan
Member

IMO using skipmissing everywhere is indeed better for simplicity.

@bkamins
Member Author

bkamins commented Jan 9, 2021

I have investigated the performance related to compilation and I could not improve things here (the compilation time is mostly select except for the first call where we compile also other methods):

~/Desktop/Dev/DF_dev$ julia --banner=no
 Activating environment at `~/Desktop/Dev/DF_dev/Project.toml`
julia> using DataFrames

julia> df = DataFrame(a=true)
1×1 DataFrame
 Row │ a    
     │ Bool 
─────┼──────
   1 │ true

julia> @time subset(df, :a => ByRow(x -> x))
  1.042957 seconds (2.91 M allocations: 149.359 MiB, 2.55% gc time)
1×1 DataFrame
 Row │ a    
     │ Bool 
─────┼──────
   1 │ true

julia> @time subset(df, :a => ByRow(x -> x))
  0.248947 seconds (603.93 k allocations: 30.169 MiB, 5.64% gc time)
1×1 DataFrame
 Row │ a    
     │ Bool 
─────┼──────
   1 │ true

julia> @time subset(df, :a => ByRow(x -> x))
  0.225462 seconds (603.94 k allocations: 30.172 MiB)
1×1 DataFrame
 Row │ a    
     │ Bool 
─────┼──────
   1 │ true


~/Desktop/Dev/DF_dev$ julia --banner=no
 Activating environment at `~/Desktop/Dev/DF_dev/Project.toml`
julia> using DataFrames

julia> df = DataFrame(a=true)
1×1 DataFrame
 Row │ a    
     │ Bool 
─────┼──────
   1 │ true

julia> @time select(df, :a => ByRow(x -> x))
  0.609927 seconds (1.77 M allocations: 90.192 MiB, 4.28% gc time)
1×1 DataFrame
 Row │ a_function 
     │ Bool       
─────┼────────────
   1 │       true

julia> @time select(df, :a => ByRow(x -> x))
  0.213393 seconds (563.55 k allocations: 27.732 MiB, 5.39% gc time)
1×1 DataFrame
 Row │ a_function 
     │ Bool       
─────┼────────────
   1 │       true

julia> @time select(df, :a => ByRow(x -> x))
  0.214110 seconds (563.56 k allocations: 27.735 MiB, 5.01% gc time)
1×1 DataFrame
 Row │ a_function 
     │ Bool       
─────┼────────────
   1 │       true

I will wait for feedback, and if there are no more comments I am going to merge this PR soon, as it is now.

@bkamins bkamins merged commit 6519d31 into JuliaData:main Jan 11, 2021
@bkamins bkamins deleted the add_where branch January 11, 2021 16:46
@bkamins
Member Author

bkamins commented Jan 11, 2021

Thank you!

Labels
grouping non-breaking The proposed change is not breaking priority
4 participants