use of map in ByRow #2957

Closed
bkamins opened this issue Dec 5, 2021 · 1 comment · Fixed by #2982

bkamins commented Dec 5, 2021

The problem is:

julia> using DataFrames, SparseArrays

julia> df = DataFrame(x=spzeros(10))
10×1 DataFrame
 Row │ x       
     │ Float64 
─────┼─────────
   1 │     0.0
   2 │     0.0
   3 │     0.0
   4 │     0.0
   5 │     0.0
   6 │     0.0
   7 │     0.0
   8 │     0.0
   9 │     0.0
  10 │     0.0

julia> transform(df, :x => ByRow(x -> rand()))
10×2 DataFrame
 Row │ x        x_function 
     │ Float64  Float64    
─────┼─────────────────────
   1 │     0.0   0.0797826 
   2 │     0.0   0.0797826 
   3 │     0.0   0.0797826 
   4 │     0.0   0.0797826 
   5 │     0.0   0.0797826 
   6 │     0.0   0.0797826 
   7 │     0.0   0.0797826 
   8 │     0.0   0.0797826 
   9 │     0.0   0.0797826 
  10 │     0.0   0.0797826 

which is incorrect: rand() should be called once per row, so the rows of x_function should differ.

The reason is that ByRow still uses map in its definition:

(f::ByRow)(cols::AbstractVector...) = map(f.fun, cols...)

We kept map because it was efficient. I could change it, but maybe we should instead consider the behavior of map for SparseVector incorrect and keep using map?

@nalimilan, also, I am not sure whom to ping from the SparseArrays community.
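For illustration, here is a minimal sketch of what "changing it" could look like: calling the function once per row with a comprehension instead of delegating to map. The name `byrow_apply` is hypothetical and this is not the DataFrames.jl implementation (nor necessarily what the eventual fix did); it only assumes that iterating a SparseVector visits every element, stored or not.

```julia
using SparseArrays

# Hypothetical helper (not part of DataFrames.jl): apply `fun` row by row.
# The comprehension iterates over every position of the columns, so `fun`
# is invoked once per row regardless of the column's storage type.
function byrow_apply(fun, col::AbstractVector, cols::AbstractVector...)
    return [fun(args...) for args in zip(col, cols...)]
end

x = spzeros(10)

# Count the invocations to check that each row gets its own call.
calls = Ref(0)
res = byrow_apply(_ -> (calls[] += 1; rand()), x)

println("calls = ", calls[], ", result type = ", typeof(res))
```

The trade-off is that the result is always a dense Vector, so any efficiency or sparsity-preserving behavior that map provides for special array types is lost.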

bkamins added the bug label Dec 5, 2021
bkamins added this to the patch milestone Dec 5, 2021
@nalimilan

So basically this boils down to:

julia> using SparseArrays

julia> x = spzeros(10)
10-element SparseVector{Float64, Int64} with 0 stored entries

julia> map(_ -> rand(), x)
10-element SparseVector{Float64, Int64} with 10 stored entries:
  [1 ]  =  0.809673
  [2 ]  =  0.809673
  [3 ]  =  0.809673
  [4 ]  =  0.809673
  [5 ]  =  0.809673
  [6 ]  =  0.809673
  [7 ]  =  0.809673
  [8 ]  =  0.809673
  [9 ]  =  0.809673
  [10]  =  0.809673

I don't think DataFrames should take care of this. That's a problem in SparseArrays, and as long as map behaves like that there, it's OK for ByRow to do the same. But I agree it's weird, and we changed PooledArrays precisely to avoid that. Maybe we could find a common solution, like pure=false in PooledArrays.

I'm not sure who we could ping; it's probably best to file an issue in SparseArrays.
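One way to check how map treats a given array type is to count how often the function is actually invoked. The sketch below uses a hypothetical `count_calls` helper (not part of any package). For a dense Vector, map calls the function once per element. For a SparseVector the count depends on the map method that SparseArrays provides, so no expected value is stated for that case; a count below the vector length would be consistent with the repeated value shown above.

```julia
using SparseArrays

# Hypothetical diagnostic helper: returns how many times `map`
# invoked the mapped function for the given vector.
function count_calls(x::AbstractVector)
    calls = Ref(0)
    map(_ -> (calls[] += 1; rand()), x)
    return calls[]
end

dense_calls = count_calls(zeros(10))      # dense Vector: one call per element
sparse_calls = count_calls(spzeros(10))   # SparseVector: depends on SparseArrays

println((dense = dense_calls, sparse = sparse_calls))
```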
