Clean up colwise #485

Closed
johnmyleswhite opened this issue Jan 21, 2014 · 4 comments
Comments

@johnmyleswhite
Contributor

Instead of offering functions such as colmean and colstd, I'd like to lean more heavily on colwise. Right now its behavior is surprising, although I think I understand the logic.

using DataFrames
df = DataFrame(A = 1:4, B = randn(4))
colwise(cumsum, df) # -> Evaluates to Array{Any}

I assume we do this because the return values of a function may differ in length across columns, which means we can't do better than return a generic Array{Any}. That might be the right approach, but it's worth making sure that we prefer this very general strategy over something that would produce a more easily interpreted DataFrame.

@nalimilan
Member

For most uses the length of the result will be the same for all columns, so indeed it would be nice to get a more specific type. I think it would be natural to return the same type that a list comprehension would return.
colwise() could even be defined as a comprehension, i.e. colwise(mean, df) = [mean(df[i]) for i in 1:length(df)].

Currently issue JuliaLang/julia#5258 gets in the way, but in the long term a comprehension like this would not return an Array{Any}.
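
As a rough sketch of the comprehension idea (not the DataFrames implementation): the hypothetical helper `colwise_sketch` below works on a plain vector of columns instead of a DataFrame, so it runs with Base alone. The column data is made up for illustration. When every column yields the same result type, the comprehension should get a concrete element type rather than Any.

```julia
# Hypothetical stand-in for a DataFrame: a vector of equally typed columns.
columns = [[1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 2.5, 3.5]]

# Hypothetical helper, not part of DataFrames: apply f to each column
# with a comprehension, as suggested above.
colwise_sketch(f, cols) = [f(col) for col in cols]

sums = colwise_sketch(sum, columns)     # one scalar per column
cums = colwise_sketch(cumsum, columns)  # one vector per column

# Inspect the element types the comprehension produced.
println(eltype(sums))
println(eltype(cums))
```

Columns whose results have different types would widen the element type, so this only helps in the common case described above.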

@mkborregaard
Contributor

mkborregaard commented Sep 12, 2016

Revisiting this 2 years later, colwise still returns an array of Any, where each element is a Vector, even when the result of each operation is a scalar. Example:

df = DataFrame(a = repeat([1,2,3,4], outer = [2]), b = repeat([2,3], outer = [4]), c = randn(8))
cs = colwise(sum, df)

gives

3-element Array{Any,1}:
 [20]    
 [20]    
 [5.0978]

I find this inconvenient for further work on the column results, e.g. I may want to make a histogram of the column sums. I then have to flatten the result first:

using Plots; histogram(vcat(cs...))
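
For reference, a minimal sketch of the same flattening step using only Base and no splatting. The `cs` here is a hand-built stand-in that mimics the result shown above, not actual colwise output.

```julia
# Stand-in for the colwise result above: an Any array of 1-element vectors.
cs = Any[[20], [20], [5.0978]]

# Concatenate the small vectors pairwise; vcat promotes Int and Float64
# entries to a common element type along the way.
flat = reduce(vcat, cs)

println(flat)
```

The resulting flat vector can then be passed to a plotting call such as the histogram above.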

@cjprybol
Contributor

@mkborregaard this has been addressed in DataTables by JuliaData/DataTables.jl#28

@mkborregaard
Contributor

Great, thanks for the heads-up! Must be nice to be able to close a three-year-old issue :-)
