
Generic discussions about using metaprogramming with DataFrames #1

Closed
tshort opened this issue Feb 5, 2014 · 24 comments


@tshort
Contributor

tshort commented Feb 5, 2014

Here are several issues that discuss metaprogramming and/or different approaches to querying and manipulating DataFrames:

Please add additional comments here on better approaches to querying DataFrames.

@floswald
Contributor

Hey @tshort, can I just briefly say that this is one of the most important packages for my daily work? The combination with Lazy is hard to beat. Do you have any plans to integrate this with DataFrames at some point?
Thanks!

@tshort
Contributor Author

tshort commented Oct 23, 2014

Hi Florian. Thanks for the feedback. This is still an experimental package. DataFrames needs something extra to improve performance, tighten up syntax, and add LINQ-like features. Is DataFramesMeta the best way to do that? I'm not sure. More feedback would be great! (I haven't used it much for real projects.)

  • Have you run into issues or unexpected results?
  • Do you see real-life performance gains?
  • What would you change?
  • Does it feel Julian?

As to plans for integrating into DataFrames, that's up to (mainly) @johnmyleswhite and @simonster. It'll need more testing and user input. Maybe we start small with @with. Maybe we need to add this package to METADATA to make it easier for people to try.

@johnmyleswhite

I think encouraging people to try this package out would be a good idea. I'm still really hung up on making sure we get the lower level stuff like Nullable right, but there's no reason people can't start trying out this package to see how it helps them.

@floswald
Contributor

Hi Tom and John,

I am actually using it for real work. It allows me to plough through large dataframes very quickly and very intuitively. My typical usage is with Lazy.jl to do something like this:

sum1 = @> begin
        sim1
        @where((:j .== j) & (:year .> 1997))
        @transform(move_own = :move .* :own, move_rent = :move .* (!:own), buy = (:h .== 0) .* (:hh .== 1))
        @by(:year, move_own = mean(:move_own.data, WeightVec(:density.data)), move_rent = mean(:move_rent.data, WeightVec(:density.data)))
        end
  • I don't have any concerns about performance; I certainly couldn't complain about it being slower than using plain DataFrames. It would take me much longer to come up with the right expressions otherwise.
  • I would add something that lets me return several objects as a tuple (maybe there's a way to do that already?):
sum1 = @> begin
        sim1
        @where((:j .== j) & (:year .> 1997))
        @transform(move_own = :move .* :own, move_rent = :move .* (!:own), buy = (:h .== 0) .* (:hh .== 1))
        out1 = @by(:year, move_own = mean(:move_own.data, WeightVec(:density.data)), move_rent = mean(:move_rent.data, WeightVec(:density.data)))
        out2 = @by(:age, move_own = mean(:move_own.data, WeightVec(:density.data)), move_rent = mean(:move_rent.data, WeightVec(:density.data)))
        (out1, out2)
        end
  • Does it feel Julian? It does to me, because of the macros, yes.
  • Adding this to METADATA would be a good idea. I came across this by pure chance.

@johnmyleswhite

Glad that works so well for you. I personally find that code really hard to make sense of because of all the implicit arguments.

@floswald
Contributor

Yeah, I can see why you say that. It took me a little while to get used to it. It's always the first argument that gets piped in from the previous expression.
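A toy illustration of that convention, assuming Lazy.jl's @> macro behaves as its README documents (a sketch, not code from this thread):

```julia
using Lazy  # assumes the Lazy.jl package is installed

add1(x) = x + 1
double(x) = 2x

# @> threads each result in as the FIRST argument of the next call,
# so this expands to double(add1(4)):
@> 4 add1 double   # 10
```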

@johnmyleswhite

Thanks, that helps me understand. Is there a version where the piping is explicit? That's what I find most confusing.

@johnmyleswhite

Also, if you'd like to put this in METADATA.jl, I think that's a good idea.

@garborg

garborg commented Oct 24, 2014

@johnmyleswhite As far as I could tell, the @as macro is the most likely, if any, to make it into base.

It at least makes the arguments explicit:

@as _ begin
    sim1
    @where(_, (:j .== j) & (:year .> 1997))
    @transform(_, move_own = :move .* :own,
                  move_rent = :move .* (!:own),
                  buy = (:h .== 0) .* (:hh .== 1))
    @by(_, :year, move_own = mean(:move_own.data, WeightVec(:density.data)),
                  move_rent = mean(:move_rent.data, WeightVec(:density.data)))
end

More than worth the extra three characters per line IMO. I'm undecided how I feel about the lack of pipes there. Fortunately, I'm pretty sure the Lazy.jl version below won't make it in without pipes:

@as _ sim1 @where(_, (:j .== j) & (:year .> 1997)) @transform(_, move_own = :move .* :own)

@shashi

shashi commented Oct 24, 2014

This package is awesome, and should definitely be in METADATA.jl. I really like the macros and @> for the simple and natural language they provide together, Julian or not.

As for performance and increased expressiveness, there may be some optimization opportunities (please correct me if any of this is incorrect or unreasonable):

  • map in base is implemented over AbstractArrays. Currently, broadcast(f::Function, As::StridedArray) is a faster way to map over a vector of numbers.
    By just using broadcast as map, we could express these queries better: @where, @transform, and @by need not require broadcasted expressions. E.g. @where(_, (:j .== j) & (:year .> 1997)) could just be @where(_, (:j == j) & (:year > 1997)). Now these operations (== and >) in the where clause can be reasoned about per-row, as one would in an SQL statement. Besides, any function f can be used in place of the operators.
  • Transducers! (https://www.youtube.com/watch?v=6mTbuzafcII#t=1227) - transducers could make it possible to compose map, filter, and reduce processes and apply the composition in a single loop. This should also mean less memory allocation.
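The transducer idea in the last bullet can be sketched in a few lines of plain Julia (a hypothetical illustration, not an API from this package): each "transducer" wraps a reducing step, and composing them fuses map, filter, and reduce into one loop with no intermediate arrays.

```julia
# Hypothetical sketch: transducers as reducing-function transformers.
mapping(f)      = step -> (acc, x) -> step(acc, f(x))
filtering(pred) = step -> (acc, x) -> pred(x) ? step(acc, x) : acc

function transduce(xform, step, init, xs)
    rf = xform(step)        # build the fused reducing function once
    acc = init
    for x in xs             # a single pass, no temporaries
        acc = rf(acc, x)
    end
    acc
end

# Sum of squares of the even numbers in 1:10, in one loop:
xf = filtering(iseven) ∘ mapping(x -> x^2)
transduce(xf, +, 0, 1:10)   # 4 + 16 + 36 + 64 + 100 == 220
```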

There are some concerns with these points though:

@garborg

garborg commented Oct 24, 2014

I would be REALLY happy to see anything that made composition of functional
iterators (filter, Iterators.jl, etc.) faster in Julia.


@johnmyleswhite

I guess I don't really understand why iterators aren't fast right now. When you nest iterators, do they no longer allow inlining?

@garborg

garborg commented Oct 25, 2014

I haven't looked into it enough to have much to say, but another thing is that when it comes to things like filtering, there are extra loops in the next methods, etc., so it gets far removed from a single loop pretty quickly. I'm not sure how much getting around that there is without adding something like a skip function that defaults to false. Relatedly, I've heard a couple of people mention using

@garborg

garborg commented Oct 25, 2014

Nullable to combine next and done to avoid the mild unpleasantness of mutation happening in done, which would be nice, but might also make fusing the iterators harder. Just half-baked thoughts.

@johnmyleswhite

I definitely think that using Nullable as part of the iteration protocol is a good idea.

I guess I assume you need nested loops for certain compositions of iterators. In particular, I was thinking we should remove pad from DataArrays and make it into an iterator, which could be a big gain in applications where you need to avoid allocating a lot of extra memory.
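A rough sketch of what a Nullable-based protocol might look like (hypothetical, written in the v0.4-era syntax of this thread; Nullable was later removed from Base): a single call returns Nullable((item, state)), with the null case signalling exhaustion, so the mutation-in-done awkwardness and the filtering skip-loop both live in one function.

```julia
# Hypothetical: one `trynext` call replaces the done()/next() pair.
# A null result means the iterator is exhausted; the filtering skip-loop
# happens inside the call instead of in done().
function trynext(xs::Vector{Int}, i::Int, pred)
    while i <= length(xs)
        pred(xs[i]) && return Nullable((xs[i], i + 1))
        i += 1
    end
    Nullable{Tuple{Int,Int}}()
end

function sumfiltered(xs, pred)
    s, i = 0, 1
    r = trynext(xs, i, pred)
    while !isnull(r)
        x, i = get(r)
        s += x
        r = trynext(xs, i, pred)
    end
    s
end

sumfiltered([1, 2, 3, 4, 5, 6], iseven)   # 2 + 4 + 6 == 12
```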

@garborg

garborg commented Oct 25, 2014

Oh yes, that makes sense, and Iterators.chain is the same thing. Still,
composition could be worth playing around with.


@tshort
Contributor Author

tshort commented Nov 17, 2014

DataFramesMeta is now in METADATA...

@davidagold
Contributor

With regard to JuliaData/DataFrames.jl#369, I have been working on a simple @byrow macro that supports if blocks:

julia> df
8x3 DataFrame
| Row | a | b   | col_1    |
|-----|---|-----|----------|
| 1   | 1 | "M" | 0.731688 |
| 2   | 2 | "F" | 0.294839 |
| 3   | 3 | "M" | 0.667601 |
| 4   | 4 | "M" | 0.24186  |
| 5   | 5 | "F" | 0.247961 |
| 6   | 6 | "F" | 0.302071 |
| 7   | 7 | "M" | 0.167708 |
| 8   | 8 | "F" | 0.664298 |

julia> @byrow df if :a > 1; :b = "foo" end

julia> df
8x3 DataFrame
| Row | a | b     | col_1    |
|-----|---|-------|----------|
| 1   | 1 | "M"   | 0.731688 |
| 2   | 2 | "foo" | 0.294839 |
| 3   | 3 | "foo" | 0.667601 |
| 4   | 4 | "foo" | 0.24186  |
| 5   | 5 | "foo" | 0.247961 |
| 6   | 6 | "foo" | 0.302071 |
| 7   | 7 | "foo" | 0.167708 |
| 8   | 8 | "foo" | 0.664298 |

julia> macroexpand( :( @byrow df if :a > 1; :b = "Foo" end ))
:(for row = 1:length(df[1])
        if df[row,:a] > 1 # line 1:
            df[row,:b] = "Foo"
        end
    end)

You can also use begin end blocks to improve readability:

julia> @byrow df begin
                   if :a > 1
                       :b = "bar"
                   end
                 end

julia> df
8x3 DataFrame
| Row | a | b     | col_1    |
|-----|---|-------|----------|
| 1   | 1 | "M"   | 0.731688 |
| 2   | 2 | "bar" | 0.294839 |
| 3   | 3 | "bar" | 0.667601 |
| 4   | 4 | "bar" | 0.24186  |
| 5   | 5 | "bar" | 0.247961 |
| 6   | 6 | "bar" | 0.302071 |
| 7   | 7 | "bar" | 0.167708 |
| 8   | 8 | "bar" | 0.664298 |

The code is here. I haven't tested performance yet. My approach is very simple and doesn't generate a new function for which type inference is unimpeded, as in @tshort's implementation of @with. Furthermore, I haven't yet experimented with the Devectorize.jl package, and I don't know how much this macro overlaps with it. All that said, I think this syntax could go a long way toward ameliorating issues such as those described in 369 and in this thread.

I'm very happy to continue working on this functionality for DataFramesMeta if folks here like it. Immediate next steps are to add more expressive power, such as the ability to index into an array x[:] that isn't a column of the implicit DataFrame. This will probably require some way to denote that the index for x[:] needs to be hooked into the loop generated by the macro, e.g.:

@byrow df begin
    if :a > ^x
        :a = 2(^x)
    end
end

to get

for row in 1:length(df[1])
    if df[row, :a] > x[row]
        df[row, :a] = 2x[row]
    end
end

EDIT: Actually, the above turns out to be unnecessary. If x is a vector outside the "scope" of df, then just writing x[row] will suffice:

julia> macroexpand( :( @byrow df if :a > 1; :b = x[row] end ))
:(for row = 1:length(df[1])
        if df[row,:a] > 1 # line 1:
            df[row,:b] = x[row]
        end
    end)

If anybody has thoughts, please do share them! @nalimilan, is this along the lines of what you had in mind in 369 above? (Thank you for humoring my reference to your posts from over a year ago...)

@nalimilan
Member

@davidagold Interesting. Indeed, it looks like what I was describing in JuliaData/DataFrames.jl#369. I don't think you need to worry about Devec.jl; I see it as orthogonal to this kind of macro.

@tshort
Contributor Author

tshort commented May 22, 2015

I like this idea, @davidagold; it would make a good addition to DataFramesMeta. I don't think it will perform well with the indexing inside the loop, though. For better performance, you could try converting to something like the following:

@with df for row in 1:length(df[1])
    if :a[row] > x[row]
        :a[row] = 2x[row]
    end
end

@davidagold
Contributor

Thank you both for your inputs! Tom, you're right about performance:

using DataArrays, DataFrames, DataFramesMeta

srand(1)
n = 10_000_000
a = rand(n)
b = rand(n)
c = rand(n)
d = zeros(n)
df = DataFrame(a=a, b=b, c=c, d=d)

function f1()
    x = 0.0
    @byrow df (begin
        if :a < :b
            x += :b * :c
        end
    end)
    return x
end

function f2()
    x = 0.0
    a = convert(Array, df[:a])
    b = convert(Array, df[:b])
    c = convert(Array, df[:c])
    for row in 1:10_000_000
        if a[row] < b[row]
            x += b[row] * c[row]
        end
    end
    return x
end

function f3()
    x = 0.0
    @with df (begin
        for row in 1:length(df[1])
            if :a[row] < :b[row]
                x += :b[row] * :c[row]
            end
        end
    end)
    return x
end

function g1()
    @byrow df begin
        if :a < :b
            :d = :b * :c
        end
    end
end

function g2()
    a = convert(Array, df[:a])
    b = convert(Array, df[:b])
    c = convert(Array, df[:c])
    for row in 1:10_000_000
        if a[row] < b[row]
            df[row, :d] = b[row] * c[row]
        end
    end
end

function g3()
    @with df begin
        for row in 1:length(df[1])
            if :a[row] < :b[row]
                :d[row] = :b[row] * :c[row]
            end
        end
    end
end

f1()
f2()
f3()
g1()
g2()
g3()

println("f1: ", @time(f1()))
println("f2: ", @time(f2()))
println("f3: ", @time(f3()))
println("g1: ", @time(g1()))
println("g2: ", @time(g2()))
println("g3: ", @time(g3()))

gives

elapsed time: 5.64861955 seconds (1760130256 bytes allocated, 20.85% gc time)
f1: 1.6670060387376607e6

elapsed time: 2.464278376 seconds (1120132752 bytes allocated, 37.23% gc time)
f2: 1.6670060387376607e6

elapsed time: 3.435715078 seconds (1280088960 bytes allocated, 23.48% gc time)
f3: 1.6670060387376607e6

elapsed time: 6.304825319 seconds (1920164824 bytes allocated, 20.65% gc time)
g1: nothing

elapsed time: 3.157730953 seconds (1360203344 bytes allocated, 31.98% gc time)
g2: nothing

elapsed time: 3.384456055 seconds (1200062760 bytes allocated, 23.14% gc time)
g3: nothing

I'll go work on your suggestion. May I submit it as a PR when it's ready?

@davidagold
Contributor

@tshort After implementing your suggestion the same tests as above now give:

elapsed time: 3.841014122 seconds (1280088960 bytes allocated, 25.72% gc time)
f1: 1.6670060387376607e6

elapsed time: 2.611359312 seconds (1120132752 bytes allocated, 37.45% gc time)
f2: 1.6670060387376607e6


elapsed time: 3.53787387 seconds (1200062760 bytes allocated, 22.51% gc time)
g1: nothing

elapsed time: 3.085036564 seconds (1360203344 bytes allocated, 32.94% gc time)
g2: nothing

where (as a reminder) f1/g1 use @byrow (which now essentially does the same thing as f3/g3 above) and f2/g2 converts DataFrame columns to Arrays and loops through the latter. If you have any more suggestions for performance, I am game to investigate.

@davidagold
Contributor

Aaactually, I've run into an issue. Going to file.

@tshort
Contributor Author

tshort commented May 22, 2015

Yes to the PR when ready, @davidagold. As for your tests, try them without using globals; they should all be faster.
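For instance, passing the DataFrame in as an argument instead of reading the global df (a sketch based on the f2 benchmark above, with the same assumed columns and the era's df[:a] indexing):

```julia
# Same work as f2 above, but `df` arrives as an argument, so the
# compiler can infer concrete types for a, b, and c instead of
# treating them as values derived from an untyped global.
function f2_arg(df)
    x = 0.0
    a = convert(Array, df[:a])
    b = convert(Array, df[:b])
    c = convert(Array, df[:c])
    for row in 1:length(a)
        if a[row] < b[row]
            x += b[row] * c[row]
        end
    end
    return x
end

f2_arg(df)   # same answer as f2(), without the per-access global lookup
```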
