
Generic discussions about using metaprogramming with DataFrames #1

Closed
tshort opened this issue Feb 5, 2014 · 24 comments


@tshort
Contributor

tshort commented Feb 5, 2014

Here are several issues that discuss metaprogramming and/or different approaches to querying and manipulating DataFrames:

Please add additional comments here on better approaches to querying DataFrames.

@floswald
Contributor

Hey @tshort, can I just briefly say that this is one of the most important packages for my daily work? The combination with Lazy is hard to beat. Do you have any plans to integrate this with DataFrames at some point?
Thanks!

@tshort
Contributor Author

tshort commented Oct 23, 2014

Hi Florian. Thanks for the feedback. This is still an experimental package. DataFrames needs something extra to improve performance, tighten up syntax, and add LINQ-like features. Is DataFramesMeta the best way to do that? I'm not sure. More feedback would be great! (I haven't used it much for real projects.)

  • Have you run into issues or unexpected results?
  • Do you see real-life performance gains?
  • What would you change?
  • Does it feel Julian?

As to plans for integrating into DataFrames, that's up to (mainly) @johnmyleswhite and @simonster. It'll need more testing and user input. Maybe we start small with @with. Maybe we need to add this package to METADATA to make it easier for people to try.

@johnmyleswhite

I think encouraging people to try this package out would be a good idea. I'm still really hung up on making sure we get the lower level stuff like Nullable right, but there's no reason people can't start trying out this package to see how it helps them.

@floswald
Contributor

Hi Tom and John,

I am actually using it for real work. It allows me to plough through large dataframes very quickly and very intuitively. My typical usage is with Lazy.jl to do something like this:

sum1 = @> begin
        sim1
        @where((:j .== j) & (:year .> 1997))
        @transform(move_own = :move .* :own, move_rent = :move .* (!:own), buy = (:h .== 0) .* (:hh .== 1))
        @by(:year, move_own = mean(:move_own.data, WeightVec(:density.data)), move_rent = mean(:move_rent.data, WeightVec(:density.data)))
        end
  • I don't have any concerns about performance; I certainly couldn't complain about it being slower than using plain DataFrames. It would take me much longer to come up with the right expressions otherwise.
  • I would add something that lets me return several objects as a tuple (maybe there's a way to do that already?):
sum1 = @> begin
        sim1
        @where((:j .== j) & (:year .> 1997))
        @transform(move_own = :move .* :own, move_rent = :move .* (!:own), buy = (:h .== 0) .* (:hh .== 1))
        out1 = @by(:year, move_own = mean(:move_own.data, WeightVec(:density.data)), move_rent = mean(:move_rent.data, WeightVec(:density.data)))
        out2 = @by(:age, move_own = mean(:move_own.data, WeightVec(:density.data)), move_rent = mean(:move_rent.data, WeightVec(:density.data)))
        (out1, out2)
        end
  • Does it feel Julian? It does to me, because of the macros, yes.
  • Adding this to METADATA would be a good idea. I came across this by pure chance.

@johnmyleswhite

Glad that works so well for you. I personally find that code really hard to make sense of because of all the implicit arguments.

@floswald
Contributor

Yeah, I can see why you say that. It took me a little while to get used to it. It's always the first argument that gets piped in from the previous expression.
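A toy illustration of that convention, assuming Lazy.jl's @> macro behaves as its README documents (a sketch, not code from this thread):

```julia
using Lazy  # assumes the Lazy.jl package is installed

add1(x) = x + 1
double(x) = 2x

# @> threads each result in as the FIRST argument of the next call,
# so this expands to double(add1(4)):
@> 4 add1 double   # 10
```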

@johnmyleswhite

Thanks, that helps me understand. Is there a version where the piping is explicit? That's what I find most confusing.

@johnmyleswhite

Also, if you'd like to put this in METADATA.jl, I think that's a good idea.

@garborg

garborg commented Oct 24, 2014

@johnmyleswhite As far as I could tell, the @as macro is the most likely, if any, to make it into base.

It at least makes the arguments explicit:

@as _ begin
    sim1
    @where(_, (:j .== j) & (:year .> 1997))
    @transform(_, move_own = :move .* :own,
                  move_rent = :move .* (!:own),
                  buy = (:h .== 0) .* (:hh .== 1))
    @by(_, :year, move_own = mean(:move_own.data, WeightVec(:density.data)),
                  move_rent = mean(:move_rent.data, WeightVec(:density.data)))
end

More than worth the extra three characters per line IMO. I'm undecided how I feel about the lack of pipes there. Fortunately, I'm pretty sure the Lazy.jl version below won't make it in without pipes:

@as _ sim1 @where(_, (:j .== j) & (:year .> 1997)) @transform(_, move_own = :move .* :own)

@shashi

shashi commented Oct 24, 2014

This package is awesome, and should definitely be in METADATA.jl. I really like the macros and @> for the simple and natural language they provide together, Julian or not.

As for performance and increased expressiveness, there may be some optimization opportunities (please correct me if any of this is incorrect or unreasonable):

  • map in base is implemented over AbstractArrays. Currently, broadcast(f::Function, As::StridedArray) is a faster way to map over a vector of numbers.
    By just using broadcast as map, we could express these queries better: @where, @transform, and @by need not require broadcasted expressions. E.g. @where(_, (:j .== j) & (:year .> 1997)) could just be @where(_, (:j == j) & (:year > 1997)). Now these operations (== and >) in the where clause can be reasoned about per-row, as one would in an SQL statement. Besides, any function f can be used in place of the operators.
  • Transducers! (https://www.youtube.com/watch?v=6mTbuzafcII#t=1227) - transducers could make it possible to compose map, filter, and reduce processes and apply the composition in a single loop. This should also mean less memory allocation.
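The transducer idea in the last bullet can be sketched in a few lines of plain Julia (a hypothetical illustration, not an API from this package): each "transducer" wraps a reducing step, and composing them fuses map, filter, and reduce into one loop with no intermediate arrays.

```julia
# Hypothetical sketch: transducers as reducing-function transformers.
mapping(f)      = step -> (acc, x) -> step(acc, f(x))
filtering(pred) = step -> (acc, x) -> pred(x) ? step(acc, x) : acc

function transduce(xform, step, init, xs)
    rf = xform(step)        # build the fused reducing function once
    acc = init
    for x in xs             # a single pass, no temporaries
        acc = rf(acc, x)
    end
    acc
end

# Sum of squares of the even numbers in 1:10, in one loop:
xf = filtering(iseven) ∘ mapping(x -> x^2)
transduce(xf, +, 0, 1:10)   # 4 + 16 + 36 + 64 + 100 == 220
```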

There are some concerns with these points though:

@garborg

garborg commented Oct 24, 2014

I would be REALLY happy to see anything that made composition of functional
iterators (filter, Iterators.jl, etc.) faster in Julia.


@johnmyleswhite

I guess I don't really understand why iterators aren't fast right now. When you nest iterators, do they no longer allow inlining?

@garborg

garborg commented Oct 25, 2014

I haven't looked into it enough to have much to say, but another thing is that when it comes to things like filtering, there are extra loops in the next methods, etc., so it gets far removed from a single loop pretty quickly. I'm not sure how much getting around that there is without adding something like a skip function that defaults to false. Relatedly, I've heard a couple of people mention using

@garborg

garborg commented Oct 25, 2014

Nullable to combine next and done to avoid the mild unpleasantness of mutation happening in done, which would be nice, but might also make fusing the iterators harder. Just half-baked thoughts.

@johnmyleswhite

I definitely think that using Nullable as part of the iteration protocol is a good idea.

I guess I assume you need nested loops for certain compositions of iterators. In particular, I was thinking we should remove pad from DataArrays and make it into an iterator, which could be a big gain in applications where you need to avoid allocating a lot of extra memory.
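A rough sketch of what a Nullable-based protocol might look like (hypothetical, written in the v0.4-era syntax of this thread; Nullable was later removed from Base): a single call returns Nullable((item, state)), with the null case signalling exhaustion, so the mutation-in-done awkwardness and the filtering skip-loop both live in one function.

```julia
# Hypothetical: one `trynext` call replaces the done()/next() pair.
# A null result means the iterator is exhausted; the filtering skip-loop
# happens inside the call instead of in done().
function trynext(xs::Vector{Int}, i::Int, pred)
    while i <= length(xs)
        pred(xs[i]) && return Nullable((xs[i], i + 1))
        i += 1
    end
    Nullable{Tuple{Int,Int}}()
end

function sumfiltered(xs, pred)
    s, i = 0, 1
    r = trynext(xs, i, pred)
    while !isnull(r)
        x, i = get(r)
        s += x
        r = trynext(xs, i, pred)
    end
    s
end

sumfiltered([1, 2, 3, 4, 5, 6], iseven)   # 2 + 4 + 6 == 12
```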

@garborg

garborg commented Oct 25, 2014

Oh yes, that makes sense, and Iterators.chain is the same thing. Still,
composition could be worth playing around with.


@tshort
Contributor Author

tshort commented Nov 17, 2014

DataFramesMeta is now in METADATA...

@davidagold
Contributor

With regard to JuliaData/DataFrames.jl#369, I have been working on a simple @byrow macro that supports if blocks:

julia> df
8x3 DataFrame
| Row | a | b   | col_1    |
|-----|---|-----|----------|
| 1   | 1 | "M" | 0.731688 |
| 2   | 2 | "F" | 0.294839 |
| 3   | 3 | "M" | 0.667601 |
| 4   | 4 | "M" | 0.24186  |
| 5   | 5 | "F" | 0.247961 |
| 6   | 6 | "F" | 0.302071 |
| 7   | 7 | "M" | 0.167708 |
| 8   | 8 | "F" | 0.664298 |

julia> @byrow df if :a > 1; :b = "foo" end

julia> df
8x3 DataFrame
| Row | a | b     | col_1    |
|-----|---|-------|----------|
| 1   | 1 | "M"   | 0.731688 |
| 2   | 2 | "foo" | 0.294839 |
| 3   | 3 | "foo" | 0.667601 |
| 4   | 4 | "foo" | 0.24186  |
| 5   | 5 | "foo" | 0.247961 |
| 6   | 6 | "foo" | 0.302071 |
| 7   | 7 | "foo" | 0.167708 |
| 8   | 8 | "foo" | 0.664298 |

julia> macroexpand( :( @byrow df if :a > 1; :b = "Foo" end ))
:(for row = 1:length(df[1])
        if df[row,:a] > 1 # line 1:
            df[row,:b] = "Foo"
        end
    end)

You can also use begin end blocks to improve readability:

julia> @byrow df begin
                   if :a > 1
                       :b = "bar"
                   end
                 end

julia> df
8x3 DataFrame
| Row | a | b     | col_1    |
|-----|---|-------|----------|
| 1   | 1 | "M"   | 0.731688 |
| 2   | 2 | "bar" | 0.294839 |
| 3   | 3 | "bar" | 0.667601 |
| 4   | 4 | "bar" | 0.24186  |
| 5   | 5 | "bar" | 0.247961 |
| 6   | 6 | "bar" | 0.302071 |
| 7   | 7 | "bar" | 0.167708 |
| 8   | 8 | "bar" | 0.664298 |

The code is here. I haven't tested performance yet. My approach is very simple and doesn't generate a new function for which type inference is unimpeded, as in @tshort's implementation of @with. Furthermore, I haven't yet experimented with the Devectorize.jl package, and I don't know how much this macro overlaps with it. All that said, I think this syntax could go a long way toward ameliorating issues such as those described in 369 and in this thread.

I'm very happy to continue working on this functionality for DataFramesMeta if folks here like it. Immediate next steps are to add more expressive power, such as the ability to index into an array x[:] that isn't a column of the implicit DataFrame. This will probably require some way to denote that the index for x[:] needs to be hooked into the loop generated by the macro, e.g.:

@byrow df begin
    if :a > ^x
        :a = 2(^x)
    end
end

to get

for row in 1:length(df[1])
    if df[row, :a] > x[row]
        df[row, :a] = 2x[row]
    end
end

EDIT: Actually, the above turns out to be unnecessary. If x is a vector outside the "scope" of df, then just writing x[row] will suffice:

julia> macroexpand( :( @byrow df if :a > 1; :b = x[row] end ))
:(for row = 1:length(df[1])
        if df[row,:a] > 1 # line 1:
            df[row,:b] = x[row]
        end
    end)

If anybody has thoughts, please do share them! @nalimilan, is this along the lines of what you had in mind in 369 above? (Thank you for humoring my reference to your posts from over a year ago...)

@nalimilan
Member

@davidagold Interesting. Indeed, it looks like what I was describing in JuliaData/DataFrames.jl#369. I don't think you need to worry about Devec.jl; I see it as orthogonal to this kind of macro.

@tshort
Contributor Author

tshort commented May 22, 2015

I like this idea, @davidagold; it would make a good addition to DataFramesMeta. I don't think it will perform well with the indexing inside the loop, though. For better performance, you could try converting to something like the following:

@with df for row in 1:length(df[1])
    if :a[row] > x[row]
        :a[row] = 2x[row]
    end
end

@davidagold
Contributor

Thank you both for your inputs! Tom, you're right about performance:

using DataArrays, DataFrames, DataFramesMeta

srand(1)
n = 10_000_000
a = rand(n)
b = rand(n)
c = rand(n)
d = zeros(n)
df = DataFrame(a=a, b=b, c=c, d=d)

function f1()
    x = 0.0
    @byrow df (begin
        if :a < :b
            x += :b * :c
        end
    end)
    return x
end

function f2()
    x = 0.0
    a = convert(Array, df[:a])
    b = convert(Array, df[:b])
    c = convert(Array, df[:c])
    for row in 1:10_000_000
        if a[row] < b[row]
            x += b[row] * c[row]
        end
    end
    return x
end

function f3()
    x = 0.0
    @with df (begin
        for row in 1:length(df[1])
            if :a[row] < :b[row]
                x += :b[row] * :c[row]
            end
        end
    end)
    return x
end

function g1()
    @byrow df begin
        if :a < :b
            :d = :b * :c
        end
    end
end

function g2()
    a = convert(Array, df[:a])
    b = convert(Array, df[:b])
    c = convert(Array, df[:c])
    for row in 1:10_000_000
        if a[row] < b[row]
            df[row, :d] = b[row] * c[row]
        end
    end
end

function g3()
    @with df begin
        for row in 1:length(df[1])
            if :a[row] < :b[row]
                :d[row] = :b[row] * :c[row]
            end
        end
    end
end

f1()
f2()
f3()
g1()
g2()
g3()

println("f1: ", @time(f1()))
println("f2: ", @time(f2()))
println("f3: ", @time(f3()))
println("g1: ", @time(g1()))
println("g2: ", @time(g2()))
println("g3: ", @time(g3()))

gives

elapsed time: 5.64861955 seconds (1760130256 bytes allocated, 20.85% gc time)
f1: 1.6670060387376607e6

elapsed time: 2.464278376 seconds (1120132752 bytes allocated, 37.23% gc time)
f2: 1.6670060387376607e6

elapsed time: 3.435715078 seconds (1280088960 bytes allocated, 23.48% gc time)
f3: 1.6670060387376607e6

elapsed time: 6.304825319 seconds (1920164824 bytes allocated, 20.65% gc time)
g1: nothing

elapsed time: 3.157730953 seconds (1360203344 bytes allocated, 31.98% gc time)
g2: nothing

elapsed time: 3.384456055 seconds (1200062760 bytes allocated, 23.14% gc time)
g3: nothing

I'll go work on your suggestion. May I submit it as a PR when it's ready?

@davidagold
Contributor

@tshort After implementing your suggestion the same tests as above now give:

elapsed time: 3.841014122 seconds (1280088960 bytes allocated, 25.72% gc time)
f1: 1.6670060387376607e6

elapsed time: 2.611359312 seconds (1120132752 bytes allocated, 37.45% gc time)
f2: 1.6670060387376607e6


elapsed time: 3.53787387 seconds (1200062760 bytes allocated, 22.51% gc time)
g1: nothing

elapsed time: 3.085036564 seconds (1360203344 bytes allocated, 32.94% gc time)
g2: nothing

where (as a reminder) f1/g1 use @byrow (which now essentially does the same thing as f3/g3 above) and f2/g2 converts DataFrame columns to Arrays and loops through the latter. If you have any more suggestions for performance, I am game to investigate.

@davidagold
Contributor

Aaactually, I've run into an issue. Going to file.

@tshort
Contributor Author

tshort commented May 22, 2015

Yes to the PR when ready, @davidagold. As for your tests, try them without using globals; they should all be faster.
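For instance, passing the DataFrame in as an argument instead of reading the global df (a sketch based on the f2 benchmark above, with the same assumed columns and the era's df[:a] indexing):

```julia
# Same work as f2 above, but `df` arrives as an argument, so the
# compiler can infer concrete types for a, b, and c instead of
# treating them as values derived from an untyped global.
function f2_arg(df)
    x = 0.0
    a = convert(Array, df[:a])
    b = convert(Array, df[:b])
    c = convert(Array, df[:c])
    for row in 1:length(a)
        if a[row] < b[row]
            x += b[row] * c[row]
        end
    end
    return x
end

f2_arg(df)   # same answer as f2(), without the per-access global lookup
```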
