RFC: Added generalized sorting routines to DataFrames. #177
Just to add: for some reason I couldn't quite figure out, sorting hung without JuliaLang/julia#2172, so you need to update julia to test this.
Example output:
julia> using DataFrames
julia> d = @DataFrame(w=>rand(1:3, 10), x => rand(1:4, 10), y=>rand(1:5, 10), z=>rand(1:6, 10))
10x4 DataFrame:
w x y z
[1,] 2 3 2 3
[2,] 2 2 2 6
[3,] 1 1 3 2
[4,] 1 1 1 5
[5,] 2 2 1 3
[6,] 3 1 5 4
[7,] 2 4 2 6
[8,] 2 2 2 1
[9,] 2 2 2 1
[10,] 2 4 4 6
julia> sort(d)
10x4 DataFrame:
w x y z
[1,] 1 1 1 5
[2,] 1 1 3 2
[3,] 2 2 1 3
[4,] 2 2 2 1
[5,] 2 2 2 1
[6,] 2 2 2 6
[7,] 2 3 2 3
[8,] 2 4 2 6
[9,] 2 4 4 6
[10,] 3 1 5 4
julia> sortby("z", d)
10x4 DataFrame:
w x y z
[1,] 2 2 2 1
[2,] 2 2 2 1
[3,] 1 1 3 2
[4,] 2 3 2 3
[5,] 2 2 1 3
[6,] 3 1 5 4
[7,] 1 1 1 5
[8,] 2 2 2 6
[9,] 2 4 2 6
[10,] 2 4 4 6
julia> sortby(4,d)
10x4 DataFrame:
w x y z
[1,] 2 2 2 1
[2,] 2 2 2 1
[3,] 1 1 3 2
[4,] 2 3 2 3
[5,] 2 2 1 3
[6,] 3 1 5 4
[7,] 1 1 1 5
[8,] 2 2 2 6
[9,] 2 4 2 6
[10,] 2 4 4 6
julia> all(sortby(4, d) .== sortby("z", d))
true
julia> sortby([1,4],d)
10x4 DataFrame:
w x y z
[1,] 1 1 3 2
[2,] 1 1 1 5
[3,] 2 2 2 1
[4,] 2 2 2 1
[5,] 2 3 2 3
[6,] 2 2 1 3
[7,] 2 2 2 6
[8,] 2 4 2 6
[9,] 2 4 4 6
[10,] 3 1 5 4
julia> all(sortby(["w","z"],d) .== sortby([1,4],d))
true
Great Kevin! Your clever use of wrapper types and the flexibility of the Sort module made the code rather short. This addresses most of the items in issue #176. Do you mind adding some tests to cover this? We don't currently have a test for |
return false
end

type DFVector <: AbstractVector{DFRow}
The type DFVector is similar to the existing type DFRowIterator. Could we merge these capabilities into one type? I'm especially asking @johnmyleswhite, because he wrote that type. It may be nice for other uses to have a type that treats a DataFrame like an AbstractVector.
What's the main complication to implementing I like the idea of |
sortby(by::Union(Real, String, Array, Function), v::AbstractDataFrame) = sort(Sort.By(by),v)

# previous definition of sortby() had opposite order of Sort.sortby; keep for now
sortby(df::DataFrame, colname::String) = sortby(colname, df)
I prefer this order of arguments for sortby as it matches the order of arguments for by and groupby. Others like with and within also lead with the DataFrame.
Also, can df be an AbstractDataFrame, and can colname be the union you have above, so it works for multiple columns?
I agree. I just submitted a pull request which changes the Julia sort routines to always put the vector first, which I think is more consistent with the rest of julia (though I haven't done a thorough review).
Anyway, I'll update this version of the function to use the Union.
That's mainly it. The idea of wrapping DataFrames to look like vectors might have been clever, but the actual implementation is as simple as I could get it to make it work read-only. Making a read-write version should be possible, but I think it will take more work, at least to make it work relatively efficiently. Actually, if you have enough memory,
Another thought: stable sorting is done with TimSort here, or MergeSort by default in the julia mainline. In either case, there is a certain amount of temporary storage needed. For TimSort, this is just under half the size of the array, on average (for unsorted arrays), and I think MergeSort uses more. In theory, the algorithms could be modified to merge in place (this is where the memory is used), but with a significant performance hit, as I understand it. Either way, the memory savings are better with
I'll try to add some tests and documentation tomorrow, and see if a read-write DFVector is easier than I expect. I should probably look at DFRowIterator, too, to make sure I'm not doing something silly here.
Cheers!
Kevin
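To illustrate why a stable algorithm matters for row sorting, here is a minimal sketch in current Julia syntax (not the 2013 syntax used in this thread). The `rows` data is made up for illustration; `MergeSort` is one of the stable algorithms in Base.

```julia
# Hypothetical rows: (key, label) pairs with repeated keys.
rows = [(2, "a"), (1, "b"), (2, "c"), (1, "d")]

# MergeSort is a stable algorithm, so rows with equal keys
# keep their original relative order after sorting.
sorted = sort(rows; by = first, alg = MergeSort)

println(sorted)
```

With a stable sort, "b" stays ahead of "d" and "a" stays ahead of "c", which is what lets a multi-column sort be built up from per-column passes.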
Thanks, Kevin. I think adding write capability to DFVector should be straightforward, maybe as easy as:
function assign(v::DFVector, i, val)
    v.df[i, :] = val
    return nothing
end
Updates:
julia> d = @DataFrame(w=>rand(1:3, 10), x => rand(1:4, 10), y=>rand(1:5, 10), z=>rand(1:6, 10))
10x4 DataFrame:
WARNING: strcat is deprecated, use string instead.
w x y z
[1,] 1 3 2 2
[2,] 2 2 3 5
[3,] 1 1 4 2
[4,] 2 4 3 2
[5,] 2 1 3 4
[6,] 1 1 1 5
[7,] 2 3 2 3
[8,] 2 2 5 1
[9,] 1 3 4 6
[10,] 1 3 2 3
julia> sortby(d,:y,Sort.Reverse)
10x4 DataFrame:
w x y z
[1,] 2 2 5 1
[2,] 1 1 4 2
[3,] 1 3 4 6
[4,] 2 2 3 5
[5,] 2 4 3 2
[6,] 2 1 3 4
[7,] 1 3 2 2
[8,] 2 3 2 3
[9,] 1 3 2 3
[10,] 1 1 1 5
julia> sortby(d,[:w,:z],[Sort.Forward,Sort.Reverse])
10x4 DataFrame:
w x y z
[1,] 1 3 4 6
[2,] 1 1 1 5
[3,] 1 3 2 3
[4,] 1 3 2 2
[5,] 1 1 4 2
[6,] 2 2 3 5
[7,] 2 1 3 4
[8,] 2 3 2 3
[9,] 2 4 3 2
[10,] 2 2 5 1
julia> sortby(d,[(:w,Sort.Forward),(:z,Sort.Reverse)])
10x4 DataFrame:
w x y z
[1,] 1 3 4 6
[2,] 1 1 1 5
[3,] 1 3 2 3
[4,] 1 3 2 2
[5,] 1 1 4 2
[6,] 2 2 3 5
[7,] 2 1 3 4
[8,] 2 3 2 3
[9,] 2 4 3 2
[10,] 2 2 5 1
Still need tests, docs, and the per-column sort specification can be generalized some more.
Oh, lovely. Was this, as I suspected, much faster? |
Oh, and to test, these depend on |
To be honest, I didn't test the speed, but it is much cleaner. I'll try to compare them. |
Got it. I suspect it is, but just cleaner is good too. |
Nice! I like how you can sort by different columns differently. Is it ready to merge? (I didn't know if you wanted to wait on merging until you looked at the "still needs..." items.) |
Looks great. At some point, we need to decide whether column names should be symbols or strings. For now I'd suggest changing to strings for consistency with indexing, assignment and |
Probably should wait to merge until the two Julia pulls I referenced are merged. I'll try to work on it some more in the meantime, but I think it has good enough functionality to be merged. I can always do more later in a separate request.
Okay, everything should be in place in julia for this to be merged. It still needs a few things, but I'm probably not going to be able to work on them for a little while, so I'll leave it up to you all to decide whether or not it is in good enough shape to merge. Needs:
* Sorting is based on sortperm
* Can specify sort order for each column (not completely general, but functional)
Still To Do:
* Tests
* Docs
* Generalizing sort order specification (e.g., by a Function of the column items)
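The idea of computing a row permutation with a separate sort direction per column can be sketched without DataFrames at all. The following is a rough illustration in current Julia syntax, not the code from this pull request: `sortperm_columns` is a hypothetical helper, and the "table" is assumed to be a NamedTuple of equal-length vectors. The `w` and `z` values are copied from the "Updates" example above.

```julia
# Hypothetical helper: returns a row permutation for a NamedTuple of columns.
# cols lists the sort columns in priority order; rev[k] is true when
# cols[k] should be sorted in descending order.
function sortperm_columns(table, cols::Vector{Symbol}, rev::Vector{Bool})
    n = length(first(table))
    # lt(i, j) is true when row i must come before row j.
    function lt(i, j)
        for (c, r) in zip(cols, rev)
            a, b = table[c][i], table[c][j]
            if a != b
                return r ? isless(b, a) : isless(a, b)
            end
        end
        return false  # rows tie on every sort column
    end
    # Sort the row indices themselves; MergeSort keeps tied rows in input order.
    return sort(collect(1:n); lt = lt, alg = MergeSort)
end

# Columns w and z from the "Updates" example.
table = (w = [1, 2, 1, 2, 2, 1, 2, 2, 1, 1],
         z = [2, 5, 2, 2, 4, 5, 3, 1, 6, 3])

perm = sortperm_columns(table, [:w, :z], [false, true])
println(perm)
```

Indexing each column with `perm` gives `w` ascending and, within each `w` group, `z` descending, matching the row order shown for `sortby(d,[:w,:z],[Sort.Forward,Sort.Reverse])`.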
I'll go ahead and merge this, so more folks can play with it. I've also got another idea to consider... We currently have a counting sort routine that can produce a sort index for vectors with smallish numbers of integers very quickly. See
sortperm(pda::PooledDataArray) = groupsort_indexer(pda)[1]
sort(pda::PooledDataArray) = pda[sortperm(pda)]
What I'm searching for is a way to use We also have the somewhat experimental So, any ideas on how to incorporate this with
It would also be nice to have a way to apply the |
Here are some timings to compare the
function mysortby(d, idx)
g = groupby(d, idx)
g.parent[g.idx,:]
end
N = 1000000
d = @DataFrame(a => rand(1:5,N), b => rand(1:10,N), c => rand(N))
d["a_pdv"] = PooledDataArray(d["a"])
d["b_pdv"] = PooledDataArray(d["b"])
d["a_iv"] = IndexedVector(d["a_pdv"])
julia> @time sortperm(d["a"]);
elapsed time: 2.419466972351074 seconds
julia> @time sortperm(d["a_pdv"]);
elapsed time: 0.008836984634399414 seconds
julia> @time sortperm(d["a_iv"]);
elapsed time: 2.002716064453125e-5 seconds
julia> @time sortby(d, ["a", "b"]);
elapsed time: 13.103484869003296 seconds
julia> @time mysortby(d, ["a", "b"]);
elapsed time: 1.382004976272583 seconds
julia> @time mysortby(d, ["a_pdv", "b_pdv"]);
elapsed time: 1.1158790588378906 seconds
julia> @time sortby(d, "a");
elapsed time: 2.9830338954925537 seconds
julia> @time mysortby(d, "a");
elapsed time: 0.7941920757293701 seconds
julia> @time mysortby(d, "a_pdv");
elapsed time: 0.637357234954834 seconds
julia> @time d[sortperm(d["a_pdv"]),:];
elapsed time: 0.6243560314178467 seconds
julia> @time d[sortperm(d["a_iv"]),:];
elapsed time: 0.6324689388275146 seconds
Hi Tom, sorry, got busy, and this fell off my radar, but your counting-sort timings are very impressive! At some point, it would be good to try to integrate those into mainline julia, and there's probably more integration that could be done here as well.
Hi Kevin. The counting sort is great when you want to search a known set of values (like a range of integers). It's ideal for PooledDataVectors. I pondered how it might be integrated with mainline Julia, but I couldn't come up with anything easy. A counting sort is quite different from the types of sorting algorithms in Base. In DataFrames, I implemented the easiest case I outlined above where the sorting column (limited to one only) in a DataFrame is a PooledDataVector or IndexedDataVector. Here's the code for that:
# Extras to speed up sorting
sortperm{V}(d::AbstractDataFrame, a::Sort.Algorithm, o::FastPerm{Sort.Forward,V}) = sortperm(o.vec)
sortperm{V}(d::AbstractDataFrame, a::Sort.Algorithm, o::FastPerm{Sort.Reverse,V}) = reverse(sortperm(o.vec))
I wasn't sure how to go much further than that.
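For readers unfamiliar with the technique, a counting sort over a known set of integer codes can be sketched as below. This is a generic illustration in current Julia syntax, not the `groupsort_indexer` implementation from DataFrames; `counting_sortperm` is a hypothetical name, and the codes are assumed to lie in `1:k`, roughly as the reference codes of a pooled vector would.

```julia
# Hypothetical sketch: stable sort permutation for integer codes in 1:k.
# Runs in O(n + k) time with no element comparisons.
function counting_sortperm(codes::AbstractVector{<:Integer}, k::Integer)
    # Pass 1: count how many times each code occurs.
    counts = zeros(Int, k)
    for c in codes
        counts[c] += 1
    end
    # starts[c] is the first output slot reserved for code c.
    starts = cumsum(counts) .- counts .+ 1
    # Pass 2: drop each row index into the next free slot of its code.
    perm = Vector{Int}(undef, length(codes))
    for (i, c) in enumerate(codes)
        perm[starts[c]] = i
        starts[c] += 1
    end
    return perm
end

codes = [3, 1, 2, 1]
perm = counting_sortperm(codes, 3)
println(perm)
```

Because rows are visited in input order, equal codes keep their relative order, so the result agrees with a stable comparison-based `sortperm` on the same codes.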
One idea would be to make the sort generally available, and on the first pass, fall back to a different sort routine if the number or range of values gets too large. More generally, a stable radix sort algorithm would be a useful addition, and some (all?) versions use a counting sort internally. At some point, I would love to look into this more, but I'm too busy with other things right now. I'll be interested if you do anything more with the counting sort.
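One way to read the fallback idea is sketched below, in current Julia syntax. Everything here is an assumption made for illustration: the name `sortperm_auto`, the `max_range` threshold, and its default value are not from Base or DataFrames. A first pass finds the value range; if it is small, a counting sort is used, otherwise the function falls back to a stable comparison sort.

```julia
# Hypothetical sketch: use a counting sort when the value range is small,
# otherwise fall back to Base's stable MergeSort.
function sortperm_auto(v::AbstractVector{Int}; max_range::Int = 1 << 16)
    isempty(v) && return Int[]
    lo, hi = extrema(v)                       # first pass: find the value range
    span = Int128(hi) - Int128(lo) + 1        # widened to avoid overflow
    span > max_range && return sortperm(v; alg = MergeSort)

    # counts[b + 1] holds the number of values in bucket b = x - lo + 1.
    counts = zeros(Int, Int(span) + 1)
    counts[1] = 1
    for x in v
        counts[x - lo + 2] += 1
    end
    starts = cumsum(counts)                   # starts[b] = first slot of bucket b

    perm = Vector{Int}(undef, length(v))
    for (i, x) in enumerate(v)
        b = x - lo + 1
        perm[starts[b]] = i
        starts[b] += 1
    end
    return perm
end

small = sortperm_auto([5, -2, 5, 0])          # range of 8: counting sort path
large = sortperm_auto([1, 10^9, 1])           # range too large: fallback path
println(small, " ", large)
```

Both paths are stable, so callers see the same permutation regardless of which branch is taken; only the running time and memory use differ.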
I'm also too tied up to tackle this, especially in base where the sorting
On Wed, Mar 6, 2013 at 7:24 PM, Kevin Squire notifications@github.com wrote:
Make Sort.sort, Sort.sortby, and Sort.sortperm work for DataFrames.
Notes:
* sort* functions expect an AbstractVector, so DataFrames are wrapped in a DFVector type.
* sort! functions are not implemented.
* df[x,:] produces another DataFrame, and it doesn't make much sense to compare two dataframes, so DataFrame rows are wrapped in a DFRow type for comparison.
* Sort.By was extended to accept numbers, strings, or vectors, and use those as indices into DataFrame columns, to allow sorting on particular columns
* sortperm
Note that the order of arguments changes:
Old: DataFrames.sortby(df::DataFrame, colname::String)
New: Sort.sortby(cols, df::DataFrame)
I think it would make more sense for the sort vector to go first in Sort.sortby, in anticipation of keyword arguments that could be used to specify the sort algorithm, etc.
EDIT: Putting the vector to be sorted first would have made the definitions here a lot simpler as well!
CC: @StefanKarpinski