RFC: Added generalized sorting routines to DataFrames. #177

kmsquire · 2013-02-02T21:00:33Z

Make Sort.sort, Sort.sortby, and Sort.sortperm work for DataFrames.

Notes:

* ~~sort* functions expect an AbstractVector, so DataFrames are wrapped in a DFVector type.~~
* ~~sort! functions are not implemented.~~
* Accessing a row with df[x,:] produces another DataFrame, and it doesn't make much sense to compare two DataFrames, so DataFrame rows are wrapped in a DFRow type for comparison.
* Sort.By was extended to accept numbers, strings, or vectors and to use those as indices into DataFrame columns, which allows sorting on particular columns.
* All sorting is implemented in terms of sortperm.
* TimSort is the default sort. DataFrames should use a stable sort, and they are one place where chunks can be expected to be already sorted, which TimSort takes advantage of.
* ~~Note that the order of arguments changes:~~

~~Old:~~

~~DataFrames.sortby(df::DataFrame, colname::String)~~

~~New:~~

~~Sort.sortby(cols, df::DataFrame)~~

I think it would make more sense for the vector being sorted to go first in Sort.sortby, in anticipation of keyword arguments that could be used to specify the sort algorithm, etc.

EDIT: Putting the vector to be sorted first would have made the definitions here a lot simpler as well!
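
For readers following along on a current Julia, here is a minimal sketch of what "implemented in terms of sortperm" with a stable algorithm amounts to. It uses only Base and plain vectors as stand-in columns; the data and variable names are made up for illustration, and this is not the code in this pull request.

```julia
# Two columns of a small table, stored as plain vectors (hypothetical data).
w = [2, 2, 1, 1, 2]
z = [3, 6, 2, 5, 3]

# Compute one permutation from the key column with a stable algorithm,
# so rows with equal keys keep their original relative order.
p = sortperm(w; alg = MergeSort)

# Apply the same permutation to every column to get the sorted table.
w_sorted = w[p]
z_sorted = z[p]
```

Because every column is indexed with the same permutation, rows stay intact no matter how many columns the table has.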

CC: @StefanKarpinski

kmsquire · 2013-02-02T21:17:20Z

Just to add: for some reason I couldn't quite figure out, sorting hung without JuliaLang/julia#2172, so you need to update julia to test this.

kmsquire · 2013-02-02T21:33:47Z

Example output:

julia> using DataFrames

julia> d = @DataFrame(w=>rand(1:3, 10), x => rand(1:4, 10), y=>rand(1:5, 10), z=>rand(1:6, 10))
10x4 DataFrame:
         w x y z
[1,]     2 3 2 3
[2,]     2 2 2 6
[3,]     1 1 3 2
[4,]     1 1 1 5
[5,]     2 2 1 3
[6,]     3 1 5 4
[7,]     2 4 2 6
[8,]     2 2 2 1
[9,]     2 2 2 1
[10,]    2 4 4 6

julia> sort(d)
10x4 DataFrame:
         w x y z
[1,]     1 1 1 5
[2,]     1 1 3 2
[3,]     2 2 1 3
[4,]     2 2 2 1
[5,]     2 2 2 1
[6,]     2 2 2 6
[7,]     2 3 2 3
[8,]     2 4 2 6
[9,]     2 4 4 6
[10,]    3 1 5 4


julia> sortby("z", d)
10x4 DataFrame:
         w x y z
[1,]     2 2 2 1
[2,]     2 2 2 1
[3,]     1 1 3 2
[4,]     2 3 2 3
[5,]     2 2 1 3
[6,]     3 1 5 4
[7,]     1 1 1 5
[8,]     2 2 2 6
[9,]     2 4 2 6
[10,]    2 4 4 6


julia> sortby(4,d)
10x4 DataFrame:
         w x y z
[1,]     2 2 2 1
[2,]     2 2 2 1
[3,]     1 1 3 2
[4,]     2 3 2 3
[5,]     2 2 1 3
[6,]     3 1 5 4
[7,]     1 1 1 5
[8,]     2 2 2 6
[9,]     2 4 2 6
[10,]    2 4 4 6


julia> all(sortby(4, d) .== sortby("z", d))
true

julia> sortby([1,4],d)
10x4 DataFrame:
         w x y z
[1,]     1 1 3 2
[2,]     1 1 1 5
[3,]     2 2 2 1
[4,]     2 2 2 1
[5,]     2 3 2 3
[6,]     2 2 1 3
[7,]     2 2 2 6
[8,]     2 4 2 6
[9,]     2 4 4 6
[10,]    3 1 5 4


julia> all(sortby(["w","z"],d) .== sortby([1,4],d))
true

tshort · 2013-02-03T12:40:23Z

Great, Kevin! Your clever use of wrapper types and the flexibility of the Sort module made the code rather short. This addresses most of the items in issue #176.

Do you mind adding some tests to cover this? We don't currently have a test for sortby, and we should.

tshort · 2013-02-03T12:43:39Z

src/dataframe.jl

+    return false
+end
+
+type DFVector <: AbstractVector{DFRow}


The type DFVector is similar to the existing type DFRowIterator. Could we merge these capabilities into one type? I'm especially asking @johnmyleswhite, because he wrote that type. It may be nice for other uses to have a type that treats a DataFrame like an AbstractVector.
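
As a rough illustration of the "treat a table like an AbstractVector of rows" idea, here is a sketch in present-day Julia. `RowVector` is a hypothetical name, the columns are plain vectors held in a tuple, and rows are returned as tuples so that Base's lexicographic tuple comparison can stand in for a DFRow-style comparison. It is read-only and is not the DFVector or DFRowIterator code discussed here.

```julia
# Hypothetical read-only wrapper: presents equal-length columns as a vector of rows.
struct RowVector{C<:Tuple} <: AbstractVector{Tuple}
    cols::C
end

# Minimal AbstractArray interface: size, index style, and scalar indexing.
Base.size(v::RowVector) = (length(first(v.cols)),)
Base.IndexStyle(::Type{<:RowVector}) = IndexLinear()
Base.getindex(v::RowVector, i::Int) = map(c -> c[i], v.cols)

# Rows are (2, 3), (1, 5), (2, 1); tuples compare lexicographically.
rows = RowVector(([2, 1, 2], [3, 5, 1]))
p = sortperm(rows)
```

With only `size` and `getindex` defined, generic functions such as `sortperm` can already consume the wrapper; a writable version would also need `setindex!`.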

tshort · 2013-02-03T12:49:52Z

What's the main complication to implementing sortby!? Is it because DFVector is read only?

I like the idea of sortby! because it would save memory.

tshort · 2013-02-03T12:56:17Z

src/dataframe.jl

+sortby(by::Union(Real, String, Array, Function), v::AbstractDataFrame) = sort(Sort.By(by),v)
+
+# previous definition of sortby() had opposite order of Sort.sortby; keep for now
+sortby(df::DataFrame, colname::String) = sortby(colname, df)


I prefer this order of arguments for sortby as it matches the order of arguments for by and groupby. Others like with and within also lead with the DataFrame.

Also, can df be an AbstractDataFrame, and can colname be the union you have above, so it works for multiple columns?

kmsquire · 2013-02-04

I agree. I just submitted a pull request which changes the Julia sort routines to always put the vector first, which I think is more consistent with the rest of Julia (though I haven't done a thorough review).

Anyway, I'll update this version of the function to use the Union.

kmsquire · 2013-02-04T07:08:22Z

> What's the main complication to implementing sortby!? Is it because DFVector is read only?
>
> I like the idea of sortby! because it would save memory.

That's mainly it. The idea of wrapping DataFrames to look like vectors might have been clever, but the actual implementation is as simple as I could get it to make it work read-only. Making a read-write version should be possible, but I think it will take more work, at least to make it work relatively efficiently.

Actually, if you have enough memory, sort will probably be a bit faster than sort! for DataFrames (with the current implementation), since less data will be copied. This will be especially true for tables with lots of columns.

Another thought: stable sorting is done with TimSort here, or with MergeSort by default in mainline Julia. In either case, a certain amount of temporary storage is needed. For TimSort, this is just under half the size of the array on average (for unsorted arrays), and I think MergeSort uses more. In theory, the algorithms could be modified to merge in place (merging is where the memory is used), but with a significant performance hit, as I understand it. Either way, the memory savings are better with sort!, but only by about half the size of the array, on average.
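
To make the sort versus sort! trade-off concrete, here is a small sketch in present-day Julia of the two ways a row permutation can be applied to a set of columns. The `Dict` of vectors is only a stand-in for a DataFrame, and the names are hypothetical; `permute!` is the Base function that reorders a vector in place.

```julia
# Stand-in for a table: column name => column data (hypothetical example data).
cols = Dict("a" => [3, 1, 2], "b" => ["c", "a", "b"])

# One permutation, computed from the key column.
p = sortperm(cols["a"])

# Out-of-place: build new columns and leave the originals untouched.
sorted = Dict(name => col[p] for (name, col) in cols)

# In place: reorder each existing column according to the same permutation.
for col in values(cols)
    permute!(col, p)
end
```

The out-of-place form allocates a full copy of every column, while the in-place form reuses the existing storage apart from the permutation itself.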

I'll try to add some tests and documentation tomorrow, and see if a read-write DFVector is easier than I expect. I should probably look at DFRowIterator, too, to make sure I'm not doing something silly here.

Cheers!

Kevin

tshort · 2013-02-04T12:38:29Z

Thanks, Kevin. I think adding write capability to DFVector should be straightforward, maybe as easy as:

function assign(v::DFVector, i, val)
    v.df[i, :] = val
    return nothing
end

kmsquire · 2013-02-07T01:42:20Z

Updates:

* At @StefanKarpinski's suggestion, I removed the wrapper types and implemented sorting in terms of special Ordering subtypes (Sort.Perm and DataFrames.DFPerm).
* sort and sort! are now both implemented in terms of sortperm. This means that sort! needs two Int arrays as long as the DataFrame, but all movement is done in place.
* Sorting can be done over any subset of columns, and a sort order can be specified for each column individually:

julia> d = @DataFrame(w=>rand(1:3, 10), x => rand(1:4, 10), y=>rand(1:5, 10), z=>rand(1:6, 10))
10x4 DataFrame:
WARNING: strcat is deprecated, use string instead.
         w x y z
[1,]     1 3 2 2
[2,]     2 2 3 5
[3,]     1 1 4 2
[4,]     2 4 3 2
[5,]     2 1 3 4
[6,]     1 1 1 5
[7,]     2 3 2 3
[8,]     2 2 5 1
[9,]     1 3 4 6
[10,]    1 3 2 3


julia> sortby(d,:y,Sort.Reverse)
10x4 DataFrame:
         w x y z
[1,]     2 2 5 1
[2,]     1 1 4 2
[3,]     1 3 4 6
[4,]     2 2 3 5
[5,]     2 4 3 2
[6,]     2 1 3 4
[7,]     1 3 2 2
[8,]     2 3 2 3
[9,]     1 3 2 3
[10,]    1 1 1 5


julia> sortby(d,[:w,:z],[Sort.Forward,Sort.Reverse])
10x4 DataFrame:
         w x y z
[1,]     1 3 4 6
[2,]     1 1 1 5
[3,]     1 3 2 3
[4,]     1 3 2 2
[5,]     1 1 4 2
[6,]     2 2 3 5
[7,]     2 1 3 4
[8,]     2 3 2 3
[9,]     2 4 3 2
[10,]    2 2 5 1


julia> sortby(d,[(:w,Sort.Forward),(:z,Sort.Reverse)])
10x4 DataFrame:
         w x y z
[1,]     1 3 4 6
[2,]     1 1 1 5
[3,]     1 3 2 3
[4,]     1 3 2 2
[5,]     1 1 4 2
[6,]     2 2 3 5
[7,]     2 1 3 4
[8,]     2 3 2 3
[9,]     2 4 3 2
[10,]    2 2 5 1

Still needed: tests and docs. The per-column sort specification could also be generalized some more.
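
The idea of ordering row indices rather than wrapped rows, with a direction per column, can be sketched in present-day Julia as follows. This is only an illustration of the approach using a custom `lt` function over indices; `rowless` and the sample data are hypothetical, and it does not reproduce Sort.Perm or DataFrames.DFPerm.

```julia
# Two key columns (hypothetical data).
w = [1, 2, 1, 2, 1]
z = [2, 5, 6, 1, 3]

# Compare two row indices: w ascending first, then z descending on ties.
rowless(i, j) = w[i] != w[j] ? w[i] < w[j] : z[i] > z[j]

# Sort the row indices themselves; the columns are never moved during sorting.
p = sort(collect(eachindex(w)); lt = rowless, alg = MergeSort)

w_sorted = w[p]
z_sorted = z[p]
```

Only the index vector is rearranged while sorting, so the cost of each swap does not depend on the number of columns.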

StefanKarpinski · 2013-02-07T01:43:45Z

Oh, lovely. Was this, as I suspected, much faster?

kmsquire · 2013-02-07T01:43:52Z

Oh, and to test, these depend on

kmsquire · 2013-02-07T01:44:38Z

To be honest, I didn't test the speed, but it is much cleaner. I'll try to compare them.

StefanKarpinski · 2013-02-07T01:45:11Z

Got it. I suspect it is, but just cleaner is good too.

tshort · 2013-02-07T02:55:47Z

Nice! I like how you can sort by different columns differently.

Is it ready to merge? (I didn't know if you wanted to wait on merging until you looked at the "still needs..." items.)

johnmyleswhite · 2013-02-07T02:58:47Z

Looks great. At some point, we need to decide whether column names should be symbols or strings. For now I'd suggest changing to strings for consistency with indexing, assignment and by.

kmsquire · 2013-02-07T06:18:17Z

Probably should wait to merge until the two Julia pulls I referenced are merged. I'll try to work on it some more in the meantime, but I think it has good enough functionality to be merged. I can always do more later in a separate request.

kmsquire · 2013-02-08T06:42:32Z

Okay, everything should be in place in julia for this to be merged. It still needs a few things, but I'm probably not going to be able to work on them for a little while, so I'll leave it up to you all to decide whether or not it is in good enough shape to merge. Needs:

* Tests
* Better handling of NAs. Should probably be "move to end" as with NaNs(?)
* Documentation (where should this go?)
* Performance testing
* Possibly: more generalization in specifying sort order of columns.


tshort · 2013-02-08T13:01:14Z

I'll go ahead and merge this, so more folks can play with it. I've also got another idea to consider...

We currently have a counting sort routine that can produce a sort index for vectors with smallish numbers of integers very quickly. See groupsort_indexer in grouping.jl, which was derived from a pandas function. This is a stable sort. For PooledDataArrays, we have the following defined:

sortperm(pda::PooledDataArray) = groupsort_indexer(pda)[1]
sort(pda::PooledDataArray) = pda[sortperm(pda)]

What I'm searching for is a way to use groupsort_indexer with sortby for DataFrames, so the sorting strategy can take into account the speed of the counting sort. Because a counting sort isn't a comparison sort, it can't use lt, which makes it more difficult to integrate with the new sorting infrastructure.

We also have the somewhat experimental IndexedVector in indexing.jl that keeps a sorting index along with the vector. We also have sortperm and sort defined for this type.

So, any ideas on how to incorporate this with sortby? I can think of options to handle some cases, but it starts to get complicated:

* All sorting columns are PooledDataVecs or IndexedVectors -- Successively apply sortperm to columns to come up with an overall sorting index. Pretty easy.
* None of the sorting columns are PooledDataVecs or IndexedVectors -- No changes.
* One or more of the first sorting columns are PDVs or IVs -- Find a presorting index using the successive-sortperm approach then apply the existing sortby with the presorted index. This is starting to get complicated.
* Some columns (but not the first) are PDVs or IVs -- Find a presorting index again based on PDV and IV columns. This only helps if there are duplicates in the "regular" columns. Again, complicated.
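
One way to read "successively apply sortperm" is the usual least-significant-key-first scheme: sort by the last key, then stably re-sort by each earlier key. The following is a minimal sketch of that reading in present-day Julia with made-up data; it is an assumption about the approach, not the groupsort_indexer code.

```julia
# Sort rows by (a, b): a is the primary key, b breaks ties (hypothetical data).
a = [2, 1, 2, 1]
b = [1, 2, 2, 1]

# Pass 1: order by the least significant key.
p = sortperm(b; alg = MergeSort)

# Pass 2: stably re-order by the primary key. Stability preserves the
# b-ordering from pass 1 among rows with equal a.
p = p[sortperm(a[p]; alg = MergeSort)]

# Reference result: a direct lexicographic sort of the (a, b) pairs.
expected = sortperm(collect(zip(a, b)))
```

This only works if every pass after the first is stable, which is why a stable per-column sortperm (such as a counting sort) fits the scheme.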

It would also be nice to have a way to apply the groupsort_indexer to columns that are not PooledDataVectors; this could be integers where you know that they range from 0 to 100.
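
For the "integers known to range from 0 to 100" case, a stable counting-sort permutation can be sketched in a few lines of present-day Julia. `counting_sortperm` and its `lo`/`hi` arguments are hypothetical names for this illustration; it is not groupsort_indexer, and it assumes every value really lies in `lo:hi` (anything outside raises a `BoundsError`).

```julia
# Stable sort permutation for integers known to lie in lo:hi.
function counting_sortperm(v::Vector{<:Integer}, lo::Integer, hi::Integer)
    nvals = Int(hi) - Int(lo) + 1
    # Histogram, shifted by one slot so a running sum yields start positions.
    starts = zeros(Int, nvals + 1)
    for x in v
        starts[Int(x) - Int(lo) + 2] += 1
    end
    starts[1] = 1
    starts = cumsum(starts)   # starts[k] = first output position for value lo + k - 1
    perm = Vector{Int}(undef, length(v))
    # Scanning in the original order keeps equal values in input order (stable).
    for (i, x) in enumerate(v)
        slot = Int(x) - Int(lo) + 1
        perm[starts[slot]] = i
        starts[slot] += 1
    end
    return perm
end

v = [100, 0, 42, 0, 100, 7]
p = counting_sortperm(v, 0, 100)
```

The work is proportional to the number of rows plus the size of the value range, with no element comparisons, which is why it cannot be expressed through an `lt` function.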


tshort · 2013-02-08T15:25:38Z

Here are some timings to compare the groupsort_indexer approach with Kevin's sorting when sorting on factor-like variables:

function mysortby(d, idx)
    g = groupby(d, idx)
    g.parent[g.idx,:]
end

N = 1000000
d = @DataFrame(a => rand(1:5,N), b => rand(1:10,N), c => rand(N))
d["a_pdv"] = PooledDataArray(d["a"])
d["b_pdv"] = PooledDataArray(d["b"])
d["a_iv"] = IndexedVector(d["a_pdv"])


julia> @time sortperm(d["a"]); 
elapsed time: 2.419466972351074 seconds

julia> @time sortperm(d["a_pdv"]); 
elapsed time: 0.008836984634399414 seconds

julia> @time sortperm(d["a_iv"]); 
elapsed time: 2.002716064453125e-5 seconds


julia> @time sortby(d, ["a", "b"]);
elapsed time: 13.103484869003296 seconds

julia> @time mysortby(d, ["a", "b"]);
elapsed time: 1.382004976272583 seconds

julia> @time mysortby(d, ["a_pdv", "b_pdv"]);
elapsed time: 1.1158790588378906 seconds


julia> @time sortby(d, "a");
elapsed time: 2.9830338954925537 seconds

julia> @time mysortby(d, "a");
elapsed time: 0.7941920757293701 seconds

julia> @time mysortby(d, "a_pdv");
elapsed time: 0.637357234954834 seconds

julia> @time d[sortperm(d["a_pdv"]),:];
elapsed time: 0.6243560314178467 seconds

julia> @time d[sortperm(d["a_iv"]),:];
elapsed time: 0.6324689388275146 seconds

kmsquire · 2013-03-06T21:37:54Z

Hi Tom, sorry, got busy, and this fell off my radar, but your counting-sort timings are very impressive! At some point, it would be good to try to integrate those into mainline Julia, and there's probably more integration that could be done here as well.

tshort · 2013-03-07T00:01:09Z

Hi Kevin. The counting sort is great when you want to search a known set of values (like a range of integers). It's ideal for PooledDataVectors. I pondered how it might be integrated with mainline Julia, but I couldn't come up with anything easy. A counting sort is quite different from the types of sorting algorithms in Base. In DataFrames, I implemented the easiest case I outlined above, where the sorting column (limited to one only) in a DataFrame is a PooledDataVector or IndexedVector. Here's the code for that:

# Extras to speed up sorting
sortperm{V}(d::AbstractDataFrame, a::Sort.Algorithm, o::FastPerm{Sort.Forward,V}) = sortperm(o.vec)
sortperm{V}(d::AbstractDataFrame, a::Sort.Algorithm, o::FastPerm{Sort.Reverse,V}) = reverse(sortperm(o.vec))

I wasn't sure how to go much further than that.
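
One detail worth keeping in mind with the reverse case: reversing a stable ascending permutation also reverses the order of tied rows, whereas a stable sort under a reversed ordering keeps ties in their original order. Both give a valid descending order. A small check in present-day Julia (Base only, made-up data):

```julia
v = [1, 2, 1]

# Reverse of the stable ascending permutation: ties (rows 1 and 3) come out as 3, 1.
p_reversed = reverse(sortperm(v; alg = MergeSort))

# Stable sort under a reversed ordering: ties stay in input order 1, 3.
p_rev_order = sortperm(v; alg = MergeSort, rev = true)
```

Whether the difference matters depends on whether callers rely on stability for descending sorts, for example when chaining sorts over several columns.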

kmsquire · 2013-03-07T00:24:53Z

One idea would be to make the sort generally available, and on the first pass, fall back to a different sort routine if the number or range of values gets too large.

More generally, a stable radix sort algorithm would be a useful addition, and some (all?) versions use a counting sort internally.

At some point, I would love to look into this more, but I'm too busy with other things right now. I'll be interested if you do anything more with the counting sort.

tshort · 2013-03-07T00:32:56Z

I'm also too tied up to tackle this, especially in base where the sorting code is elegant but complicated. Maybe when someone runs into a real need here, they will implement it. Hopefully, our code gives them some basics they can use.


tshort reviewed Feb 3, 2013
kmsquire mentioned this pull request Feb 7, 2013

zip documentation JuliaLang/julia#2209

Merged

Added sort routines

67ef3e8

* Sorting is based on sortperm
* Can specify sort order for each column (not completely general, but functional)

Still To Do:

* Tests
* Docs
* Generalizing sort order specification (e.g., by a Function of the column items)

tshort added a commit that referenced this pull request Feb 8, 2013

Merge pull request #177 from kmsquire/sorting

792b443

RFC: Added generalized sorting routines to DataFrames.

tshort merged commit 792b443 into JuliaData:master Feb 8, 2013

tshort mentioned this pull request Feb 11, 2013

What to do about order() #176

Closed

kmsquire deleted the sorting branch March 6, 2013 21:27
