RFC: Added generalized sorting routines to DataFrames. #177

Merged: 1 commit, Feb 8, 2013

kmsquire
Contributor

@kmsquire kmsquire commented Feb 2, 2013

Make Sort.sort, Sort.sortby, and Sort.sortperm work for DataFrames.

Notes:

  • sort* functions expect an AbstractVector, so DataFrames are wrapped in a DFVector type.
  • sort! functions are not implemented.
  • Accessing a row with df[x,:] produces another DataFrame, and it doesn't make much sense to compare two dataframes, so DataFrame rows are wrapped in a DFRow type for comparison.
  • Sort.By was extended to accept numbers, strings, or vectors, and use those as indices into DataFrame columns, to allow sorting on particular columns
  • All sorting is implemented in terms of sortperm
  • TimSort is the default sort; DataFrames should use a stable sort, and are one place where chunks should be expected to be already sorted, which TimSort takes advantage of
  • Note that the order of arguments changes:

Old:

DataFrames.sortby(df::DataFrame, colname::String)

New:

Sort.sortby(cols, df::DataFrame)

I think it would make more sense for the vector being sorted to go first in Sort.sortby, in anticipation of keyword arguments that could be used to specify the sort algorithm, etc.

EDIT: Putting the vector to be sorted first would have made the definitions here a lot simpler as well!
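
As a rough illustration of the wrapper idea in the notes above, the sketch below treats a table as a vector of row references and lets `sortperm` do the work. It is written against current Julia `Base` only, not the 2013 DataFrames code: `RowRef` and `RowVector` are hypothetical stand-ins for `DFRow` and `DFVector`, and the table is assumed to be a plain `NamedTuple` of equal-length columns.

```julia
# Hypothetical stand-ins for the DFRow / DFVector wrappers described above.
# A "table" here is just a NamedTuple of equal-length column vectors.
struct RowRef{T}
    cols::T      # all column vectors of the table
    i::Int       # row number this reference points at
end

struct RowVector{T} <: AbstractVector{RowRef{T}}
    cols::T
end

Base.size(v::RowVector) = (length(first(v.cols)),)
Base.IndexStyle(::Type{<:RowVector}) = IndexLinear()
Base.getindex(v::RowVector, i::Int) = RowRef(v.cols, i)

# Compare two rows lexicographically, column by column.
function Base.isless(a::RowRef, b::RowRef)
    for (ca, cb) in zip(a.cols, b.cols)
        x, y = ca[a.i], cb[b.i]
        isless(x, y) && return true
        isless(y, x) && return false
    end
    return false
end

table = (w = [2, 1, 2, 1], x = [3, 1, 2, 1], z = [3, 5, 6, 2])
p = sortperm(RowVector(table))          # permutation of row numbers
sorted = map(col -> col[p], table)      # apply it to every column
```

Because only the permutation is computed on the wrapper, the wrapper itself never needs to be writable, which matches the read-only design described in the notes.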

CC: @StefanKarpinski

@kmsquire
Contributor Author

kmsquire commented Feb 2, 2013

Just to add: for some reason I couldn't quite figure out, sorting hung without JuliaLang/julia#2172, so you need to update julia to test this.

@kmsquire
Contributor Author

kmsquire commented Feb 2, 2013

Example output:

julia> using DataFrames

julia> d = @DataFrame(w=>rand(1:3, 10), x => rand(1:4, 10), y=>rand(1:5, 10), z=>rand(1:6, 10))
10x4 DataFrame:
         w x y z
[1,]     2 3 2 3
[2,]     2 2 2 6
[3,]     1 1 3 2
[4,]     1 1 1 5
[5,]     2 2 1 3
[6,]     3 1 5 4
[7,]     2 4 2 6
[8,]     2 2 2 1
[9,]     2 2 2 1
[10,]    2 4 4 6

julia> sort(d)
10x4 DataFrame:
         w x y z
[1,]     1 1 1 5
[2,]     1 1 3 2
[3,]     2 2 1 3
[4,]     2 2 2 1
[5,]     2 2 2 1
[6,]     2 2 2 6
[7,]     2 3 2 3
[8,]     2 4 2 6
[9,]     2 4 4 6
[10,]    3 1 5 4


julia> sortby("z", d)
10x4 DataFrame:
         w x y z
[1,]     2 2 2 1
[2,]     2 2 2 1
[3,]     1 1 3 2
[4,]     2 3 2 3
[5,]     2 2 1 3
[6,]     3 1 5 4
[7,]     1 1 1 5
[8,]     2 2 2 6
[9,]     2 4 2 6
[10,]    2 4 4 6


julia> sortby(4,d)
10x4 DataFrame:
         w x y z
[1,]     2 2 2 1
[2,]     2 2 2 1
[3,]     1 1 3 2
[4,]     2 3 2 3
[5,]     2 2 1 3
[6,]     3 1 5 4
[7,]     1 1 1 5
[8,]     2 2 2 6
[9,]     2 4 2 6
[10,]    2 4 4 6


julia> all(sortby(4, d) .== sortby("z", d))
true

julia> sortby([1,4],d)
10x4 DataFrame:
         w x y z
[1,]     1 1 3 2
[2,]     1 1 1 5
[3,]     2 2 2 1
[4,]     2 2 2 1
[5,]     2 3 2 3
[6,]     2 2 1 3
[7,]     2 2 2 6
[8,]     2 4 2 6
[9,]     2 4 4 6
[10,]    3 1 5 4


julia> all(sortby(["w","z"],d) .== sortby([1,4],d))
true
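
The equivalence checked above (sorting by column position versus by column name) can be mimicked in a small sketch. This is an illustration in current Julia `Base`, not the PR's code: `sortperm_by` is a hypothetical helper, and the table is assumed to be a `NamedTuple` of columns, which can be indexed by either a `Symbol` or an `Int`.

```julia
# Hypothetical helper: row order of `table` when sorted on the given key columns.
# `keycols` may hold column names (Symbols) or column positions (Ints).
function sortperm_by(table::NamedTuple, keycols)
    nrows = length(first(table))
    rowkey(i) = map(k -> table[k][i], Tuple(keycols))   # e.g. (w[i], z[i])
    return sortperm(1:nrows; by = rowkey)               # tuples compare lexicographically
end

table = (w = [2, 1, 2, 1], x = [3, 1, 2, 1], z = [3, 5, 6, 2])
by_name = sortperm_by(table, [:w, :z])
by_position = sortperm_by(table, [1, 3])
sorted = map(col -> col[by_name], table)
```

Columns that are not part of the key (here `x`) simply follow the permutation computed from the key columns.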

@tshort
Contributor

tshort commented Feb 3, 2013

Great Kevin! Your clever use of wrapper types and the flexibility of the Sort module made the code rather short. This addresses most of the items in issue #176.

Do you mind adding some tests to cover this? We don't currently have a test for sortby, and we should.

return false
end

type DFVector <: AbstractVector{DFRow}
Contributor

The type DFVector is similar to the existing type DFRowIterator. Could we merge these capabilities into one type? I'm especially asking @johnmyleswhite, because he wrote that type. It may be nice for other uses to have a type that treats a DataFrame like an AbstractVector.

@tshort
Contributor

tshort commented Feb 3, 2013

What's the main complication to implementing sortby!? Is it because DFVector is read only?

I like the idea of sortby! because it would save memory.

sortby(by::Union(Real, String, Array, Function), v::AbstractDataFrame) = sort(Sort.By(by),v)

# previous definition of sortby() had opposite order of Sort.sortby; keep for now
sortby(df::DataFrame, colname::String) = sortby(colname, df)
Contributor

I prefer this order of arguments for sortby as it matches the order of arguments for by and groupby. Others like with and within also lead with the DataFrame.

Also, can df be an AbstractDataFrame, and can colname be the union you have above, so it works for multiple columns?

Contributor Author

I agree. I just submitted a pull request which changes the Julia sort routines to always put the vector first, which I think is more consistent with the rest of julia (though I haven't done a thorough review).

Anyway, I'll update this version of the function to use the Union.

@kmsquire
Contributor Author

kmsquire commented Feb 4, 2013

What's the main complication to implementing sortby!? Is it because DFVector is read only?

I like the idea of sortby! because it would save memory.

That's mainly it. The idea of wrapping DataFrames to look like vectors might have been clever, but the actual implementation is as simple as I could make it while still working read-only. Making a read-write version should be possible, but I think it will take more work, at least to make it work relatively efficiently.

Actually, if you have enough memory, sort will probably be a bit faster than sort! for DataFrames (with the current implementation), since less data will be copied. This will be especially true for tables with lots of columns.

Another thought: stable sorting is done with TimSort here, or MergeSort by default in the julia mainline. In either case, a certain amount of temporary storage is needed. For TimSort, this is just under half the size of the array, on average (for unsorted arrays), and I think MergeSort uses more. In theory, the algorithms could be modified to merge in place (the merge step is where the memory is used), but with a significant performance hit, as I understand it. Either way, the memory savings are better with sort!, but only by about half the size of the array, on average.
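
To make the "about half the array" figure concrete, here is a minimal sketch of a plain top-down merge sort (not TimSort, and not the implementation in Julia's `Base`) whose merge step copies only the left run into scratch space. The names `merge_runs!` and `mergesort_half!` are hypothetical; the point is only that the scratch buffer never needs more than half the input length, rounded up.

```julia
# Merge the adjacent sorted runs v[lo:mid] and v[mid+1:hi] in place.
# Only the left run is copied out, so `scratch` needs at most ceil(n/2) slots.
function merge_runs!(v, lo, mid, hi, scratch)
    n = mid - lo + 1
    copyto!(scratch, 1, v, lo, n)
    i, j, k = 1, mid + 1, lo
    while i <= n && j <= hi
        if isless(v[j], scratch[i])      # take from the right only if strictly smaller (keeps the sort stable)
            v[k] = v[j]; j += 1
        else
            v[k] = scratch[i]; i += 1
        end
        k += 1
    end
    while i <= n                         # leftover right-run items are already in place
        v[k] = scratch[i]; i += 1; k += 1
    end
    return v
end

function mergesort_half!(v::Vector, lo = 1, hi = length(v),
                         scratch = similar(v, (length(v) + 1) ÷ 2))
    lo >= hi && return v
    mid = (lo + hi) ÷ 2
    mergesort_half!(v, lo, mid, scratch)
    mergesort_half!(v, mid + 1, hi, scratch)
    return merge_runs!(v, lo, mid, hi, scratch)
end

v = [5, 2, 4, 1, 3]
mergesort_half!(v)
```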

I'll try to add some tests and documentation tomorrow, and see if a read-write DFVector is easier than I expect. I should probably look at DFRowIterator, too, to make sure I'm not doing something silly here.

Cheers!

Kevin

@tshort
Contributor

tshort commented Feb 4, 2013

Thanks, Kevin. I think adding write capability to DFVector should be straightforward, maybe as easy as:

function assign(v::DFVector, i, val)
    v.df[i, :] = val
    return nothing
end
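
For comparison, a minimal sketch of an in-place row sort built on `sortperm` instead of a writable wrapper, assuming the table is a `NamedTuple` of plain vectors rather than a DataFrame. `sortrows!` is a hypothetical name; `permute!` is the standard `Base` function that reorders a vector in place.

```julia
# Hypothetical in-place row sort: compute one permutation from the key column,
# then reorder every column in place with Base.permute!.
function sortrows!(cols::NamedTuple, keyname::Symbol)
    p = sortperm(cols[keyname])      # ties keep their original row order
    for col in cols
        permute!(col, p)             # reorders `col` in place; `p` is left unchanged
    end
    return cols
end

cols = (a = [3, 1, 2], b = ["c", "a", "b"])
sortrows!(cols, :a)
```

The permutation vector is still temporary storage of length equal to the number of rows, so this saves the copy of the columns but not the index array.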

@kmsquire
Contributor Author

kmsquire commented Feb 7, 2013

Updates:

  • at @StefanKarpinski's suggestion, I actually removed the wrapper classes, and implemented sorting in terms of special Ordering subclasses (Sort.Perm and DataFrames.DFPerm).
  • sort and sort! are now both implemented in terms of sortperm. This means that sort! needs two Int arrays as long as the DataFrame, but all movement is done in place.
  • sorting can be done over any subset of columns, and a sort order can be specified for each column individually:
julia> d = @DataFrame(w=>rand(1:3, 10), x => rand(1:4, 10), y=>rand(1:5, 10), z=>rand(1:6, 10))
10x4 DataFrame:
WARNING: strcat is deprecated, use string instead.
         w x y z
[1,]     1 3 2 2
[2,]     2 2 3 5
[3,]     1 1 4 2
[4,]     2 4 3 2
[5,]     2 1 3 4
[6,]     1 1 1 5
[7,]     2 3 2 3
[8,]     2 2 5 1
[9,]     1 3 4 6
[10,]    1 3 2 3


julia> sortby(d,:y,Sort.Reverse)
10x4 DataFrame:
         w x y z
[1,]     2 2 5 1
[2,]     1 1 4 2
[3,]     1 3 4 6
[4,]     2 2 3 5
[5,]     2 4 3 2
[6,]     2 1 3 4
[7,]     1 3 2 2
[8,]     2 3 2 3
[9,]     1 3 2 3
[10,]    1 1 1 5


julia> sortby(d,[:w,:z],[Sort.Forward,Sort.Reverse])
10x4 DataFrame:
         w x y z
[1,]     1 3 4 6
[2,]     1 1 1 5
[3,]     1 3 2 3
[4,]     1 3 2 2
[5,]     1 1 4 2
[6,]     2 2 3 5
[7,]     2 1 3 4
[8,]     2 3 2 3
[9,]     2 4 3 2
[10,]    2 2 5 1


julia> sortby(d,[(:w,Sort.Forward),(:z,Sort.Reverse)])
10x4 DataFrame:
         w x y z
[1,]     1 3 4 6
[2,]     1 1 1 5
[3,]     1 3 2 3
[4,]     1 3 2 2
[5,]     1 1 4 2
[6,]     2 2 3 5
[7,]     2 1 3 4
[8,]     2 3 2 3
[9,]     2 4 3 2
[10,]    2 2 5 1

Still need tests, docs, and the per-column sort specification can be generalized some more.
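
The per-column ordering shown above can be sketched with the ordering types in current Julia's `Base.Order`. This is an illustration under assumptions, not the PR's `DFPerm` implementation: `rowless` is a hypothetical helper, and the `w` and `z` vectors are copied from the example DataFrame in this comment.

```julia
using Base.Order: Forward, Reverse, lt

# Hypothetical row comparison: walk the key columns in order and apply each
# column's own ordering; the first column that differs decides.
function rowless(cols, orders, i, j)
    for (col, o) in zip(cols, orders)
        lt(o, col[i], col[j]) && return true
        lt(o, col[j], col[i]) && return false
    end
    return false
end

# Columns w and z from the example table above.
w = [1, 2, 1, 2, 2, 1, 2, 2, 1, 1]
z = [2, 5, 2, 2, 4, 5, 3, 1, 6, 3]

# Sort row numbers with a stable algorithm: w ascending, then z descending.
p = sort(collect(eachindex(w));
         lt = (i, j) -> rowless((w, z), (Forward, Reverse), i, j),
         alg = MergeSort)
```

With these inputs, `w[p]` and `z[p]` reproduce the `w` and `z` columns of the `sortby(d,[:w,:z],[Sort.Forward,Sort.Reverse])` output shown above.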

@StefanKarpinski
Member

Oh, lovely. Was this, as I suspected, much faster?

@kmsquire
Contributor Author

kmsquire commented Feb 7, 2013

To be honest, I didn't test the speed, but it is much cleaner. I'll try to compare them.

@StefanKarpinski
Member

Got it. I suspect it is, but just cleaner is good too.

@tshort
Contributor

tshort commented Feb 7, 2013

Nice! I like how you can sort by different columns differently.

Is it ready to merge? (I didn't know if you wanted to wait on merging until you looked at the "still needs..." items.)

@johnmyleswhite
Contributor

Looks great. At some point, we need to decide whether column names should be symbols or strings. For now I'd suggest changing to strings for consistency with indexing, assignment and by.

@kmsquire
Contributor Author

kmsquire commented Feb 7, 2013

We should probably wait to merge until the two Julia pulls I referenced are merged. I'll try to work on it some more in the meantime, but I think it has good enough functionality to be merged. I can always do more later in a separate request.

@kmsquire
Contributor Author

kmsquire commented Feb 8, 2013

Okay, everything should be in place in julia for this to be merged. It still needs a few things, but I'm probably not going to be able to work on them for a little while, so I'll leave it up to you all to decide whether or not it is in good enough shape to merge. Needs:

  • Tests
  • Better handling of NAs. Should probably be "move to end" as with NaNs(?)
  • Documentation (where should this go?)
  • Performance testing
  • Possibly: more generalization in specifying sort order of columns.

* Sorting is based on sortperm
* Can specify sort order for each column (not completely general, but functional)

Still To Do:

* Tests
* Docs
* Generalizing sort order specification (e.g., by a Function of the column items)
@tshort
Contributor

tshort commented Feb 8, 2013

I'll go ahead and merge this, so more folks can play with it. I've also got another idea to consider...

We currently have a counting sort routine that can very quickly produce a sort index for vectors that contain a smallish number of distinct integer values. See groupsort_indexer in grouping.jl, which was derived from a pandas function. This is a stable sort. For PooledDataArrays, we have the following defined:

sortperm(pda::PooledDataArray) = groupsort_indexer(pda)[1]
sort(pda::PooledDataArray) = pda[sortperm(pda)]

What I'm searching for is a way to use groupsort_indexer with sortby for DataFrames, so the sorting strategy can take advantage of the speed of the counting sort. The problem is that a counting sort isn't a comparison sort, so it can't use lt, which makes it harder to integrate with the new sorting infrastructure.
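
For readers unfamiliar with the technique, here is a minimal sketch of a counting sort that returns a stable sort index for integers in a known range. It is not the `groupsort_indexer` code from grouping.jl; `counting_sortperm` and its `lo`/`hi` arguments are hypothetical, and the input is assumed to be a 1-based vector whose values all lie in `lo:hi`. Note that no element-to-element comparison (`lt`) is ever made.

```julia
# Hypothetical counting-sort index: returns `perm` such that v[perm] is sorted,
# with equal values kept in their original order (stable).
# Assumes every value of `v` lies in lo:hi.
function counting_sortperm(v::AbstractVector{<:Integer}, lo::Integer, hi::Integer)
    counts = zeros(Int, hi - lo + 2)
    for x in v
        counts[x - lo + 2] += 1          # histogram, shifted by one slot
    end
    counts = cumsum(counts)              # counts[k] = number of values smaller than lo + k - 1
    perm = Vector{Int}(undef, length(v))
    for (i, x) in pairs(v)
        k = x - lo + 1
        counts[k] += 1                   # next free output slot for this value
        perm[counts[k]] = i
    end
    return perm
end

v = [3, 1, 2, 3, 1]
perm = counting_sortperm(v, 1, 3)
```

The work is proportional to `length(v) + (hi - lo)`, which is why it pays off for pooled or factor-like columns and not for arbitrary values.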

We also have the somewhat experimental IndexedVector in indexing.jl that keeps a sorting index along with the vector. We also have sortperm and sort defined for this type.

So, any ideas on how to incorporate this with sortby? I can think of options to handle some cases, but it starts to get complicated:

  • All sorting columns are PooledDataVecs or IndexedVectors -- Successively apply sortperm to columns to come up with an overall sorting index. Pretty easy.
  • None of the sorting columns are PooledDataVecs or IndexedVectors -- No changes.
  • One or more of the first sorting columns are PDVs or IVs -- Find a presorting index using the successive-sortperm approach then apply the existing sortby with the presorted index. This is starting to get complicated.
  • Some columns (but not the first) are PDVs or IVs -- Find a presorting index again based on PDV and IV columns. This only helps if there are duplicates in the "regular" columns. Again, complicated.
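
The "successively apply sortperm" idea from the first case can be sketched as follows, in current Julia `Base` and with plain vectors standing in for pooled columns. `successive_sortperm` is a hypothetical name. The approach relies on each pass being stable, so the passes run from the least significant key to the most significant one.

```julia
# Hypothetical multi-column sort index built from single-column sortperm calls.
# Columns are given most significant first; passes run least significant first,
# and because sortperm keeps ties in order, earlier passes act as tie-breakers.
function successive_sortperm(cols...)
    p = collect(eachindex(first(cols)))
    for col in reverse(cols)
        p = p[sortperm(col[p])]
    end
    return p
end

a = [2, 1, 2, 1]
b = [1, 2, 0, 1]
p = successive_sortperm(a, b)
```

Any stable single-column index routine could replace `sortperm` in the loop, which is what would let a counting sort slot in for pooled columns.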

It would also be nice to have a way to apply groupsort_indexer to columns that are not PooledDataVectors, for example integer columns whose values are known to range from 0 to 100.

tshort added a commit that referenced this pull request Feb 8, 2013
RFC: Added generalized sorting routines to DataFrames.
@tshort tshort merged commit 792b443 into JuliaData:master Feb 8, 2013
@tshort
Contributor

tshort commented Feb 8, 2013

Here are some timings to compare the groupsort_indexer approach with Kevin's sorting when sorting on factor-like variables:

function mysortby(d, idx)
    g = groupby(d, idx)
    g.parent[g.idx,:]
end

N = 1000000
d = @DataFrame(a => rand(1:5,N), b => rand(1:10,N), c => rand(N))
d["a_pdv"] = PooledDataArray(d["a"])
d["b_pdv"] = PooledDataArray(d["b"])
d["a_iv"] = IndexedVector(d["a_pdv"])


julia> @time sortperm(d["a"]); 
elapsed time: 2.419466972351074 seconds

julia> @time sortperm(d["a_pdv"]); 
elapsed time: 0.008836984634399414 seconds

julia> @time sortperm(d["a_iv"]); 
elapsed time: 2.002716064453125e-5 seconds


julia> @time sortby(d, ["a", "b"]);
elapsed time: 13.103484869003296 seconds

julia> @time mysortby(d, ["a", "b"]);
elapsed time: 1.382004976272583 seconds

julia> @time mysortby(d, ["a_pdv", "b_pdv"]);
elapsed time: 1.1158790588378906 seconds


julia> @time sortby(d, "a");
elapsed time: 2.9830338954925537 seconds

julia> @time mysortby(d, "a");
elapsed time: 0.7941920757293701 seconds

julia> @time mysortby(d, "a_pdv");
elapsed time: 0.637357234954834 seconds

julia> @time d[sortperm(d["a_pdv"]),:];
elapsed time: 0.6243560314178467 seconds

julia> @time d[sortperm(d["a_iv"]),:];
elapsed time: 0.6324689388275146 seconds

@tshort tshort mentioned this pull request Feb 11, 2013
@kmsquire kmsquire deleted the sorting branch March 6, 2013 21:27
@kmsquire
Contributor Author

kmsquire commented Mar 6, 2013

Hi Tom, sorry, I got busy and this fell off my radar, but your counting-sort timings are very impressive! At some point, it would be good to try to integrate those into mainline julia, and there's probably more integration that could be done here as well.

@tshort
Contributor

tshort commented Mar 7, 2013

Hi Kevin. The counting sort is great when you want to sort a known set of values (like a range of integers). It's ideal for PooledDataVectors. I pondered how it might be integrated with mainline Julia, but I couldn't come up with anything easy. A counting sort is quite different from the types of sorting algorithms in Base. In DataFrames, I implemented the easiest case I outlined above, where the sorting column (limited to one only) in a DataFrame is a PooledDataVector or IndexedVector. Here's the code for that:

# Extras to speed up sorting
sortperm{V}(d::AbstractDataFrame, a::Sort.Algorithm, o::FastPerm{Sort.Forward,V}) = sortperm(o.vec)
sortperm{V}(d::AbstractDataFrame, a::Sort.Algorithm, o::FastPerm{Sort.Reverse,V}) = reverse(sortperm(o.vec))

I wasn't sure how to go much further than that.
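
The dispatch pattern in those two methods can be sketched in current Julia using the ordering singletons from `Base.Order`. Everything here is hypothetical scaffolding: `PooledColumn` and `fast_sortperm` are made-up names, and `sortperm` on the integer codes stands in for whatever fast index routine the column type provides.

```julia
using Base.Order: ForwardOrdering, ReverseOrdering, Forward, Reverse

# Hypothetical pooled column: values are stored as small integer codes.
struct PooledColumn
    codes::Vector{Int}
end

# Pick the fast path by dispatching on the ordering type.
fast_sortperm(col::PooledColumn, ::ForwardOrdering) = sortperm(col.codes)
fast_sortperm(col::PooledColumn, ::ReverseOrdering) = reverse(sortperm(col.codes))

col = PooledColumn([2, 1, 2, 1])
fwd = fast_sortperm(col, Forward)   # row numbers in ascending code order
rev = fast_sortperm(col, Reverse)   # row numbers in descending code order
```

One design point worth noting: reversing a forward permutation also reverses the order of tied rows, so the reverse path gives a correct descending order but not the same tie order a stable reverse sort would produce.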

@kmsquire
Contributor Author

kmsquire commented Mar 7, 2013

One idea would be to make the sort generally available, and on the first pass, fall back to a different sort routine if the number or range of values gets too large.

More generally, a stable radix sort algorithm would be a useful addition, and some (all?) versions use a counting sort internally.

At some point, I would love to look into this more, but I'm too busy with other things right now. I'll be interested if you do anything more with the counting sort.

@tshort
Contributor

tshort commented Mar 7, 2013

I'm also too tied up to tackle this, especially in base where the sorting code is elegant but complicated. Maybe when someone runs into a real need here, they will implement it. Hopefully, our code gives them some basics they can use.

