This repository has been archived by the owner on May 5, 2019. It is now read-only.

Specialize row_group_slots() and findrow() on column types to improve performance #79

Merged (1 commit) on Aug 7, 2017


nalimilan (Member) commented Jul 11, 2017

Looping over columns is very slow when their type is unknown at compile time.
Specialize the method on the types of the key (grouping) columns by passing
a tuple of columns rather than a DataTable. This will force compiling a specific
method for each combination of key types, but their number should remain relatively low
and the one-time cost is worth it.

This dramatically improves performance of groupby(), but does not have a large
effect on join() since it is very inefficient in other areas.
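As a rough illustration of the idea, here is a minimal sketch assuming Julia 1.x. The names `UntypedTable`, `count_equal_slow` and `count_equal_fast` are hypothetical and are not part of DataTables; the sketch only contrasts looping over untyped columns with passing the same columns as a tuple.

```julia
# Hypothetical stand-in for a table whose column types are hidden from the compiler.
struct UntypedTable
    columns::Vector{Any}
end

# Slow path: each `col` has static type `Any`, so every indexing operation
# and comparison inside the loop is dispatched at run time.
function count_equal_slow(t::UntypedTable, r1::Int, r2::Int)
    n = 0
    for col in t.columns
        n += isequal(col[r1], col[r2]) ? 1 : 0
    end
    return n
end

# Fast path: the columns arrive as a tuple, so their concrete types are part
# of the argument type and one method is compiled per combination of types.
count_equal_fast(cols::Tuple{}, r1::Int, r2::Int) = 0
count_equal_fast(cols::Tuple, r1::Int, r2::Int) =
    (isequal(cols[1][r1], cols[1][r2]) ? 1 : 0) +
    count_equal_fast(Base.tail(cols), r1, r2)

t = UntypedTable(Any[[1, 2, 1], ["a", "b", "a"]])

# Building the tuple costs one dynamic dispatch at the call site; the body of
# `count_equal_fast` is then compiled for this specific tuple type.
cols = (t.columns...,)
count_equal_fast(cols, 1, 3)
```

The trade-off is the one described above: a new specialization is compiled for each distinct tuple type, in exchange for type-stable code inside the loop.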


This is an alternative to #76. We should probably merge it anyway, even if we merge #76 later, as it improves the performance of the current algorithm, making it possible to compare the merits of both approaches.

In the simple test used in #76, this makes grouping faster than DataFrames, and even slightly faster than #76. More benchmarks would be needed to check whether this is also the case for other scenarios. Unfortunately, join remains much slower than DataFrames for unrelated reasons.

The downside of this change is that the method will be recompiled for each combination of column types used for grouping. I think that's OK since the number of grouping columns is generally low, and the number of different types is limited too. I should note, though, that the approach in #76 only compiles one method for each column type, not for each combination of them.

Cc: @cjprybol @alyst

isequal_row(cols::Tuple{AbstractVector}, r1::Int, r2::Int) =
    isequal_colel(cols[1][r1], cols[1][r2])
isequal_row(cols::Tuple{Vararg{AbstractVector}}, r1::Int, r2::Int) =
    isequal_colel(cols[1][r1], cols[1][r2]) && isequal_row(Base.tail(cols), r1, r2)

This recursive trick allows the compiler to generate well-typed code even when the tuples have heterogeneous eltypes.
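To see the recursion at work on heterogeneous columns, here is a small self-contained sketch assuming Julia 1.x. The two `isequal_row` methods are repeated from the snippet under review so the example runs on its own; `isequal_colel` is replaced by an assumed stand-in that simply calls `isequal`, since the package's real definition is not shown here.

```julia
# Assumed stand-in: the real isequal_colel is defined in the package.
isequal_colel(a, b) = isequal(a, b)

# Base case: a single column left to compare.
isequal_row(cols::Tuple{AbstractVector}, r1::Int, r2::Int) =
    isequal_colel(cols[1][r1], cols[1][r2])
# Recursive case: compare the first column, then recurse on the remaining ones.
# Each call sees a tuple type with one element fewer, so every level is
# compiled for concrete column types.
isequal_row(cols::Tuple{Vararg{AbstractVector}}, r1::Int, r2::Int) =
    isequal_colel(cols[1][r1], cols[1][r2]) && isequal_row(Base.tail(cols), r1, r2)

# Three columns with different element types.
cols = ([1, 2, 1], ["a", "b", "a"], [1.5, 2.5, 3.5])

isequal_row(cols, 1, 3)        # false: rows 1 and 3 differ in the third column
isequal_row(cols[1:2], 1, 3)   # true: rows 1 and 3 agree on the first two columns
```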

nalimilan (Member, Author) commented Jul 11, 2017

Looks like I spoke too soon about join. Using the benchmark from JuliaData/DataFrames.jl#850 adapted to DataTables, I get a 3x performance improvement compared with master and with DataFrames. We need more systematic benchmarks to check in what cases this applies.

EDIT: also, using the benchmark from this Discourse post, performance is similar to DataFrames and only slightly worse than Pandas, when it was previously terrible.

aaowens commented Jul 24, 2017

The performance of by() is pretty close to Pandas with the Discourse example. On joins, Pandas is 2x faster and R's data.table is 3-4x faster.

… performance

This commit also adds a return type assertion for rowhash(). The fact that the type of
the columns isn't known at compile time appears to confuse inference,
which isn't able to detect that this function always returns UInt.
This greatly reduces the number of allocations when calling join(),
but doesn't really change performance.
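
For readers unfamiliar with the technique, here is a minimal sketch of a return type annotation, assuming Julia 1.x. `rowhash_sketch` is a hypothetical function written for this illustration, not the package's rowhash(), and the exact form of the assertion used in the commit may differ.

```julia
# Hypothetical function hashing row `r` across columns stored without
# concrete type information. Because each `col` is `Any`, inference cannot
# prove what `hash` returns; the `::UInt` annotation on the signature converts
# and asserts the result, so callers can rely on a concrete return type.
function rowhash_sketch(columns::Vector{Any}, r::Int)::UInt
    h = zero(UInt)
    for col in columns
        h = hash(col[r], h)
    end
    return h
end

columns = Any[[1, 2, 1], ["a", "b", "a"]]
h1 = rowhash_sketch(columns, 1)
h3 = rowhash_sketch(columns, 3)
```

Rows 1 and 3 hold equal values in both columns, so `h1 == h3`. With the annotation in place, code that consumes the result no longer has to handle a value of unknown type, which is consistent with the reduction in allocations reported above.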