GH-14866: [C++] Remove internal GroupBy implementation #14867
Conversation
westonpace commented Dec 7, 2022 (edited by jorisvandenbossche)
- Closes: [C++] Remove internal GroupBy implementation #14866
- Closes: [C++][Python] Segfault when calling groupby on table with misaligned chunks #34238
python/pyarrow/_compute.pyx (Outdated)
result_batch = []
for c_column in c_batch.values:
    result_batch.append(wrap_datum(c_column))
result_batches.append(result_batch)
Could this use ExecBatch::ToRecordBatch to return a list of batches instead? (That seems simpler to work with, and it also simplifies the code here.)
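For reference, a minimal sketch of what that suggestion could look like, assuming the caller supplies a schema (with whatever field names it invents); this is not code from the PR:

#include <vector>
#include <arrow/api.h>
#include <arrow/compute/exec.h>

// Sketch only: convert a vector of ExecBatch into RecordBatches via
// ExecBatch::ToRecordBatch, with a schema provided by the caller.
arrow::Result<std::vector<std::shared_ptr<arrow::RecordBatch>>> ToBatches(
    const std::vector<arrow::compute::ExecBatch>& exec_batches,
    const std::shared_ptr<arrow::Schema>& schema) {
  std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
  batches.reserve(exec_batches.size());
  for (const auto& exec_batch : exec_batches) {
    // Every value in the batch must be an array for this conversion to succeed.
    ARROW_ASSIGN_OR_RAISE(auto batch, exec_batch.ToRecordBatch(schema));
    batches.push_back(std::move(batch));
  }
  return batches;
}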
Yes, I was a bit torn on this one. The simplest thing might be for arrow::compute::GroupBy to return std::shared_ptr<RecordBatch>. However, I don't have column names, so I would be making those up. Also, the inputs are datums, so it seemed like a mismatch to receive datums (not arrays) and return a record batch (not an exec batch). So I ended up with a list of lists of arrays, which is unpleasant too.
I could return a list of record batches, but then I would have to copy the CreateSimpleSchema method, which invents names for these columns. Since the caller of this function has those names, and this function is private, I figured it best to leave that work to the caller.
That being said, after sleeping on this a bit, maybe a better change would be for arrow::compute::GroupBy to receive arrays (not datums); then returning a record batch wouldn't be inconsistent.
Ok, I ended up promoting arrow::compute::GroupBy to a "proper" convenience function. It now accepts arrays, returns a table, is a bit friendlier with field names, checks for invalid input, is added to the api.h file, and has unit tests.
@@ -185,7 +184,8 @@ Result<Datum> GroupByUsingExecPlan(const BatchesWithSchema& input,
Result<Datum> GroupByUsingExecPlan(const std::vector<Datum>& arguments,
Perhaps this could just be called GroupBy now.
I'm a little hesitant to use GroupBy, since that would mean the GroupBy function calls the GroupBy function, which is a little confusing. I changed it to RunGroupBy.
if (io_executor == NULLPTR) {
  io_executor = plan->exec_context()->executor();
}
Is this removal intended?
It's not terribly relevant. I initially wanted to use exec_batch_source here but ran into a problem because I was not transferring off the background generator. Incidentally, I think it may be close to time to fix the background generator to remove this limitation, but I have enough on my plate for the moment.
This change was simply because, a few lines down (on line 316), we have a very similar if condition:
if (io_executor == NULLPTR) {
io_executor = io::internal::GetIOThreadPool();
}
I don't think we need both of these statements and defaulting to the I/O thread pool seemed like the better default. If that is not the correct default, or there is some subtlety I am missing, let me know and I can revert this.
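Putting the two quoted snippets side by side (reconstructed from this thread, not a verbatim diff), the simplification is:

// Before: two fallbacks in sequence; the first is the one this PR removes.
if (io_executor == NULLPTR) {
  io_executor = plan->exec_context()->executor();  // removed
}
// ... a few lines later (line 316) ...
if (io_executor == NULLPTR) {
  io_executor = io::internal::GetIOThreadPool();  // kept: default to the I/O pool
}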
@jorisvandenbossche, it looks like this PR is pending your review. Could you take a look? I have another PR, which is important to me, that is waiting for this one.
Sorry for the slow reply. I don't fully understand how the above can work correctly (working on arrays instead of on chunked arrays), because you now calculate the group-by batch by batch. Shouldn't those results be "merged" somehow, rather than just concatenated?
To illustrate what I mean with a small example (using this branch): I created a table consisting of multiple chunks, and the result is incorrect because it is the concatenation of the individual results for each chunk.
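The original example isn't reproduced above; the sketch below builds the same shape of input with invented data, to show why per-chunk results cannot simply be concatenated:

#include <arrow/api.h>

// Sketch with invented data (not the example from the comment): a key column
// split across two chunks, where the keys "a" and "b" appear in both chunks.
arrow::Result<std::shared_ptr<arrow::ChunkedArray>> MakeChunkedKeys() {
  arrow::StringBuilder builder;
  std::shared_ptr<arrow::Array> chunk1, chunk2;
  ARROW_RETURN_NOT_OK(builder.AppendValues({"a", "b"}));
  ARROW_RETURN_NOT_OK(builder.Finish(&chunk1));
  ARROW_RETURN_NOT_OK(builder.AppendValues({"a", "b"}));
  ARROW_RETURN_NOT_OK(builder.Finish(&chunk2));
  // A group-by run per chunk and concatenated would emit "a" and "b" twice
  // (once per chunk); a correct group-by over the whole chunked column emits
  // each key once, with aggregates computed across both chunks.
  return std::make_shared<arrow::ChunkedArray>(arrow::ArrayVector{chunk1, chunk2});
}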
@westonpace, though I didn't review the code carefully, it looks like this PR removes code that is refactored and used in my ordered/segmented aggregation PR. Given this, and the correctness problem @jorisvandenbossche is pointing out in this PR, might it make sense to hold off on your removal until my PR is merged, or to include your removal in my PR?
@jorisvandenbossche it should be using an exec plan internally with an aggregate node. The aggregate node knows how to maintain state from batch to batch. However, I agree your example is pointing out a bug in my code. I'll take a look.
@rtpsw I will try merging this with your branch (and then create a third PR) just to make sure it works. I don't know if I can get to it before tomorrow morning. Either way, if there is concern about this approach, we can merge yours and clean up with mine. The basic idea is that we have kernel functions for arrays / single batches, and exec plans for multiple batches (which should include chunked arrays). I don't see any value in maintaining a third path for chunked arrays when they should just be a special case of multiple batches / exec plans.
But I assumed that it would be the purpose of the GroupBy helper to handle chunked arrays?
Yes. I see the problem now. The group-by helper needs an overload that accepts chunked arrays, converts them to a table, and then uses that as input.
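A minimal sketch of that conversion step, with invented field names (the actual overload in the PR may differ):

#include <string>
#include <vector>
#include <arrow/api.h>

// Sketch: wrap chunked columns in a single Table so one aggregation runs
// across all chunks instead of once per chunk.
arrow::Result<std::shared_ptr<arrow::Table>> CombineToTable(
    const std::vector<std::shared_ptr<arrow::ChunkedArray>>& columns) {
  arrow::FieldVector fields;
  for (size_t i = 0; i < columns.size(); ++i) {
    // Field names are invented here; the real helper would take caller names.
    fields.push_back(arrow::field("col_" + std::to_string(i), columns[i]->type()));
  }
  return arrow::Table::Make(arrow::schema(std::move(fields)), columns);
}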
@rtpsw 0f2b458 is an example of layering this PR on top of your ordered group-by changes. I couldn't get the aggregate node working, because I don't think it works at the moment, and I didn't want to dive too deep into that problem just yet. However, I don't see any reason your ordered aggregation won't work with this PR. Also, once things are working, we should be able to go further and:
I'm going to rebase this and address the problem Joris raised.
Force-pushed from 1b407e2 to 3a799ac
@jorisvandenbossche Thanks for pointing out that problem. I think I've addressed your concerns (and I've added your example as a test case).
Force-pushed from 9baee4b to 412a32d
@jorisvandenbossche friendly ping now that the holidays are over.
Looks good! (I only looked at the cython code again, and at the expected behaviour.) Just some small clean-ups needed.
Force-pushed from 412a32d to c0cc97a
I've rebased this and will merge if CI is still passing.
Force-pushed from c9fc872 to 57d0ebd
… directly instead of emulating one
apacheGH-14866: converted GroupBy into a proper convenience function, accepting arrays and returning a table, with unit tests
Force-pushed from 5e60530 to cd40e50
closes #34238
@westonpace BTW, if you want the issue to be closed automatically, you need to add the "closes #34238" to the top comment, not only in a comment like the one above (our tooling still won't automatically assign and milestone the issue, though; it's only GitHub that will automatically close it).
@jorisvandenbossche Oh! Thanks for catching that. I hadn't realized that (I think this is the first PR I've done that closed multiple issues).
Benchmark runs are scheduled for baseline = a988302 and contender = 92d91f5. 92d91f5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…14867)
* Closes: apache#14866

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>