
GH-14866: [C++] Remove internal GroupBy implementation #14867

Merged

Conversation

@westonpace (Member) commented Dec 7, 2022

@github-actions bot commented Dec 7, 2022

⚠️ GitHub issue #14866 has been automatically assigned in GitHub to PR creator.

@github-actions bot commented Dec 7, 2022

⚠️ GitHub issue #14866 has no components, please add labels for components.

@westonpace (Member, Author)

This may be interesting to @lidavidm, @bkietz, and @amol-

result_batch = []
# each value in the C++ ExecBatch is a Datum; wrap it as a Python object
for c_column in c_batch.values:
    result_batch.append(wrap_datum(c_column))
result_batches.append(result_batch)

Member:

Could this use ExecBatch::ToRecordBatch to return a list of record batches instead? (That seems simpler to work with, and would also simplify the code here.)
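
For context, here is roughly what that suggestion looks like. A minimal sketch, assuming the caller already has a schema whose fields match the batch's values; the helper name ToBatch is hypothetical, not code from this PR:

    #include <memory>
    #include <utility>

    #include "arrow/compute/exec.h"
    #include "arrow/record_batch.h"
    #include "arrow/result.h"

    // ExecBatch::ToRecordBatch materializes each Datum value in the batch
    // as a column of a RecordBatch under the given schema.
    arrow::Result<std::shared_ptr<arrow::RecordBatch>> ToBatch(
        const arrow::compute::ExecBatch& batch,
        std::shared_ptr<arrow::Schema> schema) {
      return batch.ToRecordBatch(std::move(schema));
    }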

Member Author:

Yes, I was a bit torn on this one. The simplest thing might be for arrow::compute::GroupBy to return std::shared_ptr<RecordBatch>. However, I don't have column names, so I would be making those up. Also, the inputs are datums, so it seemed like a mismatch to receive datums (not arrays) and return a record batch (not an exec batch). So I ended up with a list of lists of arrays, which is unpleasant too.

I could return a list of record batches, but then I would have to copy the CreateSimpleSchema method, which invents names for these columns. Since the caller of this function has those names, and this function is private, I figured it best to leave that work to the caller.

That being said, after sleeping on this a bit, maybe a better change would be for arrow::compute::GroupBy to receive arrays (not datums); then returning a record batch wouldn't be inconsistent.

Member Author:

Ok, I ended up promoting arrow::compute::GroupBy to a "proper" convenience function. It now accepts arrays, returns a table, is a bit friendlier with field names, checks for invalid input, is added to the api.h file, and has unit tests.
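
For illustration, a usage sketch of the promoted function. The exact signature and the aggregate spelling below are assumptions inferred from the description above (arrays in, table out), not code quoted from the PR:

    #include <memory>

    #include "arrow/compute/api.h"  // the PR adds GroupBy to api.h

    // Assumed call shape: one "hash_sum" aggregate over `values`, grouped by
    // `keys`, returning a Table with one row per distinct key.
    arrow::Result<std::shared_ptr<arrow::Table>> SumByKey(
        std::shared_ptr<arrow::Array> values,
        std::shared_ptr<arrow::Array> keys) {
      return arrow::compute::GroupBy({values}, {keys},
                                     {{"hash_sum", /*options=*/nullptr}});
    }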

@@ -185,7 +184,8 @@ Result<Datum> GroupByUsingExecPlan(const BatchesWithSchema& input,
Result<Datum> GroupByUsingExecPlan(const std::vector<Datum>& arguments,
Contributor:

Perhaps this could just be called GroupBy now.

Member Author:

I'm a little hesitant to use GroupBy, since that would mean the GroupBy function calls the GroupBy function, which is a little confusing. I changed it to RunGroupBy.

Comment on the following lines:
    if (io_executor == NULLPTR) {
      io_executor = plan->exec_context()->executor();
    }
Contributor:

Is this removal intended?

Member Author:

It's not terribly relevant. I initially wanted to use exec_batch_source here but ran into a problem because I was not transferring off the background generator. Incidentally, I think it may be close to time to fix the background generator to remove this limitation, but I have enough on my plate for the moment.

This change was simply because, a few lines down (on line 316), we have a very similar if condition:

    if (io_executor == NULLPTR) {
      io_executor = io::internal::GetIOThreadPool();
    }

I don't think we need both of these statements, and defaulting to the I/O thread pool seemed like the better default. If that is not the correct default, or there is some subtlety I am missing, let me know and I can revert this.

@rtpsw (Contributor) commented Dec 22, 2022

@jorisvandenbossche, it looks like this PR is pending your review. Could you take a look? I have another PR, which is important to me, that is waiting for this one.

@jorisvandenbossche (Member)

> Ok, I ended up promoting arrow::compute::GroupBy to a "proper" convenience function. It now accepts arrays, returns a table, is a bit friendlier with field names, checks for invalid input, is added to the api.h file, and has unit tests.

Sorry for the slow reply. I don't fully understand how the above can work correctly (working on arrays instead of on chunked arrays), because you now calculate the groupby batch by batch? But then those results should be "merged" somehow, not just concatenated?

@jorisvandenbossche (Member)

To illustrate what I mean, using a small example (and using this branch):

In [7]: table = pa.table({'key': ['a', 'b', 'a', 'b', 'a', 'b'], 'col': range(6)})

In [8]: table = pa.Table.from_batches(table.to_batches(max_chunksize=3))

In [9]: table.to_pandas()
Out[9]: 
  key  col
0   a    0
1   b    1
2   a    2
3   b    3
4   a    4
5   b    5

In [10]: table.group_by('key').aggregate([('col', 'sum')]).to_pandas()
Out[10]: 
   col_sum key
0        2   a
1        1   b
2        8   b
3        4   a

I created a table consisting of multiple chunks, and then the result is incorrect: it is the concatenation of the individual results of each chunk. (The correct result has a single row per key: a sum of 6 for 'a' and 9 for 'b'.)

@rtpsw (Contributor) commented Dec 22, 2022

@westonpace, though I didn't review the code carefully, it looks like this PR removes code that is refactored and used in my ordered/segmented aggregation PR. Given this, and the correctness problem @jorisvandenbossche is pointing out in this PR, might it make sense to hold off on the removal until my PR is merged? Or to include the removal in my PR?

@westonpace (Member, Author)

@jorisvandenbossche it should be using an exec plan internally with an aggregate node. The aggregate node knows how to maintain state from batch to batch. However, I agree your example is pointing out a bug in my code. I'll take a look.

@rtpsw I will try merging this with your branch (and then create a third PR) just to make sure it works. I don't know if I can get to it before tomorrow morning. Either way, if there is concern about this approach, we can merge yours and clean up with mine.

The basic idea is that we have kernel functions for arrays / single batches and exec plans for multiple batches (which should include chunked arrays).

I don't see any value in maintaining a third path for chunked arrays when they should just be a special case of multiple batches / exec plans.
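
To sketch that point (illustrative only, not code from this PR): a chunked input is already consumable as a stream of record batches, which is exactly the unit of work an exec plan processes:

    #include <memory>

    #include "arrow/record_batch.h"
    #include "arrow/status.h"
    #include "arrow/table.h"

    // Iterate a (possibly multi-chunk) Table as record batches, the same
    // granularity at which an exec plan's source node emits data.
    arrow::Status VisitAsBatches(const arrow::Table& table) {
      arrow::TableBatchReader reader(table);
      std::shared_ptr<arrow::RecordBatch> batch;
      while (true) {
        ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
        if (batch == nullptr) break;  // end of stream
        // ... feed `batch` into the plan here ...
      }
      return arrow::Status::OK();
    }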

@jorisvandenbossche (Member)

> it should be using an exec plan internally with an aggregate node.

But I assumed that it would be the purpose of the GroupBy(..) helper to do this? (its doc comment says "The result will be calculated using an exec plan with an aggregate node")

@westonpace (Member, Author)

Yes. I see the problem now. The group by helper needs an overload that accepts chunked arrays, converts them to a table, and then uses that as input.
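
A sketch of what such an overload could do; the helper name CombineForGroupBy, the placeholder field names, and the overall shape are illustrative assumptions, not the PR's actual code:

    #include <memory>
    #include <string>
    #include <vector>

    #include "arrow/chunked_array.h"
    #include "arrow/table.h"
    #include "arrow/type.h"

    // Gather the chunked inputs into one Table so the table-based GroupBy
    // path can run a single exec plan over all chunks, merging (rather than
    // concatenating) per-chunk results.
    std::shared_ptr<arrow::Table> CombineForGroupBy(
        const std::vector<std::shared_ptr<arrow::ChunkedArray>>& arguments,
        const std::vector<std::shared_ptr<arrow::ChunkedArray>>& keys) {
      std::vector<std::shared_ptr<arrow::ChunkedArray>> columns = arguments;
      columns.insert(columns.end(), keys.begin(), keys.end());
      std::vector<std::shared_ptr<arrow::Field>> fields;
      for (size_t i = 0; i < columns.size(); ++i) {
        // Placeholder names; the real helper would derive friendlier ones.
        fields.push_back(
            arrow::field("f" + std::to_string(i), columns[i]->type()));
      }
      // The resulting table would then be fed to the table-based GroupBy.
      return arrow::Table::Make(arrow::schema(fields), columns);
    }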

@westonpace (Member, Author)

@rtpsw 0f2b458 is an example of layering this PR on top of your ordered groupby changes. I couldn't get the aggregate node working because I don't think it works at the moment and didn't want to dive too deep into that problem just yet. However, I don't see any reason your ordered aggregation won't work with this PR. Also, once things are working, we should be able to go further and:

  • remove chunked array from datum
  • remove all grouper.cc/grouper.h changes which use spans and instead only use batches

@westonpace (Member, Author)

I'm going to rebase this and address the problem Joris raised.

@westonpace force-pushed the feature/14866--remove-internal-groupby branch from 1b407e2 to 3a799ac on December 23, 2022 at 17:21
@westonpace (Member, Author)

@jorisvandenbossche Thanks for pointing out that problem. I think I've addressed your concerns (and I've added your example as a test case).

@westonpace force-pushed the feature/14866--remove-internal-groupby branch from 9baee4b to 412a32d on January 4, 2023 at 13:57
@westonpace (Member, Author)

@jorisvandenbossche friendly ping now that the holidays are over.

@jorisvandenbossche (Member) left a comment

Looks good! (I only looked at the Cython code again, and at the expected behaviour.)

Just some small clean-ups are needed.

python/pyarrow/table.pxi (review comments: outdated, resolved)
python/pyarrow/tests/test_table.py (review comments: outdated, resolved)
@westonpace force-pushed the feature/14866--remove-internal-groupby branch from 412a32d to c0cc97a on January 19, 2023 at 00:37
@westonpace requested a review from AlenkaF as a code owner on January 19, 2023 at 00:37
@westonpace (Member, Author)

I've rebased this and will merge if CI is still passing.

… directly instead of emulating one

apacheGH-14866: converted GroupBy into a proper convenience function, accepting arrays and returning table, with unit tests
@westonpace (Member, Author)

closes #34238

@jorisvandenbossche (Member)

@westonpace BTW, if you want the issue to be closed automatically, you need to add the "closes #34238" to the top comment, not only in a comment like the one above (our tooling still won't automatically assign and milestone the issue, though; it's only GitHub that will close it automatically).

@westonpace (Member, Author)

@jorisvandenbossche Oh! Thanks for catching that. I hadn't realized that (I think this is the first PR I've done that closed multiple issues).

@ursabot commented Feb 22, 2023

Benchmark runs are scheduled for baseline = a988302 and contender = 92d91f5. 92d91f5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.49% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.48% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 92d91f53 ec2-t3-xlarge-us-east-2
[Finished] 92d91f53 test-mac-arm
[Finished] 92d91f53 ursa-i9-9960x
[Finished] 92d91f53 ursa-thinkcentre-m75q
[Finished] a9883024 ec2-t3-xlarge-us-east-2
[Failed] a9883024 test-mac-arm
[Finished] a9883024 ursa-i9-9960x
[Finished] a9883024 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
…14867)

* Closes: apache#14866

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>