[C++] Move Acero out of libarrow #15280

wjones127 · 2023-01-09T21:17:04Z

Describe the enhancement requested

libarrow_acero? libacero?

The core libarrow is largely stable while Acero remains experimental. It would be nice to eventually let these libraries evolve somewhat independently, like arrow-rs and datafusion.

[C++] Second block of refactoring to move acero out of libarrow #34630

Component(s)

C++

The text was updated successfully, but these errors were encountered:

wjones127 · 2023-01-09T22:32:38Z

libarrow_dataset depends on Acero, and is tightly integrated with it. Should it be moved into libacero? or renamed libacero_dataset?

kou · 2023-01-10T07:50:01Z

I prefer to libacero because Gandiva uses libgandiva not libarrow_gandiva.
ADBC also uses libadbc_* not libarrow_adbc_*.

jorisvandenbossche · 2023-01-10T12:41:16Z

What part would we want to move to a separate library?
Right now Acero lives in arrow/compute/exec/, right? Would we only move that, or the full of arrow/compute ? (xref ARROW-8891 - "Split non-cast compute kernels into a separate shared library")

rok · 2023-01-10T12:51:17Z

How usable are compute kernels now without Acero? How usable is Acero without kernels?

jorisvandenbossche · 2023-01-10T13:01:48Z

How usable is Acero without kernels?

Acero would then of course have to depend on libarrow (or libarrow_compute) for the kernels.

How usable are compute kernels now without Acero?

Very usable (if you are OK with the numpy paradigm of having eager functions for all operations). The current integration in pandas to work with arrow data is built on top of plain compute kernels.

rok · 2023-01-10T13:25:16Z

So then another option is to split further to libkernels, libacero. I don't have an opinion on the split though :).

paleolimbot · 2023-01-10T14:29:54Z

I would love to see this split, although the ability to distribute it separately (e.g., so that a separate R or Python 'acero' package could exist) is something I hope this would be a step in the direction of.

wjones127 · 2023-01-10T16:48:25Z

What part would we want to move to a separate library?
Right now Acero lives in arrow/compute/exec/, right? Would we only move that, or the full of arrow/compute ? (xref ARROW-8891 - "Split non-cast compute kernels into a separate shared library")

Yes, just arrow/compute/exec/. Splitting compute functions is its own issue and much messier.

westonpace · 2023-01-10T21:58:52Z

libacero will be easier than libkernels (there are still parts of the core lib that depend on compute kernels, I think) though both seem valid.

The only gray area I can see is group-by. There is currently a standalone group-by implementation that runs outside of Acero, that is marked internal, that I am trying to get rid of.

My vote would be that "group-by" is a part of Acero. Note, the hash_ aggregate kernels themselves could remain as part of compute/kernels. Aggregations would still be usable outside of Acero.

If that's the case then you could also move arrow/compute/row to Acero (and probably move it to arrow/compute/exec/row while you are at it).

I prefer to libacero because Gandiva uses libgandiva not libarrow_gandiva.
ADBC also uses libadbc_* not libarrow_adbc_*.

Both datasets and substrait will need to depend on libacero. It would be a bit weird to have libarrow_dataset -> libacero -> libarrow_compute. I'm ok with the weirdness though.

wjones127 · 2023-01-10T22:03:25Z

Both datasets and substrait will need to depend on libacero. It would be a bit weird to have libarrow_dataset -> libacero -> libarrow_compute. I'm ok with the weirdness though.

Yeah that's why I was contemplating renaming libarrow_dataset to libacero_dataset.

wjones127 · 2023-01-11T21:11:50Z

I spent some time looking at the necessary changes, and it occured to me with have a bit of a mess with namespaces as well. We introduced a library called libarrow_substrait and that contains the namespace arrow::engine. Nothing else uses that namespace. Acero (i.e. the code in src/arrow/compute/exec all lives in the arrow::compute namespace.

Although I'd like to have this issue somewhat limited in scope, it would be nice to know where we are trying to go with the library structure. I created a document outlining the current structure, an a couple ideas of a slightly more sane structure:

https://docs.google.com/document/d/1IcQfqFlvMgM3GreNbAvax0_2XKFHRdmcs9nqTPQlhOs/edit?usp=sharing

I welcome any comments or alternative suggestions.

westonpace · 2023-01-11T23:07:21Z

Thanks for the detailed analysis and the suggestions. engine echos back to a time when we were thinking there might be potentially more formats like Substrait that could drive the engine. It was an early standin for "acero" as the name wasn't agreed at the time. I'm fine getting rid of the term "engine" and merging substrait & acero and agree that is a good idea.

I'm a little hesitant about lumping "datasets" with Acero. Datasets is somewhat heavyweight, it requires parquet, orc, csv, filesystems, etc. None of those things are relevant to the core Acero. One could imagine Acero being used without any need for file I/O. For example, flight input, apply some compute operations, flight output.

So my preference would be second chart you drew that has libacero and libacero_dataset.

icexelloss · 2023-01-12T14:40:58Z

For what it's worth, as a Acero user I currently don't use dataset and don't see myself using it in the near future. So I am +1 not putting dataset into libacero and have a seperate so files for it.

wjones127 · 2023-01-13T16:03:51Z

cc @jorisvandenbossche @zeroshade

zeroshade · 2023-01-23T16:24:24Z

CC @benibus

ildipo · 2023-03-02T23:12:54Z

Since at TS we need this I'll be doing this refactoring with the solution n2 in the google doc.

…oving acero out of libarrow (#34518) ### Rationale for this change In order to remove acero from libarrow we need to reorganize the code to reduce entanglement between compute/exec and the rest of the product by removing any dependency from the codebase (except dataset) into compute/exec. ### What changes are included in this PR? This PR removes some of the dependencies: * some of the functions in `compute/exec/util.[h,cc]` that are widely used are moved to a new `compute/util.[h,cc] `file * `key_map` and `key_hash` are widely used and thus moved from compute/exec to compute * some light_array tests do use functionalities inside `compute/exec`, so the tests are split into 2 parts, one inside `compute` and another inside `compute/exec` ### Are these changes tested? The existing tests cover all these changes. One additional test program has been added by splitting light_array_test in two parts. ### Are there any user-facing changes? no * Closes: #15280

icexelloss · 2023-03-14T13:55:38Z

The above PR doesn't not close this issue - only a very first step towards it

ildipo · 2023-03-16T21:14:19Z

@icexelloss @westonpace

I think we need a slightly different solution:

substrait uses dataset
dataset uses acero
acero uses arrow

my plan is to introduce a libacero (from compute/exec) which depends only on libarrow and leave libsubstrait as is, make libdataset depend on libacero

westonpace · 2023-03-16T22:37:29Z

@ildipo

Do you mean "substrait uses dataset AND acero"? Then I agree.

ildipo · 2023-03-17T12:14:23Z

yes that is correct - it uses both

icexelloss · 2023-03-17T13:34:26Z

@ildipo where does substrait uses dataset? (This is slightly surprising to me because I thought substrait code mostly just read in substrait plan protobuf and turn into an Acero declaration)

ildipo · 2023-03-17T15:41:26Z

https://github.com/ildipo/arrow/blob/main/cpp/src/arrow/engine/substrait/relation_internal.cc

…rrow (#34575) ### Rationale for this change This is the second and last block of refactoring before moving acero out of libarrow. This PR removes the remaining dependencies from libarrow into the compute/exec directory. ### What changes are included in this PR? In detail: * compute/exec/expression is moved inside compute since it is needed in several kernels * all compute/exec specific include files are removed from compute/api.h * light_array_exec_test.cc i(introduced in the previous refactoring) s merged back into light_array_test * testing generators are refactored to remove the need for exec/options.h * some utilities are moved from compute/exec/util to compute/util * updated python and r libraries usage of moved files ### Are these changes tested? Yes using existing tests (some of which have been updated due to refactoring). ### Are there any user-facing changes? no * Issue: #15280 * Closes: #34630

westonpace · 2023-03-22T12:55:45Z

@ildipo where does substrait uses dataset? (This is slightly surprising to me because I thought substrait code mostly just read in substrait plan protobuf and turn into an Acero declaration)

@icexelloss the scan and write nodes are provided by datasets. You can use source (e.g. an in-memory source) with plain Acero but if you want to scan files you need datasets. Substrait's ReadRel when LocalFiles will use a scan node.

…f libarrow (apache#34575) ### Rationale for this change This is the second and last block of refactoring before moving acero out of libarrow. This PR removes the remaining dependencies from libarrow into the compute/exec directory. ### What changes are included in this PR? In detail: * compute/exec/expression is moved inside compute since it is needed in several kernels * all compute/exec specific include files are removed from compute/api.h * light_array_exec_test.cc i(introduced in the previous refactoring) s merged back into light_array_test * testing generators are refactored to remove the need for exec/options.h * some utilities are moved from compute/exec/util to compute/util * updated python and r libraries usage of moved files ### Are these changes tested? Yes using existing tests (some of which have been updated due to refactoring). ### Are there any user-facing changes? no * Issue: apache#15280 * Closes: apache#34630

jorisvandenbossche · 2023-03-30T16:48:13Z

The current PR #34711 creates a src/arrow/acero (so still part of the arrow namespace), adn not to src/acero. Both have been mentioned above, I don't know if there was a clear conclusion on that question? (I don't have a strong opinion myself)

wjones127 · 2023-03-30T17:19:56Z

Agree that wasn't what I initially hoped, but I think the argument for doing it like this is:

We don't want to move libarrow_dataset to libacero_dataset, which causes more breaking changes for users than strictly necessary
We have no intention of having Acero usable without Arrow (unlike the Parquet codebase, which theoretically should be splittable into its own repo if we ever want to).

icexelloss · 2023-03-30T18:35:06Z

I agree with Will here that Acero is still better to live under the Arrow umbrella since there are internal dependency (e.g., dataset depend on acero, arrow R dependency)

…g previously in compute/exec (#34711) ### Rationale for this change See the linked issue ### What changes are included in this PR? C++: * remove all compute/exec/* from libarrow * rename compute/exec -> acero and make libarrow_acero * add new ARROW_ACERO option, required if ARROW_DATASET is on * libarrow_dataset now depends on libarrow_acero c_glib: add the new libarrow_acero dependency - we disallow building glib without it python: added PYARROW_BUILD_ACERO, set to on if DATASETS are built ### Are these changes tested? All the standard tests do work properly. I manually compiled C++ with: * no ARROW_ACERO * ARROW_ACERO and no ARROW_DATASET * ARROW_ACERO and ARROW_DATASET and no ARROW_SUBSTRAIT I manually compiled python without ACERO & DATASET and with ACERO and without DATASET ### Are there any user-facing changes? If users include compute/exec files directly then they'll have to update their code. * Closes: #15280 Lead-authored-by: Davide Pasetto <dpasetto69@gmail.com> Co-authored-by: Li Jin <ice.xelloss@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Li Jin <ice.xelloss@gmail.com>

…rything previously in compute/exec (apache#34711) ### Rationale for this change See the linked issue ### What changes are included in this PR? C++: * remove all compute/exec/* from libarrow * rename compute/exec -> acero and make libarrow_acero * add new ARROW_ACERO option, required if ARROW_DATASET is on * libarrow_dataset now depends on libarrow_acero c_glib: add the new libarrow_acero dependency - we disallow building glib without it python: added PYARROW_BUILD_ACERO, set to on if DATASETS are built ### Are these changes tested? All the standard tests do work properly. I manually compiled C++ with: * no ARROW_ACERO * ARROW_ACERO and no ARROW_DATASET * ARROW_ACERO and ARROW_DATASET and no ARROW_SUBSTRAIT I manually compiled python without ACERO & DATASET and with ACERO and without DATASET ### Are there any user-facing changes? If users include compute/exec files directly then they'll have to update their code. * Closes: apache#15280 Lead-authored-by: Davide Pasetto <dpasetto69@gmail.com> Co-authored-by: Li Jin <ice.xelloss@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Li Jin <ice.xelloss@gmail.com>

I think we got this part wrong in conan-io#17540 I'm not sure if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not sure what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about.

I think we got this part wrong in conan-io#17540 It's not clear if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about.

I think we got this part wrong in #17540 It's not clear if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about. Co-authored-by: James <james@conan.io>

I think we got this part wrong in conan-io#17540 It's not clear if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about. Co-authored-by: James <james@conan.io>

wjones127 added Type: enhancement Component: C++ labels Jan 9, 2023

wjones127 mentioned this issue Jan 19, 2023

GH-33659: [Developer Tools] Add definition of Breaking Change and Critical Fix #33660

Merged

github-actions bot mentioned this issue Mar 9, 2023

GH-15280: [C++] Refactor to reorganize dependencies as a prequel to moving acero out of libarrow #34518

Merged

github-actions bot assigned ildipo Mar 9, 2023

icexelloss closed this as completed in #34518 Mar 14, 2023

icexelloss reopened this Mar 14, 2023

github-actions bot mentioned this issue Mar 15, 2023

GH-34630: [C++] Second block of refactoring to move acero out of libarrow #34575

Merged

github-actions bot mentioned this issue Mar 23, 2023

GH-15280: [C++][Python][GLib] add libarrow_acero containing everything previously in compute/exec #34711

Merged

icexelloss closed this as completed in #34711 Apr 1, 2023

github-actions bot mentioned this issue Apr 4, 2023

GH-33976: [Python] Clean-up Acero related declarations in libarrow_acero.pxd #34773

Merged

raulcd added this to the 12.0.0 milestone Apr 18, 2023

wjones127 added the Breaking Change Includes a breaking change to the API label Apr 26, 2023

raulcd mentioned this issue May 2, 2023

Be explicit about ARROW_ACERO; add tests; remove use_local in migrators conda-forge/arrow-cpp-feedstock#1037

Merged

5 tasks

This was referenced May 10, 2023

[arrow] Update to 12.0.0 microsoft/vcpkg#31321

Merged

arrow: add version 12.0.0 conan-io/conan-center-index#17540

Merged

mpimenov mentioned this issue May 24, 2023

arrow: fix the library name for the libacero component conan-io/conan-center-index#17691

Merged

tjni mentioned this issue Jul 2, 2023

Build failure: rPackages.arrow NixOS/nixpkgs#241026

Closed

assignUser mentioned this issue Aug 30, 2023

On ubuntu:bionic can not install libarrow-acero-dev via apt #37452

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Move Acero out of libarrow #15280

[C++] Move Acero out of libarrow #15280

wjones127 commented Jan 9, 2023 •

edited by assignUser

Loading

wjones127 commented Jan 9, 2023

kou commented Jan 10, 2023

jorisvandenbossche commented Jan 10, 2023

rok commented Jan 10, 2023

jorisvandenbossche commented Jan 10, 2023

rok commented Jan 10, 2023

paleolimbot commented Jan 10, 2023

wjones127 commented Jan 10, 2023

westonpace commented Jan 10, 2023

wjones127 commented Jan 10, 2023

wjones127 commented Jan 11, 2023

westonpace commented Jan 11, 2023

icexelloss commented Jan 12, 2023

wjones127 commented Jan 13, 2023

zeroshade commented Jan 23, 2023

ildipo commented Mar 2, 2023

icexelloss commented Mar 14, 2023

ildipo commented Mar 16, 2023

westonpace commented Mar 16, 2023 •

edited

Loading

ildipo commented Mar 17, 2023

icexelloss commented Mar 17, 2023 •

edited

Loading

ildipo commented Mar 17, 2023

westonpace commented Mar 22, 2023

jorisvandenbossche commented Mar 30, 2023 •

edited

Loading

wjones127 commented Mar 30, 2023

icexelloss commented Mar 30, 2023 •

edited

Loading

[C++] Move Acero out of libarrow #15280

[C++] Move Acero out of libarrow #15280

Comments

wjones127 commented Jan 9, 2023 • edited by assignUser Loading

Describe the enhancement requested

Component(s)

wjones127 commented Jan 9, 2023

kou commented Jan 10, 2023

jorisvandenbossche commented Jan 10, 2023

rok commented Jan 10, 2023

jorisvandenbossche commented Jan 10, 2023

rok commented Jan 10, 2023

paleolimbot commented Jan 10, 2023

wjones127 commented Jan 10, 2023

westonpace commented Jan 10, 2023

wjones127 commented Jan 10, 2023

wjones127 commented Jan 11, 2023

westonpace commented Jan 11, 2023

icexelloss commented Jan 12, 2023

wjones127 commented Jan 13, 2023

zeroshade commented Jan 23, 2023

ildipo commented Mar 2, 2023

icexelloss commented Mar 14, 2023

ildipo commented Mar 16, 2023

westonpace commented Mar 16, 2023 • edited Loading

ildipo commented Mar 17, 2023

icexelloss commented Mar 17, 2023 • edited Loading

ildipo commented Mar 17, 2023

westonpace commented Mar 22, 2023

jorisvandenbossche commented Mar 30, 2023 • edited Loading

wjones127 commented Mar 30, 2023

icexelloss commented Mar 30, 2023 • edited Loading

wjones127 commented Jan 9, 2023 •

edited by assignUser

Loading

westonpace commented Mar 16, 2023 •

edited

Loading

icexelloss commented Mar 17, 2023 •

edited

Loading

jorisvandenbossche commented Mar 30, 2023 •

edited

Loading

icexelloss commented Mar 30, 2023 •

edited

Loading