Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Move Acero out of libarrow #15280

Closed
1 task done
wjones127 opened this issue Jan 9, 2023 · 26 comments · Fixed by #34518 or #34711
Closed
1 task done

[C++] Move Acero out of libarrow #15280

wjones127 opened this issue Jan 9, 2023 · 26 comments · Fixed by #34518 or #34711
Assignees
Labels
Breaking Change Includes a breaking change to the API Component: C++ Type: enhancement
Milestone

Comments

@wjones127
Copy link
Member

wjones127 commented Jan 9, 2023

Describe the enhancement requested

libarrow_acero? libacero?

The core libarrow is largely stable while Acero remains experimental. It would be nice to eventually let these libraries evolve somewhat independently, like arrow-rs and datafusion.

Component(s)

C++

@wjones127
Copy link
Member Author

libarrow_dataset depends on Acero, and is tightly integrated with it. Should it be moved into libacero? or renamed libacero_dataset?

@kou
Copy link
Member

kou commented Jan 10, 2023

I prefer to libacero because Gandiva uses libgandiva not libarrow_gandiva.
ADBC also uses libadbc_* not libarrow_adbc_*.

@jorisvandenbossche
Copy link
Member

What part would we want to move to a separate library?
Right now Acero lives in arrow/compute/exec/, right? Would we only move that, or the full of arrow/compute ? (xref ARROW-8891 - "Split non-cast compute kernels into a separate shared library")

@rok
Copy link
Member

rok commented Jan 10, 2023

How usable are compute kernels now without Acero? How usable is Acero without kernels?

@jorisvandenbossche
Copy link
Member

How usable is Acero without kernels?

Acero would then of course have to depend on libarrow (or libarrow_compute) for the kernels.

How usable are compute kernels now without Acero?

Very usable (if you are OK with the numpy paradigm of having eager functions for all operations). The current integration in pandas to work with arrow data is built on top of plain compute kernels.

@rok
Copy link
Member

rok commented Jan 10, 2023

So then another option is to split further to libkernels, libacero. I don't have an opinion on the split though :).

@paleolimbot
Copy link
Member

I would love to see this split, although the ability to distribute it separately (e.g., so that a separate R or Python 'acero' package could exist) is something I hope this would be a step in the direction of.

@wjones127
Copy link
Member Author

What part would we want to move to a separate library?
Right now Acero lives in arrow/compute/exec/, right? Would we only move that, or the full of arrow/compute ? (xref ARROW-8891 - "Split non-cast compute kernels into a separate shared library")

Yes, just arrow/compute/exec/. Splitting compute functions is its own issue and much messier.

@westonpace
Copy link
Member

libacero will be easier than libkernels (there are still parts of the core lib that depend on compute kernels, I think) though both seem valid.

The only gray area I can see is group-by. There is currently a standalone group-by implementation that runs outside of Acero, that is marked internal, that I am trying to get rid of.

My vote would be that "group-by" is a part of Acero. Note, the hash_ aggregate kernels themselves could remain as part of compute/kernels. Aggregations would still be usable outside of Acero.

If that's the case then you could also move arrow/compute/row to Acero (and probably move it to arrow/compute/exec/row while you are at it).

I prefer to libacero because Gandiva uses libgandiva not libarrow_gandiva.
ADBC also uses libadbc_* not libarrow_adbc_*.

Both datasets and substrait will need to depend on libacero. It would be a bit weird to have libarrow_dataset -> libacero -> libarrow_compute. I'm ok with the weirdness though.

@wjones127
Copy link
Member Author

Both datasets and substrait will need to depend on libacero. It would be a bit weird to have libarrow_dataset -> libacero -> libarrow_compute. I'm ok with the weirdness though.

Yeah that's why I was contemplating renaming libarrow_dataset to libacero_dataset.

@wjones127
Copy link
Member Author

I spent some time looking at the necessary changes, and it occured to me with have a bit of a mess with namespaces as well. We introduced a library called libarrow_substrait and that contains the namespace arrow::engine. Nothing else uses that namespace. Acero (i.e. the code in src/arrow/compute/exec all lives in the arrow::compute namespace.

Although I'd like to have this issue somewhat limited in scope, it would be nice to know where we are trying to go with the library structure. I created a document outlining the current structure, an a couple ideas of a slightly more sane structure:

https://docs.google.com/document/d/1IcQfqFlvMgM3GreNbAvax0_2XKFHRdmcs9nqTPQlhOs/edit?usp=sharing

I welcome any comments or alternative suggestions.

@westonpace
Copy link
Member

Thanks for the detailed analysis and the suggestions. engine echos back to a time when we were thinking there might be potentially more formats like Substrait that could drive the engine. It was an early standin for "acero" as the name wasn't agreed at the time. I'm fine getting rid of the term "engine" and merging substrait & acero and agree that is a good idea.

I'm a little hesitant about lumping "datasets" with Acero. Datasets is somewhat heavyweight, it requires parquet, orc, csv, filesystems, etc. None of those things are relevant to the core Acero. One could imagine Acero being used without any need for file I/O. For example, flight input, apply some compute operations, flight output.

So my preference would be second chart you drew that has libacero and libacero_dataset.

@icexelloss
Copy link
Contributor

For what it's worth, as a Acero user I currently don't use dataset and don't see myself using it in the near future. So I am +1 not putting dataset into libacero and have a seperate so files for it.

@wjones127
Copy link
Member Author

cc @jorisvandenbossche @zeroshade

@zeroshade
Copy link
Member

CC @benibus

@ildipo
Copy link
Contributor

ildipo commented Mar 2, 2023

Since at TS we need this I'll be doing this refactoring with the solution n2 in the google doc.

icexelloss pushed a commit that referenced this issue Mar 14, 2023
…oving acero out of libarrow (#34518)

### Rationale for this change

In order to remove acero from libarrow we need to reorganize the code to
reduce entanglement between compute/exec and the rest of the product by
removing any dependency from the codebase (except dataset) into
compute/exec.

### What changes are included in this PR?

This PR removes some of the dependencies:
* some of the functions in `compute/exec/util.[h,cc]` that are widely
used are moved to a new `compute/util.[h,cc] `file
* `key_map` and `key_hash` are widely used and thus moved from
compute/exec to compute
* some light_array tests do use functionalities inside `compute/exec`,
so the tests are split into 2 parts, one inside `compute` and another
inside `compute/exec`

### Are these changes tested?

The existing tests cover all these changes.
One additional test program has been added by splitting light_array_test
in two parts.

### Are there any user-facing changes?

no

* Closes: #15280
@icexelloss icexelloss reopened this Mar 14, 2023
@icexelloss
Copy link
Contributor

The above PR doesn't not close this issue - only a very first step towards it

@ildipo
Copy link
Contributor

ildipo commented Mar 16, 2023

@icexelloss @westonpace

I think we need a slightly different solution:

  • substrait uses dataset
  • dataset uses acero
  • acero uses arrow

my plan is to introduce a libacero (from compute/exec) which depends only on libarrow and leave libsubstrait as is, make libdataset depend on libacero

@westonpace
Copy link
Member

westonpace commented Mar 16, 2023

@ildipo

Do you mean "substrait uses dataset AND acero"? Then I agree.

@ildipo
Copy link
Contributor

ildipo commented Mar 17, 2023

yes that is correct - it uses both

@icexelloss
Copy link
Contributor

icexelloss commented Mar 17, 2023

@ildipo where does substrait uses dataset? (This is slightly surprising to me because I thought substrait code mostly just read in substrait plan protobuf and turn into an Acero declaration)

icexelloss pushed a commit that referenced this issue Mar 18, 2023
…rrow (#34575)

### Rationale for this change

This is the second and last block of refactoring before moving acero out
of libarrow. This PR removes the remaining dependencies from libarrow
into the compute/exec directory.

### What changes are included in this PR?

In detail:

* compute/exec/expression is moved inside compute since it is needed in
several kernels
* all compute/exec specific include files are removed from compute/api.h
* light_array_exec_test.cc i(introduced in the previous refactoring) s
merged back into light_array_test
* testing generators are refactored to remove the need for
exec/options.h
* some utilities are moved from compute/exec/util to compute/util
* updated python and r libraries usage of moved files

### Are these changes tested?

Yes using existing tests (some of which have been updated due to
refactoring).

### Are there any user-facing changes?

no
* Issue: #15280
* Closes: #34630
@westonpace
Copy link
Member

@ildipo where does substrait uses dataset? (This is slightly surprising to me because I thought substrait code mostly just read in substrait plan protobuf and turn into an Acero declaration)

@icexelloss the scan and write nodes are provided by datasets. You can use source (e.g. an in-memory source) with plain Acero but if you want to scan files you need datasets. Substrait's ReadRel when LocalFiles will use a scan node.

rtpsw pushed a commit to rtpsw/arrow that referenced this issue Mar 27, 2023
…f libarrow (apache#34575)

### Rationale for this change

This is the second and last block of refactoring before moving acero out
of libarrow. This PR removes the remaining dependencies from libarrow
into the compute/exec directory.

### What changes are included in this PR?

In detail:

* compute/exec/expression is moved inside compute since it is needed in
several kernels
* all compute/exec specific include files are removed from compute/api.h
* light_array_exec_test.cc i(introduced in the previous refactoring) s
merged back into light_array_test
* testing generators are refactored to remove the need for
exec/options.h
* some utilities are moved from compute/exec/util to compute/util
* updated python and r libraries usage of moved files

### Are these changes tested?

Yes using existing tests (some of which have been updated due to
refactoring).

### Are there any user-facing changes?

no
* Issue: apache#15280
* Closes: apache#34630
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 30, 2023

The current PR #34711 creates a src/arrow/acero (so still part of the arrow namespace), adn not to src/acero. Both have been mentioned above, I don't know if there was a clear conclusion on that question? (I don't have a strong opinion myself)

@wjones127
Copy link
Member Author

Agree that wasn't what I initially hoped, but I think the argument for doing it like this is:

  1. We don't want to move libarrow_dataset to libacero_dataset, which causes more breaking changes for users than strictly necessary
  2. We have no intention of having Acero usable without Arrow (unlike the Parquet codebase, which theoretically should be splittable into its own repo if we ever want to).

@icexelloss
Copy link
Contributor

icexelloss commented Mar 30, 2023

I agree with Will here that Acero is still better to live under the Arrow umbrella since there are internal dependency (e.g., dataset depend on acero, arrow R dependency)

icexelloss added a commit that referenced this issue Apr 1, 2023
…g previously in compute/exec (#34711)

### Rationale for this change

See the linked issue

### What changes are included in this PR?

C++:
* remove all compute/exec/* from libarrow
* rename compute/exec -> acero and make libarrow_acero
* add new ARROW_ACERO option, required if ARROW_DATASET is on
* libarrow_dataset now depends on libarrow_acero

c_glib: add the new libarrow_acero dependency - we disallow building glib without it

python: added PYARROW_BUILD_ACERO, set to on if DATASETS are built

### Are these changes tested?

All the standard tests do work properly.

I manually compiled C++ with:
* no ARROW_ACERO
* ARROW_ACERO and no ARROW_DATASET
* ARROW_ACERO and ARROW_DATASET and no ARROW_SUBSTRAIT 

I manually compiled python without ACERO & DATASET and with ACERO and without DATASET

### Are there any user-facing changes?

If users include compute/exec files directly then they'll have to update their code.

* Closes: #15280

Lead-authored-by: Davide Pasetto <dpasetto69@gmail.com>
Co-authored-by: Li Jin <ice.xelloss@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Li Jin <ice.xelloss@gmail.com>
@raulcd raulcd added this to the 12.0.0 milestone Apr 18, 2023
@wjones127 wjones127 added the Breaking Change Includes a breaking change to the API label Apr 26, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…rything previously in compute/exec (apache#34711)

### Rationale for this change

See the linked issue

### What changes are included in this PR?

C++:
* remove all compute/exec/* from libarrow
* rename compute/exec -> acero and make libarrow_acero
* add new ARROW_ACERO option, required if ARROW_DATASET is on
* libarrow_dataset now depends on libarrow_acero

c_glib: add the new libarrow_acero dependency - we disallow building glib without it

python: added PYARROW_BUILD_ACERO, set to on if DATASETS are built

### Are these changes tested?

All the standard tests do work properly.

I manually compiled C++ with:
* no ARROW_ACERO
* ARROW_ACERO and no ARROW_DATASET
* ARROW_ACERO and ARROW_DATASET and no ARROW_SUBSTRAIT 

I manually compiled python without ACERO & DATASET and with ACERO and without DATASET

### Are there any user-facing changes?

If users include compute/exec files directly then they'll have to update their code.

* Closes: apache#15280

Lead-authored-by: Davide Pasetto <dpasetto69@gmail.com>
Co-authored-by: Li Jin <ice.xelloss@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Li Jin <ice.xelloss@gmail.com>
mpimenov added a commit to mpimenov/conan-center-index that referenced this issue May 24, 2023
I think we got this part wrong in conan-io#17540
I'm not sure if this name stays (see apache/arrow#15280) but currently
the file being created is libarrow_acero.a

I'm not sure what to do with names of other components, such as pkg_config: for arrow_substrait
we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make
the smallest possible diff that I am sure about.
mpimenov added a commit to mpimenov/conan-center-index that referenced this issue May 29, 2023
I think we got this part wrong in conan-io#17540
I'm not sure if this name stays (see apache/arrow#15280) but currently
the file being created is libarrow_acero.a

I'm not sure what to do with names of other components, such as pkg_config: for arrow_substrait
we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make
the smallest possible diff that I am sure about.
mpimenov added a commit to mpimenov/conan-center-index that referenced this issue May 30, 2023
I think we got this part wrong in conan-io#17540
It's not clear if this name stays (see apache/arrow#15280) but currently
the file being created is libarrow_acero.a

I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait
we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make
the smallest possible diff that I am sure about.
conan-center-bot pushed a commit to conan-io/conan-center-index that referenced this issue Jun 29, 2023
I think we got this part wrong in #17540
It's not clear if this name stays (see apache/arrow#15280) but currently
the file being created is libarrow_acero.a

I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait
we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make
the smallest possible diff that I am sure about.

Co-authored-by: James <james@conan.io>
tannerbitz pushed a commit to tannerbitz/conan-center-index that referenced this issue Jul 8, 2023
I think we got this part wrong in conan-io#17540
It's not clear if this name stays (see apache/arrow#15280) but currently
the file being created is libarrow_acero.a

I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait
we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make
the smallest possible diff that I am sure about.

Co-authored-by: James <james@conan.io>
pezy pushed a commit to pezy/conan-center-index that referenced this issue Jul 15, 2023
I think we got this part wrong in conan-io#17540
It's not clear if this name stays (see apache/arrow#15280) but currently
the file being created is libarrow_acero.a

I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait
we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make
the smallest possible diff that I am sure about.

Co-authored-by: James <james@conan.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment