-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Move Acero out of libarrow #15280
Comments
|
I prefer to |
What part would we want to move to a separate library? |
How usable are compute kernels now without Acero? How usable is Acero without kernels? |
Acero would then of course have to depend on libarrow (or libarrow_compute) for the kernels.
Very usable (if you are OK with the numpy paradigm of having eager functions for all operations). The current integration in pandas to work with arrow data is built on top of plain compute kernels. |
So then another option is to split further to |
I would love to see this split, although the ability to distribute it separately (e.g., so that a separate R or Python 'acero' package could exist) is something I hope this would be a step in the direction of. |
Yes, just |
The only gray area I can see is group-by. There is currently a standalone group-by implementation that runs outside of Acero, that is marked internal, that I am trying to get rid of. My vote would be that "group-by" is a part of Acero. Note, the If that's the case then you could also move
Both datasets and substrait will need to depend on libacero. It would be a bit weird to have |
Yeah that's why I was contemplating renaming |
I spent some time looking at the necessary changes, and it occured to me with have a bit of a mess with namespaces as well. We introduced a library called Although I'd like to have this issue somewhat limited in scope, it would be nice to know where we are trying to go with the library structure. I created a document outlining the current structure, an a couple ideas of a slightly more sane structure: https://docs.google.com/document/d/1IcQfqFlvMgM3GreNbAvax0_2XKFHRdmcs9nqTPQlhOs/edit?usp=sharing I welcome any comments or alternative suggestions. |
Thanks for the detailed analysis and the suggestions. I'm a little hesitant about lumping "datasets" with Acero. Datasets is somewhat heavyweight, it requires parquet, orc, csv, filesystems, etc. None of those things are relevant to the core Acero. One could imagine Acero being used without any need for file I/O. For example, flight input, apply some compute operations, flight output. So my preference would be second chart you drew that has libacero and libacero_dataset. |
For what it's worth, as a Acero user I currently don't use dataset and don't see myself using it in the near future. So I am +1 not putting dataset into libacero and have a seperate so files for it. |
CC @benibus |
Since at TS we need this I'll be doing this refactoring with the solution n2 in the google doc. |
…oving acero out of libarrow (#34518) ### Rationale for this change In order to remove acero from libarrow we need to reorganize the code to reduce entanglement between compute/exec and the rest of the product by removing any dependency from the codebase (except dataset) into compute/exec. ### What changes are included in this PR? This PR removes some of the dependencies: * some of the functions in `compute/exec/util.[h,cc]` that are widely used are moved to a new `compute/util.[h,cc] `file * `key_map` and `key_hash` are widely used and thus moved from compute/exec to compute * some light_array tests do use functionalities inside `compute/exec`, so the tests are split into 2 parts, one inside `compute` and another inside `compute/exec` ### Are these changes tested? The existing tests cover all these changes. One additional test program has been added by splitting light_array_test in two parts. ### Are there any user-facing changes? no * Closes: #15280
The above PR doesn't not close this issue - only a very first step towards it |
I think we need a slightly different solution:
my plan is to introduce a libacero (from compute/exec) which depends only on libarrow and leave libsubstrait as is, make libdataset depend on libacero |
Do you mean "substrait uses dataset AND acero"? Then I agree. |
yes that is correct - it uses both |
@ildipo where does substrait uses dataset? (This is slightly surprising to me because I thought substrait code mostly just read in substrait plan protobuf and turn into an Acero declaration) |
…rrow (#34575) ### Rationale for this change This is the second and last block of refactoring before moving acero out of libarrow. This PR removes the remaining dependencies from libarrow into the compute/exec directory. ### What changes are included in this PR? In detail: * compute/exec/expression is moved inside compute since it is needed in several kernels * all compute/exec specific include files are removed from compute/api.h * light_array_exec_test.cc i(introduced in the previous refactoring) s merged back into light_array_test * testing generators are refactored to remove the need for exec/options.h * some utilities are moved from compute/exec/util to compute/util * updated python and r libraries usage of moved files ### Are these changes tested? Yes using existing tests (some of which have been updated due to refactoring). ### Are there any user-facing changes? no * Issue: #15280 * Closes: #34630
@icexelloss the |
…f libarrow (apache#34575) ### Rationale for this change This is the second and last block of refactoring before moving acero out of libarrow. This PR removes the remaining dependencies from libarrow into the compute/exec directory. ### What changes are included in this PR? In detail: * compute/exec/expression is moved inside compute since it is needed in several kernels * all compute/exec specific include files are removed from compute/api.h * light_array_exec_test.cc i(introduced in the previous refactoring) s merged back into light_array_test * testing generators are refactored to remove the need for exec/options.h * some utilities are moved from compute/exec/util to compute/util * updated python and r libraries usage of moved files ### Are these changes tested? Yes using existing tests (some of which have been updated due to refactoring). ### Are there any user-facing changes? no * Issue: apache#15280 * Closes: apache#34630
The current PR #34711 creates a |
Agree that wasn't what I initially hoped, but I think the argument for doing it like this is:
|
I agree with Will here that Acero is still better to live under the Arrow umbrella since there are internal dependency (e.g., dataset depend on acero, arrow R dependency) |
…g previously in compute/exec (#34711) ### Rationale for this change See the linked issue ### What changes are included in this PR? C++: * remove all compute/exec/* from libarrow * rename compute/exec -> acero and make libarrow_acero * add new ARROW_ACERO option, required if ARROW_DATASET is on * libarrow_dataset now depends on libarrow_acero c_glib: add the new libarrow_acero dependency - we disallow building glib without it python: added PYARROW_BUILD_ACERO, set to on if DATASETS are built ### Are these changes tested? All the standard tests do work properly. I manually compiled C++ with: * no ARROW_ACERO * ARROW_ACERO and no ARROW_DATASET * ARROW_ACERO and ARROW_DATASET and no ARROW_SUBSTRAIT I manually compiled python without ACERO & DATASET and with ACERO and without DATASET ### Are there any user-facing changes? If users include compute/exec files directly then they'll have to update their code. * Closes: #15280 Lead-authored-by: Davide Pasetto <dpasetto69@gmail.com> Co-authored-by: Li Jin <ice.xelloss@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Li Jin <ice.xelloss@gmail.com>
…rything previously in compute/exec (apache#34711) ### Rationale for this change See the linked issue ### What changes are included in this PR? C++: * remove all compute/exec/* from libarrow * rename compute/exec -> acero and make libarrow_acero * add new ARROW_ACERO option, required if ARROW_DATASET is on * libarrow_dataset now depends on libarrow_acero c_glib: add the new libarrow_acero dependency - we disallow building glib without it python: added PYARROW_BUILD_ACERO, set to on if DATASETS are built ### Are these changes tested? All the standard tests do work properly. I manually compiled C++ with: * no ARROW_ACERO * ARROW_ACERO and no ARROW_DATASET * ARROW_ACERO and ARROW_DATASET and no ARROW_SUBSTRAIT I manually compiled python without ACERO & DATASET and with ACERO and without DATASET ### Are there any user-facing changes? If users include compute/exec files directly then they'll have to update their code. * Closes: apache#15280 Lead-authored-by: Davide Pasetto <dpasetto69@gmail.com> Co-authored-by: Li Jin <ice.xelloss@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Li Jin <ice.xelloss@gmail.com>
I think we got this part wrong in conan-io#17540 I'm not sure if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not sure what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about.
I think we got this part wrong in conan-io#17540 I'm not sure if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not sure what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about.
I think we got this part wrong in conan-io#17540 It's not clear if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about.
I think we got this part wrong in #17540 It's not clear if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about. Co-authored-by: James <james@conan.io>
I think we got this part wrong in conan-io#17540 It's not clear if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about. Co-authored-by: James <james@conan.io>
I think we got this part wrong in conan-io#17540 It's not clear if this name stays (see apache/arrow#15280) but currently the file being created is libarrow_acero.a I'm not certain what to do with names of other components, such as pkg_config: for arrow_substrait we prepend "arrow_" everywhere, for "arrow_flight_sql" we do not. I've chosen to make the smallest possible diff that I am sure about. Co-authored-by: James <james@conan.io>
Describe the enhancement requested
libarrow_acero
?libacero
?The core libarrow is largely stable while Acero remains experimental. It would be nice to eventually let these libraries evolve somewhat independently, like arrow-rs and datafusion.
Component(s)
C++
The text was updated successfully, but these errors were encountered: