Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Support joining tables with non-key fields as list #32504

Open
asfimport opened this issue Jul 26, 2022 · 5 comments
Open

[C++] Support joining tables with non-key fields as list #32504

asfimport opened this issue Jul 26, 2022 · 5 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Jul 26, 2022

I am trying to join 2 Arrow tables where some columns are of list<float> data type. Note that my join columns/keys are primitive data types and some my non-join columns/keys are of {}list<float>{}. But, PyArrow join() cannot join such as table, although pandas can. It says

ArrowInvalid: Data type list<item: float> is not supported in join non-key field

when I execute this piece of code

joined_table = table_1.join(table_2, ['k1', 'k2', 'k3'])

A stackoverflow response pointed out that Arrow currently cannot handle non-fixed types for joins. Can this be fixed ? Or is this intentional ?

Reporter: Jayjeet Chakraborty / @JayjeetAtGithub

Related issues:

Note: This issue was originally created as ARROW-17216. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Carlos Maltzahn:
@JayjeetAtGithub and I are willing to help implement support for joining tables with lists in non-key values. But we might need some help on where to start.

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
That'd be great. The starting point would be src/arrow/compute/exec/hash_join_node.cc. This is where you'll find the check itself that is currently failing, but this is not where most of the join logic lives. Fair warning: the hash-join node has been a bit of a staging ground for performance-critical arrow compute and so it relies on a number of utilities not used elsewhere. As such, this node has a pretty high learning curve at the moment (though my hope is that is more diffusely spread throughout the engine in the future).

As of the 9.0.0 release (still pending) there are two implementations of hash-join. The basic implementation (HashJoinImpl) is backed by std::unordered_map and can be found in src/arrow/compute/exec/hash_join.h. A newer version (SwissJoin) extends HashJoinImpl and is backed by a custom hash map and is found in src/arrow/compute/exec/swiss_join.h. I'd recommend testing and adding support to the newer version as the work required is going to be similar between the two. Note that the basic version supports dictionary types but not the newer version (and we just fall back to the basic version if needed) so that is an option if the newer version proves to be trouble.

Support for types here is mostly gated by support for some of the alternate views/encodings used by the hash join. One of these is a non-owning arraydata view called KeyColumnArray which is in src/arrow/compute/light_array.h. This view does not currently supported nested data. Note that ArraySpan is pretty similar (see ARROW-17257) and does support nested types (I think) so maybe it makes sense to tackle ARROW-17257 as part of this.

The second significant thing is RowTableImpl in src/arrow/compute/row/row_internal.h. This implements a row-major encoding for Arrow data. During the hash-join operation, the build data is placed into a table in this row-major form. Then, during materialization, it is converted back to a column-major form.

On top of those two key elements there are a number of other utilities like ExecBatchBuilder, RowArray (which should maybe be renamed to RowTable), RowArrayAccessor, RowArrayMerge, the hashing utilities themselves (there are two versions of this too, I'm pretty sure the older implementation uses arrow/util/hashing.h and I know the newer version uses arrow/compute/exec/key_hash.h), etc.

So I would probably start by looking at the unit tests that exists for those utilities encodings (this reminded me that I had some unit tests I had forgotten to push for ARROW-17022 so I will try and get those up today) and try to get these utilities working with nested types. Some of these utilities could probably also use some more unit tests too. Once the utilities are working with nested types you can enable them for the join itself and see what breaks.

CC @michalursa and @save-buffer as they are more knowledgeable in this area and might have some additional input / advice.

@asfimport
Copy link
Collaborator Author

Jayjeet Chakraborty / @JayjeetAtGithub:
Thanks a lot for the detailed pointers @westonpace.

@asfimport
Copy link
Collaborator Author

Aldrin Montana / @drin:
I am currently working on scalar hash functions, which might help to understand the hashing interfaces:

  • A way to interface with hashing functions from key_hash.h: scalar_hash.cc#L119

  • A way to interface with hashing function from hashing.h: scalar_hash.cc#L160

    As for ArraySpan vs KeyColumnArray, they're both ultimately views into the same buffers (from ArrayData). I don't see anything that makes it difficult to support nested types in either, but I don't see anything that explicitly supports nested types (unless ArraySpan does so via some nuanced templates). I came across this issue while looking to see if anyone was working on convenient nested type support for ArraySpan, since it currently is a view into ArrayData buffers, but I don't see convenient functions in the spirit of Array level functions such as StructArray::field().

    I will hopefully make more headway into this tomorrow. So, at least as far as nested type support via ArraySpan and/or KeyColumnArray I might be able to put together some sort of tutorial-style cookbook on how to access the data appropriately for nested data types.

@asfimport
Copy link
Collaborator Author

Aldrin Montana / @drin:
Okay, re-visiting this: I commented on ARROW-17257 about how ARROW-8991 is related, and for now KeyColumnArray is essentially a flattened array. In ARROW-8991, I will have code for flattening arrays to be able to hash them. I think once that is done, perhaps it can be used to support this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant