[C++] Support joining tables with non-key fields as list #32504
Comments
Carlos Maltzahn: |
Weston Pace / @westonpace: As of the 9.0.0 release (still pending) there are two implementations of hash-join. The basic implementation (HashJoinImpl) is backed by std::unordered_map and can be found in src/arrow/compute/exec/hash_join.h. A newer version (SwissJoin) extends HashJoinImpl, is backed by a custom hash map, and is found in src/arrow/compute/exec/swiss_join.h. I'd recommend testing and adding support to the newer version, as the work required is going to be similar between the two. Note that the basic version supports dictionary types but the newer version does not (we just fall back to the basic version if needed), so that is an option if the newer version proves to be trouble.

Support for types here is mostly gated by support for some of the alternate views/encodings used by the hash join. One of these is a non-owning ArrayData view called KeyColumnArray, which is in src/arrow/compute/light_array.h. This view does not currently support nested data. Note that ArraySpan is pretty similar (see ARROW-17257) and does support nested types (I think), so maybe it makes sense to tackle ARROW-17257 as part of this.

The second significant thing is RowTableImpl in src/arrow/compute/row/row_internal.h. This implements a row-major encoding for Arrow data. During the hash-join operation, the build data is placed into a table in this row-major form. Then, during materialization, it is converted back to a column-major form.

On top of those two key elements there are a number of other utilities like ExecBatchBuilder, RowArray (which should maybe be renamed to RowTable), RowArrayAccessor, RowArrayMerge, the hashing utilities themselves (there are two versions of these too; I'm pretty sure the older implementation uses arrow/util/hashing.h and I know the newer version uses arrow/compute/exec/key_hash.h), etc.

So I would probably start by looking at the unit tests that exist for those utilities and encodings (this reminded me that I had some unit tests I had forgotten to push for ARROW-17022, so I will try and get those up today) and try to get these utilities working with nested types. Some of these utilities could probably use some more unit tests too. Once the utilities are working with nested types you can enable them for the join itself and see what breaks.

CC @michalursa and @save-buffer as they are more knowledgeable in this area and might have some additional input / advice.
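To make the build / materialize flow described above concrete, here is a toy sketch in plain Python. It is only a conceptual illustration and does not reflect how HashJoinImpl, SwissJoin, or RowTableImpl are actually written: the function names (to_rows, to_columns, hash_join) are hypothetical, tables are modeled as dicts of equal-length Python lists, and a dict stands in for the hash map. The point it shows is that list-valued payload columns can ride along in the row-major form as long as the row encoding knows how to store them.

```python
from collections import defaultdict


def to_rows(columns):
    """Convert a column-major table (dict of equal-length lists) to row tuples."""
    names = list(columns)
    return names, list(zip(*(columns[n] for n in names)))


def to_columns(names, rows):
    """Convert row tuples back to a column-major dict of lists."""
    cols = {n: [] for n in names}
    for row in rows:
        for name, value in zip(names, row):
            cols[name].append(value)
    return cols


def hash_join(build, probe, keys):
    """Toy inner hash join; non-key columns may hold arbitrary values (e.g. lists)."""
    # Build phase: store the build side in row-major form, indexed by key tuple.
    b_names, b_rows = to_rows(build)
    b_key_pos = [b_names.index(k) for k in keys]
    table = defaultdict(list)
    for row in b_rows:
        table[tuple(row[i] for i in b_key_pos)].append(row)

    # Probe phase: look up each probe row and emit one output row per match.
    p_names, p_rows = to_rows(probe)
    p_key_pos = [p_names.index(k) for k in keys]
    p_payload_pos = [i for i, n in enumerate(p_names) if n not in keys]
    out_names = b_names + [p_names[i] for i in p_payload_pos]
    out_rows = []
    for row in p_rows:
        key = tuple(row[i] for i in p_key_pos)
        for match in table.get(key, ()):
            out_rows.append(match + tuple(row[i] for i in p_payload_pos))

    # Materialization: convert the matched rows back to column-major form.
    return to_columns(out_names, out_rows)
```

In this sketch only the key columns are ever hashed or compared, which mirrors the situation in the issue: the list columns are pure payload, so the missing piece is storing and copying them, not hashing them.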
Jayjeet Chakraborty / @JayjeetAtGithub: |
Aldrin Montana / @drin:
|
Aldrin Montana / @drin: |
I am trying to join two Arrow tables in which some columns have the list<float> data type. My join keys are all primitive data types; only some of my non-key columns are list<float>. PyArrow's join() cannot join such tables, although pandas can. When I execute
joined_table = table_1.join(table_2, ['k1', 'k2', 'k3'])
it fails with
ArrowInvalid: Data type list<item: float> is not supported in join non-key field
A Stack Overflow response pointed out that Arrow currently cannot handle non-fixed types for joins. Can this be fixed? Or is this intentional?
Reporter: Jayjeet Chakraborty / @JayjeetAtGithub
Related issues:
Note: This issue was originally created as ARROW-17216. Please see the migration documentation for further details.