How to iterate over a record batch? #468

Open
tisonkun opened this issue Dec 26, 2024 · 4 comments
Labels
Type: usage usage question

Comments

@tisonkun
Member

Describe the usage question you have. Please include as many useful details as possible.

Now I have a VectorSchemaRoot. I can see that I can iterate over the batch with getVector and then getObject.

But the return value is of type Object, and I wonder how I can downcast it to some useful class so I can retrieve the real value (string, int, float, etc.).

I know we have the field info of each vector, but I don't know the mapping from field type to the actual Java class. It seems overly challenging to remember the whole mapping by reverse engineering the code, and it may change as versions evolve.

I checked https://arrow.apache.org/docs/java/index.html, but the pages all describe constructing a batch and moving it from one place to another, rather than how to read a batch and dump it into a typed two-dimensional matrix.

The most trivial usage, contentToTSVString, calls Object::toString on each cell. But I don't think we should convert all the values to String and then re-parse them into their concrete types.
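
For concreteness, the kind of loop I mean looks roughly like this (just a sketch, assuming root is a populated VectorSchemaRoot):

```java
import org.apache.arrow.vector.VectorSchemaRoot;

static void dump(VectorSchemaRoot root) {
  for (int row = 0; row < root.getRowCount(); row++) {
    for (int col = 0; col < root.getFieldVectors().size(); col++) {
      Object value = root.getVector(col).getObject(row); // typed as Object -- what should I cast it to?
      // ...
    }
  }
}
```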

@tisonkun tisonkun added the Type: usage usage question label Dec 26, 2024
@ParthChonkar
Contributor

ParthChonkar commented Dec 29, 2024

The patterns I've seen/used revolve around down-casting/type matching on the FieldVector subclasses:

  1. Downcast the returned FieldVector to the concrete class (IntVector, VarCharVector, etc.) and then use the corresponding typed get* methods on it
  2. Use the visitor pattern / method signature overloading

VectorSchemaRoot stores a schema and a corresponding bag of FieldVectors returned by getVector. The actual subclass is tied to the arrow type for that field. (This gets a bit more tricky if you are using nested types.)

For (1) you do need to know the mapping between your schema's arrow field types <-> ValueVector subclasses and have logic to cast accordingly based on the field type. This is reflected a bit in this example: the vectors have to be downcast in order to write the values properly, and the same applies to reading their values in a typed manner.
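
For example, when the schema is known up front, a minimal sketch of (1) could look like the following (the column names "id" and "name" and their types are made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;

// Sketch: the schema is assumed to have an int32 column "id" and a utf8 column "name".
static void readKnownSchema(VectorSchemaRoot root) {
  IntVector idVector = (IntVector) root.getVector("id");
  VarCharVector nameVector = (VarCharVector) root.getVector("name");

  for (int row = 0; row < root.getRowCount(); row++) {
    Integer id = idVector.isNull(row) ? null : idVector.get(row);
    String name = nameVector.isNull(row)
        ? null
        : new String(nameVector.get(row), StandardCharsets.UTF_8);
    // ... use the typed values
  }
}
```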

Agree that it's a bit tricky at first to map the simple types to their ValueVector subclasses (you basically need to look here); it would be a nice documentation addition. It looks like there's already a stub for a table with this mapping here ("Table with non-intuitive names").

As an aside, it seems safer to cast the vector to its typed subclass first rather than casting the value returned by FieldVector#getObject (so you fail loudly up front instead of accidentally coercing doubles/ints via a downcast).

For (2), this means routing each FieldVector from your VectorSchemaRoot to a method that accepts its concrete subclass, e.g. via overloaded handler methods or a visitor. Note that Java resolves overloads at compile time, so a visitor accept call or an instanceof check still has to do the runtime dispatch.
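
A minimal sketch of (2) with hand-rolled overloads (the handle/dispatch names are illustrative, not part of the Arrow API):

```java
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.Float8Vector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

// Overloads per concrete vector type (illustrative handlers).
static void handle(IntVector vector, int row)     { /* read vector.get(row) as int */ }
static void handle(Float8Vector vector, int row)  { /* read vector.get(row) as double */ }
static void handle(VarCharVector vector, int row) { /* read vector.get(row) as byte[] */ }

// Overloads are resolved at compile time, so something (instanceof checks here,
// or a visitor accept call) still has to pick the right one at runtime.
static void dispatch(FieldVector vector, int row) {
  if (vector instanceof IntVector) {
    handle((IntVector) vector, row);
  } else if (vector instanceof Float8Vector) {
    handle((Float8Vector) vector, row);
  } else if (vector instanceof VarCharVector) {
    handle((VarCharVector) vector, row);
  } else {
    throw new UnsupportedOperationException("Unhandled vector type: " + vector.getClass());
  }
}
```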

Curious what other folks' approaches are / whether there are conventions or patterns I might be missing here.

@tisonkun
Member Author

@lidavidm
Member

lidavidm commented Jan 3, 2025

The mapping should stay fixed. Unfortunately, I don't think there's a way in Java to do the kind of type-level metaprogramming we can do in the C++ library (in C++ the vector type is effectively an associated type of the... type type, so you can write typename TypeTraits<StringType>::ArrayType).
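
Since that isn't expressible in Java's type system, the closest runtime approximation is usually a switch on the vector's minor type. A minimal sketch (the readTyped helper and the handful of cases it covers are illustrative, not a library API):

```java
import java.nio.charset.StandardCharsets;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.Float8Vector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

// Illustrative helper: choose the typed read based on the vector's minor type.
static Object readTyped(FieldVector vector, int row) {
  if (vector.isNull(row)) {
    return null;
  }
  switch (vector.getMinorType()) {
    case INT:     return ((IntVector) vector).get(row);      // int
    case BIGINT:  return ((BigIntVector) vector).get(row);   // long
    case FLOAT8:  return ((Float8Vector) vector).get(row);   // double
    case VARCHAR: return new String(((VarCharVector) vector).get(row), StandardCharsets.UTF_8);
    default:
      throw new UnsupportedOperationException("No mapping for " + vector.getMinorType());
  }
}
```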

@lidavidm
Copy link
Member

lidavidm commented Jan 3, 2025

In general I would like to see several documentation improvements:

  • Replace or update the cookbook's way of handling Java recipes so we aren't embedding giant chunks of Java code inside reST (this makes them a pain to write/test/update)
  • Add more usage documentation and cookbook examples
  • Add general documentation on how the library is structured (Java doesn't follow the C++ convention of effectively dataframes, instead valuing memory management and streaming data)
  • Use the sphinx-javadoc bridge I wrote for arrow-adbc so we can cross-link Sphinx and Javadocs more easily
