How to iterate over a record batch? #468

Open
tisonkun opened this issue Dec 26, 2024 · 4 comments
Labels
Type: usage usage question

Comments

@tisonkun
Member

Describe the usage question you have. Please include as many useful details as possible.

Now I have a VectorSchemaRoot. I can see that I can iterate over the batch with getVector and then getObject.

But the return value is of type Object, and I wonder how I can downcast it to some useful class so I can retrieve the real value (string, int, float, etc.).

I know we have the field info of each vector, but I don't know the mapping from field type to the actual Java class. It seems overly challenging to remember the whole mapping by reverse engineering the code, and it may change as versions evolve.

I checked https://arrow.apache.org/docs/java/index.html, but the pages all describe constructing a batch and moving it from one place to another, rather than how to read a batch and dump it into a typed two-dimensional matrix.

The most trivial usage, contentToTSVString, calls Object::toString on each cell. But I don't think we should convert all the values to String and then re-parse them into their concrete types.
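
For concreteness, the kind of loop I mean looks roughly like this (just a sketch, assuming root is a populated VectorSchemaRoot):

```java
import org.apache.arrow.vector.VectorSchemaRoot;

static void dump(VectorSchemaRoot root) {
  for (int row = 0; row < root.getRowCount(); row++) {
    for (int col = 0; col < root.getFieldVectors().size(); col++) {
      Object value = root.getVector(col).getObject(row); // typed as Object -- what should I cast it to?
      // ...
    }
  }
}
```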

@tisonkun tisonkun added the Type: usage usage question label Dec 26, 2024
@ParthChonkar
Contributor

ParthChonkar commented Dec 29, 2024

The patterns I've seen/used revolve around down-casting/type matching on the FieldVector subclasses:

  1. Downcast the returned FieldVector to the concrete class (IntVector, VarCharVector, etc.) and then use the corresponding typed get* methods on it
  2. Use the visitor pattern / method signature overloading

VectorSchemaRoot stores a schema and a corresponding bag of FieldVectors returned by getVector. The actual subclass is tied to the arrow type for that field. (This gets a bit more tricky if you are using nested types.)

For (1) you do need to know the mapping between your schema's arrow field types <-> ValueVector subclasses and have logic to cast accordingly based on the field type. This is reflected a bit in this example: the vectors have to be downcast in order to write the values properly, and the same applies to reading their values in a typed manner.
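
For example, when the schema is known up front, a minimal sketch of (1) could look like the following (the column names "id" and "name" and their types are made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;

// Sketch: the schema is assumed to have an int32 column "id" and a utf8 column "name".
static void readKnownSchema(VectorSchemaRoot root) {
  IntVector idVector = (IntVector) root.getVector("id");
  VarCharVector nameVector = (VarCharVector) root.getVector("name");

  for (int row = 0; row < root.getRowCount(); row++) {
    Integer id = idVector.isNull(row) ? null : idVector.get(row);
    String name = nameVector.isNull(row)
        ? null
        : new String(nameVector.get(row), StandardCharsets.UTF_8);
    // ... use the typed values
  }
}
```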

Agree that it's a bit tricky at first to map the simple types to their ValueVector subclasses (you basically need to look here); it would be a nice documentation addition. It looks like there's already a stub for a table with this mapping here ("Table with non-intuitive names").

As an aside, it seems safer to cast the vector to its typed subclass first rather than casting the value returned by FieldVector#getObject (so you fail loudly up front instead of accidentally coercing doubles/ints via a downcast).

For (2), this means routing each FieldVector from your VectorSchemaRoot to a method that accepts its concrete subclass, e.g. via overloaded handler methods or a visitor. Note that Java resolves overloads at compile time, so a visitor accept call or an instanceof check still has to do the runtime dispatch.
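
A minimal sketch of (2) with hand-rolled overloads (the handle/dispatch names are illustrative, not part of the Arrow API):

```java
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.Float8Vector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

// Overloads per concrete vector type (illustrative handlers).
static void handle(IntVector vector, int row)     { /* read vector.get(row) as int */ }
static void handle(Float8Vector vector, int row)  { /* read vector.get(row) as double */ }
static void handle(VarCharVector vector, int row) { /* read vector.get(row) as byte[] */ }

// Overloads are resolved at compile time, so something (instanceof checks here,
// or a visitor accept call) still has to pick the right one at runtime.
static void dispatch(FieldVector vector, int row) {
  if (vector instanceof IntVector) {
    handle((IntVector) vector, row);
  } else if (vector instanceof Float8Vector) {
    handle((Float8Vector) vector, row);
  } else if (vector instanceof VarCharVector) {
    handle((VarCharVector) vector, row);
  } else {
    throw new UnsupportedOperationException("Unhandled vector type: " + vector.getClass());
  }
}
```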

Curious what other folks' approaches are / whether there are conventions or patterns I might be missing here.

@tisonkun
Member Author

@lidavidm
Member

lidavidm commented Jan 3, 2025

The mapping should stay fixed. Unfortunately, I don't think there's a way in Java to do the kind of type-level metaprogramming we can do in the C++ library (in C++ the vector type is effectively an associated type of the... type type, so you can write typename TypeTraits<StringType>::ArrayType).
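
Since that isn't expressible in Java's type system, the closest runtime approximation is usually a switch on the vector's minor type. A minimal sketch (the readTyped helper and the handful of cases it covers are illustrative, not a library API):

```java
import java.nio.charset.StandardCharsets;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.Float8Vector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

// Illustrative helper: choose the typed read based on the vector's minor type.
static Object readTyped(FieldVector vector, int row) {
  if (vector.isNull(row)) {
    return null;
  }
  switch (vector.getMinorType()) {
    case INT:     return ((IntVector) vector).get(row);      // int
    case BIGINT:  return ((BigIntVector) vector).get(row);   // long
    case FLOAT8:  return ((Float8Vector) vector).get(row);   // double
    case VARCHAR: return new String(((VarCharVector) vector).get(row), StandardCharsets.UTF_8);
    default:
      throw new UnsupportedOperationException("No mapping for " + vector.getMinorType());
  }
}
```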

@lidavidm
Copy link
Member

lidavidm commented Jan 3, 2025

In general I would like to see several documentation improvements:

  • Replace or update the cookbook's way of handling Java recipes so we aren't embedding giant chunks of Java code inside reST (this makes them a pain to write/test/update)
  • Add more usage documentation and cookbook examples
  • Add general documentation on how the library is structured (Java doesn't follow the C++ convention of effectively dataframes, instead valuing memory management and streaming data)
  • Use the sphinx-javadoc bridge I wrote for arrow-adbc so we can cross-link Sphinx and Javadocs more easily
