Can we stop copying the Arrow schema over FFI for every batch? #1115
Comments
I refactored the code to make separate FFI calls for exporting the schema and the array, and measured the time of each:

```scala
val arrowSchema = ArrowSchema.wrap(schemaAddrs(index))
val arrowArray = ArrowArray.wrap(arrayAddrs(index))
val export = getFieldVector(valueVector, "export")

// export schema
val t1 = System.nanoTime()
Data.exportField(allocator, export.getField, provider, arrowSchema)
val t2 = System.nanoTime()

// export array
Data.exportVector(allocator, export, provider, arrowArray)
val t3 = System.nanoTime()

// scalastyle:off println
println(s"Exported schema in ${t2 - t1} ns and array in ${t3 - t2} ns")
```

Exporting the schema turns out to be more expensive than exporting the data, so this seems worth optimizing.
The schema can vary between batches due to dictionary encoding, so we may not be able to avoid serializing it each time.
Another approach could be to use the Arrow C streaming interface.
After more metrics improvements in #1133, it is clear that improving FFI performance is not a high priority for now, although moving to the Arrow C stream interface could make it more efficient.
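For context on why the C stream interface helps here: a stream hands the consumer the schema once (via a `get_schema` callback) and then returns only batch payloads from each `get_next` call, so schema serialization happens a single time regardless of batch count. Below is a minimal pure-Java mock of that contract; the `Schema`/`Batch` types and `BatchStream` class are illustrative stand-ins, not the real Arrow FFI structs.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.stream.IntStream;

// Illustrative mock of the Arrow C stream contract: one schema export,
// many batch exports. Not the real org.apache.arrow.c API.
public class StreamSketch {
    // Hypothetical stand-ins for the exported schema and batch payloads.
    record Schema(List<String> fields) {}
    record Batch(int rowCount) {}

    static class BatchStream {
        private final Schema schema;
        private final Iterator<Batch> batches;
        int schemaExports = 0;

        BatchStream(Schema schema, Iterator<Batch> batches) {
            this.schema = schema;
            this.batches = batches;
        }

        // Analogue of ArrowArrayStream.get_schema: the consumer calls this once.
        Schema getSchema() { schemaExports++; return schema; }

        // Analogue of ArrowArrayStream.get_next: empty Optional means end of stream.
        Optional<Batch> getNext() {
            return batches.hasNext() ? Optional.of(batches.next()) : Optional.empty();
        }
    }

    public static void main(String[] args) {
        Iterator<Batch> thousandBatches =
                IntStream.range(0, 1000).mapToObj(i -> new Batch(8192)).iterator();
        BatchStream stream = new BatchStream(new Schema(List.of("a", "b")), thousandBatches);
        stream.getSchema(); // schema crosses the boundary exactly once
        int n = 0;
        while (stream.getNext().isPresent()) n++;
        System.out.println("schemaExports=" + stream.schemaExports + " batches=" + n);
    }
}
```

Contrast this with the per-batch path measured above, where `Data.exportField` pays the schema cost on every iteration.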
What is the problem the feature request solves?
In CometBatchIterator we export the schema with each batch via Arrow FFI. Exporting the schema is quite expensive since it involves string copies and memory allocation, and it gets more expensive for complex schemas, especially when nested types are involved.
Internally, Data.exportVector exports the schema as well. I wonder if we could refactor CometBatchIterator to just export the schema once, with the first batch, and then have the native side reuse that schema for subsequent batches.
Describe the potential solution
No response
Additional context
No response
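The export-once idea can be sketched as a thin wrapper that runs the expensive schema export only on the first batch and hands back the cached result afterwards. This is a hypothetical illustration, not Comet's actual API: `OnceSchemaExporter` and the `exportSchema` supplier are invented names, and the dictionary-encoding caveat raised above would require cache invalidation in a real implementation.

```java
import java.util.function.Supplier;

// Hypothetical sketch: pay the schema-export cost once, then reuse the cached
// result for every later batch. Not part of Comet or Arrow.
public class OnceSchemaExporter<S> {
    private final Supplier<S> exportSchema; // the real FFI export call would go here
    private S cached;
    int exportCount = 0;

    public OnceSchemaExporter(Supplier<S> exportSchema) {
        this.exportSchema = exportSchema;
    }

    // Called per batch; only the first call performs the export.
    public S schemaForBatch() {
        if (cached == null) {
            exportCount++;
            cached = exportSchema.get();
        }
        return cached;
    }

    public static void main(String[] args) {
        OnceSchemaExporter<String> exporter =
                new OnceSchemaExporter<>(() -> "struct<a:int,b:string>");
        for (int i = 0; i < 1000; i++) exporter.schemaForBatch(); // 1000 batches
        System.out.println("exports=" + exporter.exportCount);
    }
}
```

With the benchmark numbers above showing the schema export dominating the per-batch FFI cost, amortizing it this way would leave only the array export on the hot path.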