
Can we stop copying the Arrow schema over FFI for every batch? #1115

Closed
andygrove opened this issue Nov 23, 2024 · 5 comments

@andygrove
Member

What is the problem the feature request solves?

In CometBatchIterator we export the schema with each batch via Arrow FFI:

          val arrowSchema = ArrowSchema.wrap(schemaAddrs(index))
          val arrowArray = ArrowArray.wrap(arrayAddrs(index))
          Data.exportVector(
            allocator,
            getFieldVector(valueVector, "export"),
            provider,
            arrowArray,
            arrowSchema)

Exporting the schema seems quite expensive, since it involves string copies and memory allocations, and the cost grows for complex schemas, especially when nested types are involved.

Internally in Data.exportVector, the schema is exported with:

exportField(allocator, vector.getField(), provider, outSchema);

I wonder if we could refactor CometBatchIterator to just export the schema once, with the first batch, and then have the native side re-use that schema for subsequent batches.
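
A rough sketch of what that could look like on the JVM side, assuming the native side caches the schema it imports with the first batch; the schemaExported flag and the export-once protocol are assumptions, not existing Comet code:

    // Hypothetical refactor: export the schema once, with the first batch.
    // The native side would cache the imported schema and reuse it for
    // subsequent batches, which then carry only an ArrowArray.
    private var schemaExported = false

    def exportBatch(index: Int): Unit = {
      val export = getFieldVector(valueVector, "export")
      if (!schemaExported) {
        val arrowSchema = ArrowSchema.wrap(schemaAddrs(index))
        Data.exportField(allocator, export.getField, provider, arrowSchema)
        schemaExported = true
      }
      // The four-argument overload of Data.exportVector writes only the array.
      val arrowArray = ArrowArray.wrap(arrayAddrs(index))
      Data.exportVector(allocator, export, provider, arrowArray)
    }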

Describe the potential solution

No response

Additional context

No response

@andygrove andygrove added enhancement New feature or request performance labels Nov 23, 2024
@andygrove andygrove added this to the 0.5.0 milestone Nov 23, 2024
@andygrove
Member Author

I refactored the code to make separate FFI calls for exporting the schema and the array, and measured the time of each:

          val arrowSchema = ArrowSchema.wrap(schemaAddrs(index))
          val arrowArray = ArrowArray.wrap(arrayAddrs(index))
          val export = getFieldVector(valueVector, "export")
          // export schema
          val t1 = System.nanoTime()
          Data.exportField(allocator, export.getField, provider, arrowSchema)
          val t2 = System.nanoTime()
          // export array
          Data.exportVector(allocator, export, provider, arrowArray)
          val t3 = System.nanoTime()
          // scalastyle:off println
          println(s"Exported schema in ${t2 - t1} ns and array in ${t3 - t2} ns")

Exporting the schema seems to be more expensive than exporting the data. It seems like this would be worth optimizing.

Exported schema in 1773 ns and array in 421 ns
Exported schema in 1402 ns and array in 401 ns
Exported schema in 1704 ns and array in 891 ns
Exported schema in 1884 ns and array in 351 ns
Exported schema in 1232 ns and array in 1072 ns
Exported schema in 1553 ns and array in 450 ns
Exported schema in 1923 ns and array in 1032 ns
Exported schema in 1392 ns and array in 1663 ns
Exported schema in 481 ns and array in 561 ns
Exported schema in 1022 ns and array in 551 ns
Exported schema in 1082 ns and array in 942 ns

@andygrove
Member Author

The schema can vary between batches due to dictionary encoding, so we may not be able to avoid exporting it each time.
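
If the schema only changes occasionally (for example, when a column switches between dictionary-encoded and plain representations), a middle ground might be to compare the batch's Field against the previous one and re-export only on change. A rough sketch; lastField and the boolean "new schema" signal to the native side are hypothetical:

    // Field.equals compares name, type, nullability, dictionary encoding,
    // and children, so it should detect the dictionary-related variation
    // described above.
    private var lastField: Field = _

    def maybeExportSchema(vector: FieldVector, schemaAddr: Long): Boolean = {
      val field = vector.getField
      if (field != lastField) {
        Data.exportField(allocator, field, provider, ArrowSchema.wrap(schemaAddr))
        lastField = field
        true // tell the native side that a fresh schema was written
      } else {
        false
      }
    }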

@andygrove
Member Author

Another approach could be to use the Arrow C stream interface.
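
With the C stream interface, the schema crosses FFI once via the stream's get_schema callback, and each get_next call moves only an array. Arrow Java can export a whole ArrowReader as a stream; a rough sketch, where the CometArrowReader adapter over Comet's batch iterator is hypothetical:

    import org.apache.arrow.c.{ArrowArrayStream, Data}
    import org.apache.arrow.memory.BufferAllocator
    import org.apache.arrow.vector.ipc.ArrowReader
    import org.apache.arrow.vector.types.pojo.Schema

    // Hypothetical adapter: expose Comet's batches through Arrow's reader API
    // so the whole stream can be handed to the native side in one FFI call.
    class CometArrowReader(allocator: BufferAllocator) extends ArrowReader(allocator) {
      override def loadNextBatch(): Boolean = ??? // load next Comet batch into the VectorSchemaRoot
      override def bytesRead(): Long = 0L
      override protected def closeReadSource(): Unit = {}
      override protected def readSchema(): Schema = ??? // stream-wide schema, exported once
    }

    def exportStream(allocator: BufferAllocator, streamAddr: Long): Unit = {
      // streamAddr is assumed to point at an ArrowArrayStream struct allocated
      // by the native side, which can import it with
      // arrow::ffi_stream::ArrowArrayStreamReader.
      val stream = ArrowArrayStream.wrap(streamAddr)
      Data.exportArrayStream(allocator, new CometArrowReader(allocator), stream)
    }

One caveat: the stream interface fixes the schema up front, so the dictionary-driven schema variation noted above would still need to be resolved before this helps.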

@andygrove
Member Author

Spark has an ArrowBatchStreamWriter, but I don't think it supports dictionaries.

@andygrove andygrove self-assigned this Nov 30, 2024
@andygrove
Member Author

After more metrics improvements in #1133, it is clear that improving FFI performance is not a high priority for now, although moving to the Arrow C stream interface could make it more efficient.
