
Can we stop copying the Arrow schema over FFI for every batch? #1115

Closed
andygrove opened this issue Nov 23, 2024 · 5 comments

@andygrove
Member

What is the problem the feature request solves?

In CometBatchIterator we export the schema with each batch via Arrow FFI:

          val arrowSchema = ArrowSchema.wrap(schemaAddrs(index))
          val arrowArray = ArrowArray.wrap(arrayAddrs(index))
          Data.exportVector(
            allocator,
            getFieldVector(valueVector, "export"),
            provider,
            arrowArray,
            arrowSchema)

Exporting the schema seems quite expensive, since it involves string copies and memory allocations, and the cost grows for complex schemas, especially when nested types are involved.

Internally in Data.exportVector, the schema is exported with:

exportField(allocator, vector.getField(), provider, outSchema);

I wonder if we could refactor CometBatchIterator to just export the schema once, with the first batch, and then have the native side re-use that schema for subsequent batches.
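
A rough sketch of what that could look like on the JVM side, assuming the native side caches the schema it imports with the first batch; the schemaExported flag and the export-once protocol are assumptions, not existing Comet code:

    // Hypothetical refactor: export the schema once, with the first batch.
    // The native side would cache the imported schema and reuse it for
    // subsequent batches, which then carry only an ArrowArray.
    private var schemaExported = false

    def exportBatch(index: Int): Unit = {
      val export = getFieldVector(valueVector, "export")
      if (!schemaExported) {
        val arrowSchema = ArrowSchema.wrap(schemaAddrs(index))
        Data.exportField(allocator, export.getField, provider, arrowSchema)
        schemaExported = true
      }
      // The four-argument overload of Data.exportVector writes only the array.
      val arrowArray = ArrowArray.wrap(arrayAddrs(index))
      Data.exportVector(allocator, export, provider, arrowArray)
    }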

Describe the potential solution

No response

Additional context

No response

@andygrove andygrove added enhancement New feature or request performance labels Nov 23, 2024
@andygrove andygrove added this to the 0.5.0 milestone Nov 23, 2024
@andygrove
Member Author

I refactored the code to make separate FFI calls for exporting the schema and the array, and measured the time of each:

          val arrowSchema = ArrowSchema.wrap(schemaAddrs(index))
          val arrowArray = ArrowArray.wrap(arrayAddrs(index))
          val export = getFieldVector(valueVector, "export")
          // export schema
          val t1 = System.nanoTime()
          Data.exportField(allocator, export.getField, provider, arrowSchema)
          val t2 = System.nanoTime()
          // export array
          Data.exportVector(allocator, export, provider, arrowArray)
          val t3 = System.nanoTime()
          // scalastyle:off println
          println(s"Exported schema in ${t2 - t1} ns and array in ${t3 - t2} ns")

Exporting the schema seems to be more expensive than exporting the data. It seems like this would be worth optimizing.

Exported schema in 1773 ns and array in 421 ns
Exported schema in 1402 ns and array in 401 ns
Exported schema in 1704 ns and array in 891 ns
Exported schema in 1884 ns and array in 351 ns
Exported schema in 1232 ns and array in 1072 ns
Exported schema in 1553 ns and array in 450 ns
Exported schema in 1923 ns and array in 1032 ns
Exported schema in 1392 ns and array in 1663 ns
Exported schema in 481 ns and array in 561 ns
Exported schema in 1022 ns and array in 551 ns
Exported schema in 1082 ns and array in 942 ns

@andygrove
Member Author

The schema can vary between batches due to dictionary encoding, so we may not be able to avoid exporting it each time.
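
If the schema only changes occasionally (for example, when a column switches between dictionary-encoded and plain representations), a middle ground might be to compare the batch's Field against the previous one and re-export only on change. A rough sketch; lastField and the boolean "new schema" signal to the native side are hypothetical:

    // Field.equals compares name, type, nullability, dictionary encoding,
    // and children, so it should detect the dictionary-related variation
    // described above.
    private var lastField: Field = _

    def maybeExportSchema(vector: FieldVector, schemaAddr: Long): Boolean = {
      val field = vector.getField
      if (field != lastField) {
        Data.exportField(allocator, field, provider, ArrowSchema.wrap(schemaAddr))
        lastField = field
        true // tell the native side that a fresh schema was written
      } else {
        false
      }
    }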

@andygrove
Member Author

Another approach could be to use the Arrow C stream interface.
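
With the C stream interface, the schema crosses FFI once via the stream's get_schema callback, and each get_next call moves only an array. Arrow Java can export a whole ArrowReader as a stream; a rough sketch, where the CometArrowReader adapter over Comet's batch iterator is hypothetical:

    import org.apache.arrow.c.{ArrowArrayStream, Data}
    import org.apache.arrow.memory.BufferAllocator
    import org.apache.arrow.vector.ipc.ArrowReader
    import org.apache.arrow.vector.types.pojo.Schema

    // Hypothetical adapter: expose Comet's batches through Arrow's reader API
    // so the whole stream can be handed to the native side in one FFI call.
    class CometArrowReader(allocator: BufferAllocator) extends ArrowReader(allocator) {
      override def loadNextBatch(): Boolean = ??? // load next Comet batch into the VectorSchemaRoot
      override def bytesRead(): Long = 0L
      override protected def closeReadSource(): Unit = {}
      override protected def readSchema(): Schema = ??? // stream-wide schema, exported once
    }

    def exportStream(allocator: BufferAllocator, streamAddr: Long): Unit = {
      // streamAddr is assumed to point at an ArrowArrayStream struct allocated
      // by the native side, which can import it with
      // arrow::ffi_stream::ArrowArrayStreamReader.
      val stream = ArrowArrayStream.wrap(streamAddr)
      Data.exportArrayStream(allocator, new CometArrowReader(allocator), stream)
    }

One caveat: the stream interface fixes the schema up front, so the dictionary-driven schema variation noted above would still need to be resolved before this helps.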

@andygrove
Member Author

Spark has an ArrowBatchStreamWriter, but I don't think it supports dictionaries.

@andygrove andygrove self-assigned this Nov 30, 2024
@andygrove
Member Author

After more metrics improvements in #1133, it is clear that improving FFI performance is not a high priority for now, although moving to the Arrow C stream interface could make it more efficient.
