Skip to content

Implement Arrow extension type-based Python UDF #13

@jpolchlo

Description

@jpolchlo

At the moment, we cannot use @pandas_udfs if we want to use TileUDT columns in the UDF. This is due to a lack of support in Spark's ArrowUtils for Arrow extension types.

The proposal here is to circumvent this omission by reimplementing the same UDF pathway in RF, but with proper support for Arrow types. Rasterframes already has shown that we can define new objects in the org.apache.spark package namespace to get around package-private definitions, so we can utilize the same method to provide a new implementation.

In many ways, this will be a cut-and-paste operation, simply importing and renaming classes from Spark, and providing an @arrow_udf decorator that cribs directly from pandas_udf, and redirects into our modified implementation.

The real work here will be to plumb in the Arrow types needed for the system to work. Of course, we need to reimplement ArrowUtils to include extension types, but we also need to make sure that we can properly interface with the extension type registry on both ends of the transaction. This is more worrisome in the Python context, where worker.py is going to need to have access to the type definition on the python side, in separate process on the executor nodes. Figuring this out is unlikely to be a gimme.

This work will also require that tiles have an extension type representation. This connects with issues #5 and #10.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions