Implement Arrow extension type-based Python UDF

At the moment, we cannot use `@pandas_udf` for a UDF that operates on `TileUDT` columns.  This is due to a lack of support in Spark's [`ArrowUtils`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala#L36-L57) for Arrow extension types.

The proposal here is to circumvent this omission by reimplementing the same UDF pathway in RF, but with proper support for Arrow extension types.  RasterFrames has already shown that we can define new objects in the `org.apache.spark` package namespace to get around package-private definitions, so we can use the same method to provide a new implementation.

In many ways, this will be a cut-and-paste operation, simply importing and renaming classes from Spark, and providing an `@arrow_udf` decorator that cribs directly from [`pandas_udf`](https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L2915-L3389), and redirects into our modified implementation.
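As a rough illustration of the decorator half, here is a minimal sketch of the calling convention only. It assumes `@arrow_udf` should accept the same two forms as `pandas_udf` (bare function plus `returnType`, or return type first as a decorator factory). The `ARROW_SCALAR_UDF` tag and the `returnType`/`evalType` attributes are placeholders invented for this sketch, and nothing here is wired into Spark.

```python
import functools

# Hypothetical eval-type tag for this sketch; not one of Spark's PythonEvalType constants.
ARROW_SCALAR_UDF = "ARROW_SCALAR_UDF"


def _wrap(func, return_type):
    """Attach the metadata a modified UDF pathway would need to the user function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    wrapper.returnType = return_type
    wrapper.evalType = ARROW_SCALAR_UDF
    return wrapper


def arrow_udf(f=None, returnType=None):
    """Usable as arrow_udf(func, returnType=...) or as @arrow_udf(return_type)."""
    if callable(f):
        # Direct form: arrow_udf(func, returnType="...")
        return _wrap(f, returnType)
    # Decorator-factory form: the first positional argument is the return type.
    resolved = returnType if returnType is not None else f
    return lambda func: _wrap(func, resolved)


@arrow_udf("tile")
def identity(tile):
    return tile
```

The real decorator would hand the wrapped function to the reimplemented UDF pathway rather than just tagging it; the sketch only pins down the user-facing shape.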

The real work here will be to plumb in the Arrow types needed for the system to work.  Of course, we need to reimplement `ArrowUtils` to include extension types, but we also need to make sure that we can properly interface with the extension type registry on both ends of the transaction.  This is more worrisome in the Python context, where [`worker.py`](https://github.com/apache/spark/blob/master/python/pyspark/worker.py) is going to need access to the type definition on the Python side, in a separate process on the executor nodes.  Figuring this out is unlikely to be a gimme.

This work will also require that tiles have an extension type representation.  This connects with issues #5 and #10.

Implement Arrow extension type-based Python UDF #13