This document is a proposal for extending the Expression typing system in Daft.
RFC Summary
After the implementation of this RFC, Daft's typing system will look like the following:
Arrow types
a) Arrow native types (int64, float64, list[int64] etc)
b) Daft Arrow Extension types (image, audio, video, latlong etc)
Python types
Daft only supports serializing Arrow types for writing to disk and long-term storage. For storing Python types, users will first have to marshal data (e.g. DaftImage.from_pil(df["pil_image"])) into the appropriate Arrow type before leveraging Daft's tooling for saving the data.
RFC Details
Motivation
Expression Types in Daft serve the following purposes:
Validate Expressions at definition time instead of failing at runtime
Help Daft visualize the data in each column
Help Daft understand how to represent data in-memory for efficient and convenient data manipulation
Help Daft understand how to serialize data for long-term storage
Let Daft define common operations on these data types, such as extracting the year from a date or concatenating two strings
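To make the first and last purposes concrete, here is a minimal sketch of definition-time validation. It does not use Daft's actual API: `Expr`, `col`, `call` and the `SIGNATURES` table are hypothetical names invented for illustration, and types are plain strings.

```python
from dataclasses import dataclass

# Hypothetical operation table, for illustration only (not Daft's API).
# Maps (operation, input type names) -> result type name.
SIGNATURES = {
    ("year", ("date",)): "int64",
    ("concat", ("string", "string")): "string",
}


@dataclass(frozen=True)
class Expr:
    """A typed expression node; no data is attached at definition time."""
    name: str
    dtype: str


def col(name: str, dtype: str) -> Expr:
    """Reference a column with a known type."""
    return Expr(name, dtype)


def call(op: str, *args: Expr) -> Expr:
    """Build a new expression, rejecting invalid type combinations immediately."""
    key = (op, tuple(a.dtype for a in args))
    if key not in SIGNATURES:
        raise TypeError(f"{op} is not defined for input types {key[1]}")
    arg_names = ", ".join(a.name for a in args)
    return Expr(f"{op}({arg_names})", SIGNATURES[key])


# Valid: extracting the year from a date yields an int64 expression.
year_expr = call("year", col("created", "date"))

# Invalid: the error surfaces while building the expression, before any data is read.
try:
    call("year", col("title", "string"))
except TypeError as err:
    validation_error = str(err)
```

Because the result type of every expression is known up front, the same information could also drive visualization and serialization decisions.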
Daft currently has 2 main types:
"Primitive" types are a subset of the Arrow type specification represented in-memory as Arrow, and we leverage Arrow's serialization capabilities when saving data to disk
"Python" types are represented in-memory as Python objects, and cannot be serialized when saving data to disk
Problem
This typing system has the following shortcomings:
Hard to add custom complex types: since all complex types are represented as PY types, Daft currently has no support for complex types as an Arrow type
Cannot serialize Python types for long-term storage: this follows from the previous point. Aside from users writing their own UDFs to serialize a PY type to a Daft BYTES type, Daft has no support for native complex types at the moment.
Custom visualizations: Daft currently special-cases the display of PIL images, but adding custom visualization code for every possible Python type is not feasible
Solution
A solution needs to provide:
Extensibility to new complex types
a) Should have a serializable in-memory format (e.g. Arrow, protobuf, flatbuffer)
b) Ability to define methods/kernels on these complex types for domain-specific functionality (e.g. image resizing)
c) Ability to marshal to/from common Python representations such as numpy, PIL etc for custom processing
d) Visualization logic for complex types during interactive development
Flexibility at runtime to represent data as Python types for ease of development
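The four requirements (a) to (d) can be pictured on a single toy type. This is only a sketch under stated assumptions: `BBox` and all of its methods are hypothetical, and JSON bytes stand in for a real serializable format such as an Arrow storage type.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BBox:
    """Hypothetical complex type: an axis-aligned bounding box."""
    x: float
    y: float
    w: float
    h: float

    # (a) Serializable representation. JSON bytes are a stand-in for Arrow here.
    def to_bytes(self) -> bytes:
        return json.dumps([self.x, self.y, self.w, self.h]).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "BBox":
        return cls(*json.loads(raw.decode("utf-8")))

    # (b) Domain-specific method/kernel defined on the type.
    def area(self) -> float:
        return self.w * self.h

    # (c) Marshalling to/from a common Python representation (a plain tuple).
    @classmethod
    def from_tuple(cls, values: Tuple[float, float, float, float]) -> "BBox":
        return cls(*values)

    def to_tuple(self) -> Tuple[float, float, float, float]:
        return (self.x, self.y, self.w, self.h)

    # (d) Visualization logic for interactive development.
    def display(self) -> str:
        return f"bbox(x={self.x}, y={self.y}, w={self.w}, h={self.h})"


box = BBox.from_tuple((1.0, 2.0, 3.0, 4.0))
restored = BBox.from_bytes(box.to_bytes())  # round trip through the serialized form
```

Keeping all four capabilities on the type itself is one possible design; they could equally live in separate registries keyed by type.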
User Experience
This is how users interact with Daft types, and what they are printed as in a dataframe visualization:
    import daft
    import daft.types as dtype

    # Simple builtin types
    dtype.int64()  # int64
    dtype.list(dtype.int64())  # list[int64]

    # Type aliasing of common Python types for convenience
    int  # Daft aliases this to dtype.int64()
    str  # Daft aliases this to dtype.string()
    datetime.date  # Daft aliases this to dtype.date()

    # Complex types inheriting from daft.ArrowExtensionType
    dtype.ImageType()  # image
    dtype.BBoxType()  # bbox

    # Python types
    list  # PY[list]
    np.ndarray  # PY[list]
    PIL.Image.Image  # PY[Image]
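The aliasing rule described above could be resolved with a simple lookup. The sketch below is an assumption about how such a mapping might work, not Daft's implementation: `PYTHON_TYPE_ALIASES` and `resolve_type_name` are hypothetical names, and types are represented only by their display strings.

```python
import datetime

# Aliases taken from the listing above; values are the displayed type names.
PYTHON_TYPE_ALIASES = {
    int: "int64",
    str: "string",
    datetime.date: "date",
}


def resolve_type_name(py_type: type) -> str:
    """Return the aliased type name, or fall back to a PY[...] Python type."""
    if py_type in PYTHON_TYPE_ALIASES:
        return PYTHON_TYPE_ALIASES[py_type]
    return f"PY[{py_type.__name__}]"


resolved = [resolve_type_name(t) for t in (int, str, datetime.date, list)]
```

The lookup is exact, so subclasses of an aliased type (for example `datetime.datetime`) fall through to the `PY[...]` branch in this sketch.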
Example workflow of loading images from Parquet, converting to PIL, performing custom operations in Python and PIL, and then saving the data again:
Discussed in #335
Originally posted by jaychia November 21, 2022
[RFC] Arrow/Py/Daft Expression Types