
(Arrow + ComplexArrow) / Python Types #367

Closed

jaychia opened this issue Dec 5, 2022 Discussed in #335 · 1 comment

Comments

jaychia (Contributor) commented Dec 5, 2022

Discussed in #335

Originally posted by jaychia November 21, 2022

[RFC] Arrow/Py/Daft Expression Types

This document is a proposal for extending the Expression typing system in Daft.

RFC Summary

After the implementation of this RFC, Daft's typing system will look like the following:

  1. Arrow types
    a) Arrow native types (int64, float64, list[int64], etc.)
    b) Daft Arrow extension types (image, audio, video, latlong, etc.)
  2. Python types

Daft only supports serializing Arrow types for writing to disk and long-term storage. To store Python types, users must first marshal their data (e.g. DaftImage.from_pil(df["pil_image"])) into an appropriate Arrow type before leveraging Daft's tooling for saving the data.

RFC Details

Motivation

Expression Types in Daft serve the following purposes:

  1. Allow Expressions to be validated at definition-time instead of failing at runtime
  2. Help Daft visualize the data in each column
  3. Help Daft understand how to represent data in-memory for efficient and convenient data manipulation
  4. Help Daft understand how to serialize data for long-term storage
  5. Define common operations on these data types, such as extracting the year from a date or concatenating two strings (see the sketch after this list)
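
As an illustration of purposes (1) and (5), a hypothetical sketch; the from_pydict constructor, the .dt.year() accessor, and string concatenation via + are assumptions for this proposal, not a confirmed Daft API:

import datetime
from daft import DataFrame

df = DataFrame.from_pydict({
    "date": [datetime.date(2022, 12, 5)],
    "first": ["Ada"],
    "last": ["Lovelace"],
})

# Extracting a year is only valid on a date-typed column, so this expression
# can be validated when it is defined rather than failing mid-job.
df = df.with_column("year", df["date"].dt.year())

# A common operation defined on string types: concatenation.
df = df.with_column("full_name", df["first"] + " " + df["last"])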

Daft currently has 2 main types:

  1. "Primitive" types are a subset of the Arrow type specification represented in-memory as Arrow, and we leverage Arrow's serialization capabilities when saving data to disk
  2. "Python" types are represented in-memory as Python objects, and cannot be serialized when saving data to disk

Problem

This typing system has the following shortcomings:

  1. Hard to add custom complex types: since all complex types are represented as PY types, Daft currently has no support for complex types backed by Arrow
  2. Cannot serialize Python types for long-term storage: this follows from the previous point. Aside from users writing their own UDFs to serialize a PY type to a Daft BYTES type (a workaround sketched below), there is no native support for serializing complex types at the moment
  3. Custom visualizations: Daft currently special-cases the display of PIL images, but adding custom visualization code for every possible Python type is not feasible
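
The manual workaround that point (2) refers to might look like the following; the @udf decorator and its return_dtype parameter are assumptions about Daft's UDF interface, and PNG is just one possible encoding:

import io

from daft import udf

# Hand-rolled serialization: encode each PIL image into bytes so the column
# can be stored as a Daft BYTES type instead of an unserializable PY type.
@udf(return_dtype=bytes)
def pil_to_png_bytes(images):
    encoded = []
    for img in images:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        encoded.append(buf.getvalue())
    return encoded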

Solution

A solution needs to provide:

  1. Extensibility to new complex types (a sketch of this follows the list)
    a) Should have a serializable in-memory format (e.g. Arrow, protobuf, flatbuffer)
    b) Ability to define methods/kernels on these complex types for domain-specific functionality (e.g. image resizing)
    c) Ability to marshal to/from common Python representations such as numpy, PIL etc for custom processing
    d) Visualization logic for complex types during interactive development
  2. Flexibility at runtime to represent data as Python types for ease of development
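
A sketch of how requirement (1a) could be built on pyarrow's existing extension-type mechanism; BBoxType matches the example named later in this RFC, while the "daft.bbox" name and the four-float64 storage layout are assumptions:

import pyarrow as pa

class BBoxType(pa.ExtensionType):
    """A bounding box stored as four float64 coordinates (assumed layout)."""

    def __init__(self):
        super().__init__(pa.list_(pa.float64(), 4), "daft.bbox")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to persist beyond the storage type

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

# Registration lets Arrow IPC/Parquet round-trip the type, which is what
# makes the complex type usable for long-term storage.
pa.register_extension_type(BBoxType())

storage = pa.array([[0.0, 0.0, 10.0, 10.0]], type=pa.list_(pa.float64(), 4))
bboxes = pa.ExtensionArray.from_storage(BBoxType(), storage)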

User Experience

This is how users interact with Daft types, and how each type is printed in a dataframe visualization:

import daft
import daft.types as dtype

# Simple builtin types
dtype.int64()  # int64
dtype.list(dtype.int64())  # list[int64]

# Type aliasing of common Python types for convenience
int  # Daft aliases this to dtype.int64()
str  # Daft aliases this to dtype.string()
datetime.date  # Daft aliases this to dtype.date()

# Complex types inheriting from daft.ArrowExtensionType
dtype.ImageType()  # image
dtype.BBoxType()  # bbox

# Python types
list  # PY[list]
np.ndarray  # PY[np.ndarray]
PIL.Image.Image  # PY[Image]

Example workflow of loading images from Parquet, converting to PIL, performing custom operations in Python and PIL, and then saving the data again:

from daft import DataFrame
import daft.types as dtype

def resize_img(img):
    # User-defined Python logic operating on PIL images; the target size here is arbitrary.
    return img.resize((224, 224))

df = DataFrame.read_parquet("s3://my-images-parquet-files/")  # image_data: image
df = df.with_column("pil_image", df["image_data"].image.to_pil().apply(resize_img))  # pil_image: PY[PIL.Image.Image]
df = df.with_column("resized_image", dtype.ImageType.from_pil(df["pil_image"]))  # resized_image: image
df.write_parquet("s3://my-resized-images-parquet-files/")

jaychia commented Apr 24, 2023

Requires new specifications after Rust execution refactor

@jaychia moved this to Done in Daft-OSS Apr 24, 2023
@jaychia closed this as not planned Apr 24, 2023