Notes on potential Ibis integration #355

jonmmease · 2023-07-13T03:44:10Z

jonmmease
Jul 13, 2023
Collaborator

Background

At SciPy 2023, I had a nice chat with @jcrist about the potential of dispatching VegaFusion data transformations to Ibis, which would then dispatch transformations to the suite of SQL backends supported by Ibis.

Benefit

The benefit for end users would be the ability to pass Ibis tables to Altair Chart objects and have the Altair/Vega data transformations (e.g. histogram binning and aggregation) automatically pushed down to the Ibis backend (e.g. postgres) during chart display.

Approach

One approach we talked about would be to add an IbisDataFrame Rust struct to the vegafusion-python-embed crate. An instance of this struct would wrap a Python Ibis table as a PyO3 PyObject. The trait method implementations would use PyO3 to call Ibis data transformation methods on the wrapped Ibis table.

We would need to write conversion functions from DataFusion expressions to Ibis expressions. I'm currently picturing that this logic would be written in Rust, and use PyO3 to call Python methods on the wrapped Ibis tables.

An alternative to writing this logic in Rust would be to come up with a declarative representation of DataFusion expressions that could be serialized to Python. Then the translation of these declarative expressions to Ibis expressions could be implemented in Python. One candidate for this declarative representation would be Substrait. Ibis has early support for ingesting Substrait plans, but it's not well supported and is going to be removed from Ibis. So at the moment, writing the translation logic in Rust + PyO3 seems more straightforward than defining a new declarative representation and converting this representation to Ibis in Python.

jonmmease · 2023-07-13T12:31:40Z

jonmmease
Jul 13, 2023
Collaborator Author

To add a little more color, this IbisDataFrame implementation would parallel the SqlDataFrame implementation:

The IbisDataFrame struct itself would wrap an Ibis table as a PyObject reference. It would implement the DataFrame trait:

The schema() implementation would compute the schema from the wrapped Ibis table (and potentially convert to arrow format)
collect() would evaluate the table and return the result as a VegaFusionTable (which wraps an arrow schema and vector of RecordBatches).
query methods like sort(), filter(), select(), aggregate(), joinaggregate() would call query methods on the wrapped Ibis table and return a new IbisDataFrame instance that wraps the new Ibis table.
As part of implementing filter() and select(), we would need to translate the DataFusion Expr objects (which are passed into these methods) into Ibis Expressions. This logic would parallel the compile logic in the vegafusion-sql crate which converts DataFusion Exprs into sqlparser-rs AST Exprs.

Now that I think about it, maybe the expression compilation logic is in Rust (to avoid the need to serialize DataFusion expressions into Python), but the implementations of the various query methods are written in Python (to make the implementation a bit less awkward).

1 reply

jcrist Jul 13, 2023

Thanks for the writeup Jon!

Now that I think about it, maybe the expression compilation logic is in Rust (to avoid the need to serialize DataFusion expressions into Python), but the implementations of the various query methods are written in Python (to make the implementation a bit less awkward).

I think this sounds like the best approach. This would also be the easiest way for us to help out, especially if we had an example abstract backend class whose methods we could fill in to implement the various operations. While I think many on our team could write Rust, we're definitely a more Python-centric team.

An abstract python backend class might also be useful if you ever wanted to add another python dataframe-like library as a backend, so it may not be ibis-specific work.

Anyway, we're happy to pair on this work if given direction on how we can help out. Thanks!

jonmmease · 2023-07-19T17:24:14Z

jonmmease
Jul 19, 2023
Collaborator Author

Making some progress here. In #363 I'm working on adding a new DataFrameDataset abstract Python class that can be overridden with an implementation based on Ibis (or Polars, pandas, etc.).

The expressions passed in are generated from the DataFusion protobuf specifications. See LogicalExprNode in https://github.com/apache/arrow-datafusion/blob/main/datafusion/proto/proto/datafusion.proto.

It's missing tests, but I probably won't bother with that until we have an implementation to test.

I'll get this released over the next couple of days, and from there development of an Ibis implementation could happen in pure Python outside of VegaFusion (though I'd be interested in integrating it once we have it working).

0 replies

jonmmease · 2023-07-21T00:23:23Z

jonmmease
Jul 21, 2023
Collaborator Author

Ok, I just released VegaFusion 1.4.0rc3. Here's a starting point for an IbisDataset

import vegafusion as vf
from typing import List, Optional, Literal, Union
from vegafusion.dataset import DataFrameDataset, DataFrameOperationNotSupportedError
from vegafusion.proto.datafusion_pb2 import LogicalExprNode
import pyarrow as pa
import ibis

class IbisDataset(DataFrameDataset):
    _exprs = []
    
    def __init__(self, ibis_tbl: ibis.expr.types.relations.Table):
        self.tbl = ibis_tbl
        
    def schema(self) -> pa.Schema:
        """DataFrame's pyarrow schema"""
        return self.tbl.schema().to_pyarrow()
    
    def __dataframe__(self, **kwargs):
        # With ibis dev, pass kwargs through
        return self.tbl.__dataframe__()
    
    def collect(self) -> pa.Table:
        """Return DataFrame's value as a pyarrow Table"""
        return self.tbl.to_pyarrow()

    def select(
        self,
        exprs: List[LogicalExprNode]
    ) -> DataFrameDataset:
        """
        Select columns from Dataset. Selection expressions may include column names,
        column expressions, or window expressions

        :param exprs: Selection expressions
        :return: DataFrameDataset
        """
        print(f"select -\n    {exprs}")
        raise DataFrameOperationNotSupportedError()
        
    def sort(
        self, 
        exprs: List[LogicalExprNode], 
        limit: Optional[int]
    ) -> DataFrameDataset:
        """
        Sort and optionally limit dataset

        :param exprs: Sort expressions
        :param limit: Max number of rows to return
        :return: DataFrameDataset
        """
        print(f"sort -\n    {exprs}\n    {limit}")
        raise DataFrameOperationNotSupportedError()
    
    def aggregate(
        self, 
        group_exprs: List[LogicalExprNode], 
        agg_exprs: List[LogicalExprNode]
    ) -> DataFrameDataset:
        """
        Perform dataset aggregation. Resulting dataset includes grouping
        columns and aggregate expressions

        :param group_exprs: Expressions to group by
        :param agg_exprs: Aggregate expressions
        :return: DataFrameDataset
        """
        print(f"aggregate -\n.   {group_exprs}\n.   {agg_exprs}")
        raise DataFrameOperationNotSupportedError()

    
    def joinaggregate(
        self, 
        group_exprs: List[LogicalExprNode], 
        agg_exprs: List[LogicalExprNode]
    ) -> "DataFrameDataset":
        """
        Perform joinaggregate dataset operation.

        See: https://vega.github.io/vega/docs/transforms/joinaggregate/

        :param group_exprs: Expressions to group by
        :param agg_exprs: Aggregate expressions
        :return: DataFrameDataset
        """
        print(f"joinaggregate -\n.   {group_exprs}\n.   {agg_exprs}")
        raise DataFrameOperationNotSupportedError()
        
    def filter(self, predicate: LogicalExprNode) -> "DataFrameDataset":
        """
        Filter dataset by predicate expression

        :param predicate: Predicate expression
        :return: DataFrameDataset
        """
        print(f"filter -\n.   {predicate}")
        raise DataFrameOperationNotSupportedError()

    def limit(self, limit: int) -> "DataFrameDataset":
        """
        Limit dataset to max number of rows

        :param limit: Max number of rows
        :return: DataFrameDataset
        """
        print(f"limit -\n.   {limit}")
        raise DataFrameOperationNotSupportedError()
                    
    def fold(
        self,
        fields: List[str],
        value_col: str,
        key_col: str,
        order_field: Optional[str],
    ):
        """
        See: https://vega.github.io/vega/docs/transforms/fold/

        :param fields: List of fields to fold
        :param value_col: Name of output value column
        :param key_col: Name of output key column
        :param order_field: Name of input ordering column or
            None if input ordering is not defined
        :return: DataFrameDataset
        """
        print(f"fold -\n    {fields}\n   {value_col}\n   {key_col}\n    {order_field}")
        raise DataFrameOperationNotSupportedError()
    
    def stack(
        self,
        field: str,
        orderby: List[LogicalExprNode],
        groupby: List[str],
        start_field: str,
        stop_field: str,
        mode: Literal["zero", "center", "normalize"],
    ) -> "DataFrameDataset":
        """
        Computes a layout of stacking groups of values

        See: https://vega.github.io/vega/docs/transforms/stack/

        :param field: Column that determines stack height
        :param orderby: Criteria for sorting values within each stack
        :param groupby: List of columns by which to partition data into separate stacks
        :param start_field: Name of output stack start column
        :param stop_field: Name of output stack stop column
        :param mode: Stack mode. One of: "zero", "center", "normalize"
        :return:
        """
        print(f"stack -\n   {field}\n    {orderby}\n    {groupby}\n    {start_field}\n   {stop_field}\n    {mode}")
        raise DataFrameOperationNotSupportedError()
    
    def impute(
        self,
        field: str,
        value: Union[str, int, float],
        key: str,
        groupby: List[str],
        order_field: Optional[str],
    ) -> "DataFrameDataset":
        """
        Performs imputation of missing data objects.

        See: https://vega.github.io/vega/docs/transforms/impute/

        :param field: Column for which missing values should be imputed
        :param value: Value to impute with
        :param key: Key column that uniquely identifies data objects within a group.
            Missing key values (those occurring in the data but not in the current group)
            will be imputed.
        :param groupby: Optional list of columns to group by
        :param order_field: Optional input ordering field. If not provided, input is
            assumed to have arbitrary ordering
        :return:
        """
        print(f"impute -\n   {field}\n    {value}\n    {key}\n    {groupby}\n   {order_field}")
        raise DataFrameOperationNotSupportedError()

The schema and collect implementations are required (and easy with Ibis!). The rest of the query methods (select, sort, etc.) are optional. The default implementations raise a special DataFrameOperationNotSupportedError exception. When this is raised, VegaFusion will call the collect() method to extract the query so far as an Arrow table and then feed that into the default DataFusion implementation and continue on. If they are implemented, they should return a new instance of this IbisDataset class. And it's fine to implement them and then raise the DataFrameOperationNotSupportedError exception if a certain operation is not supported.

The implementation above doesn't take advantage of Ibis, but it's a totally valid implementation as far as VegaFusion is concerned.
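The fallback behavior described above can be modeled in a few lines of plain Python. This is only a sketch under stated assumptions: `OperationNotSupported`, `ToyDataset`, and `run_filter` are hypothetical stand-ins (not VegaFusion classes), a list of dicts plays the role of the backend table, and VegaFusion's real dispatch logic is not shown here.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class OperationNotSupported(Exception):
    """Hypothetical stand-in for DataFrameOperationNotSupportedError."""


class ToyDataset:
    """Hypothetical stand-in for a DataFrameDataset subclass."""

    def __init__(self, rows: List[Row]):
        self.rows = rows

    def collect(self) -> List[Row]:
        # Required: materialize whatever has been computed so far
        return list(self.rows)

    def limit(self, n: int) -> "ToyDataset":
        # Supported: return a NEW dataset instead of mutating this one
        return ToyDataset(self.rows[:n])

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyDataset":
        # Not supported by this backend: signal the caller to fall back
        raise OperationNotSupported()


def run_filter(dataset: ToyDataset, predicate: Callable[[Row], bool]) -> List[Row]:
    """Model of the caller: try the backend, else collect and filter locally."""
    try:
        return dataset.filter(predicate).collect()
    except OperationNotSupported:
        rows = dataset.collect()
        return [row for row in rows if predicate(row)]


data = ToyDataset([{"x": 1}, {"x": 5}, {"x": 9}])
limited = data.limit(2)  # handled by the "backend"
result = run_filter(limited, lambda row: row["x"] > 2)  # falls back after collect()
print(result)
```

The point of the pattern is that an implementation can start with every query method unsupported and move operations over to the backend one at a time, since each unsupported call only costs a `collect()`.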

The expressions passed to these query methods are LogicalExprNode types, which correspond to the DataFusion protobuf definition in https://github.com/apache/arrow-datafusion/blob/main/datafusion/proto/proto/datafusion.proto.

Using the latest Altair dev build, it's possible to use this IbisDataset with Altair and VegaFusion (Be sure to enable the VegaFusion data transformer with alt.data_transformers.enable("vegafusion")).

Here's an example based on https://altair-viz.github.io/gallery/simple_histogram.html.

import altair as alt
import pandas as pd

# Load movies dataset using pandas
movies_df = pd.read_json('https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json')
movies_df["Title"] = movies_df["Title"].astype(str)

# Create an Ibis memtable
ibis_tbl = ibis.memtable(movies_df)

# Wrap Ibis memtable in IbisDataset
ibis_dataset = IbisDataset(ibis_tbl)

# Enable the VegaFusion data transformer!
alt.data_transformers.enable("vegafusion")

# Pass the IbisDataset instance to the altair Chart constructor
# (as if it were a pandas DataFrame)
chart = alt.Chart(ibis_dataset).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    y='count()',
)

# Display the chart
chart

Due to the print statements in the implementation, this will be printed out:

select -
    [alias {
  expr {
    window_expr {
      built_in_function: ROW_NUMBER
      window_frame {
        start_bound {
          window_frame_bound_type: PRECEDING
          bound_value {
            null_value {
              NONE {
              }
            }
          }
        }
        bound {
        }
      }
    }
  }
  alias: "_vf_order"
}
, wildcard: true
]

The first thing VegaFusion is trying to do here is add a ROW_NUMBER column named _vf_order. Equivalent to something like

SELECT ROW_NUMBER() OVER () as _vf_order, * from tbl

Since the IbisDataset.select method raises a DataFrameOperationNotSupportedError after printing the expression, the collect method is called and then the rest of the query is evaluated by DataFusion.
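To see what that ROW_NUMBER query produces, here is a small sketch using Python's built-in sqlite3 module. The table name `tbl` comes from the query above; the column names and rows are made up for illustration, and SQLite is used only because it ships with Python (window functions need SQLite 3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (title TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [("A", 6.1), ("B", 7.4), ("C", 5.0)],  # made-up sample rows
)

# Same shape as the query VegaFusion is asking select() to run
cursor = conn.execute("SELECT ROW_NUMBER() OVER () AS _vf_order, * FROM tbl")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

Every row gains a `_vf_order` value counting from 1. An Ibis implementation would presumably build on `ibis.row_number()`; as far as I know that function counts from 0, so an offset may be needed if a 1-based order is expected.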

Implementation process

It should be possible to build out the Ibis functionality incrementally. Start with select and cover at least the ROW_NUMBER window function (which is used in pretty much every query), column projection, and the creation of new columns that alias expressions. After that, I think the aggregate function will be called next in the example above.
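As a rough illustration of the first step, the sketch below walks a select expression list and handles only the cases named above (wildcard, column, alias, ROW_NUMBER). It makes several assumptions: plain dicts stand in for the LogicalExprNode protobuf messages and mirror the printed output shown earlier, the `column`/`name` keys are a guess at the shape of a column reference, and SQL strings stand in for Ibis expressions. With real protobuf messages you would check which field is set (for example via protobuf's `WhichOneof`) rather than testing dict keys.

```python
from typing import Any, Dict

ExprDict = Dict[str, Any]


def to_sql(expr: ExprDict) -> str:
    """Translate one toy expression node into a SQL fragment."""
    if expr.get("wildcard"):
        return "*"
    if "column" in expr:
        return '"' + expr["column"]["name"] + '"'
    if "alias" in expr:
        node = expr["alias"]
        return to_sql(node["expr"]) + ' AS "' + node["alias"] + '"'
    if "window_expr" in expr:
        if expr["window_expr"]["built_in_function"] == "ROW_NUMBER":
            return "ROW_NUMBER() OVER ()"
    # Anything else is not handled yet; a real select() would raise
    # DataFrameOperationNotSupportedError here to trigger the fallback.
    raise NotImplementedError(f"unsupported expression: {expr}")


# Toy version of the expression list printed by select() above
exprs = [
    {
        "alias": {
            "expr": {"window_expr": {"built_in_function": "ROW_NUMBER"}},
            "alias": "_vf_order",
        }
    },
    {"wildcard": True},
]
select_list = ", ".join(to_sql(e) for e in exprs)
query = f"SELECT {select_list} FROM tbl"
print(query)
```

Raising for every unrecognized node keeps the translator safe to ship early: unsupported expressions simply fall back to DataFusion.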

You can follow the pattern above to play with any Altair example from the Altair gallery. Just keep in mind that column type inference for objects implementing the DataFrame Interchange Protocol isn't on master yet. So you may need to add :Q/:N suffixes to the column names for some field specifications. Or you can use this branch (vega/altair#3114), which should make inference work the same way as for pandas.

Adding IbisDataset to VegaFusion

It doesn't have to, but I'd be happy to have IbisDataset live in the VegaFusion repo eventually. An advantage of this is that it would be fairly straightforward to allow users to pass ibis tables directly into alt.Chart (without manually wrapping them in IbisDataset), as VegaFusion could wrap it automatically.

Another advantage is that we'll be able to test it against other backends more easily, and I'll know if I make changes to the API that break things. I can't commit to this interface being stable yet (that's probably a VegaFusion 2.0 milestone), but if we add it to the VegaFusion repo and get tests working then I can keep it working even if the interface needs to adapt.

I'm really excited for this integration! Let me know how I can support you all

9 replies

jonmmease Aug 3, 2023
Collaborator Author

Ok, on Altair main you can trigger the VegaFusion portion of what Altair does on render by calling chart.to_json(format="vega") with the VegaFusion data transformer enabled (alt.data_transformers.enable("vegafusion")).

Here's a full standalone Python script that successfully drops into pdb when called from the CLI

try_ibis_dataset.py

from typing import List, Optional, Literal, Union
from vegafusion.dataset import DataFrameDataset, DataFrameOperationNotSupportedError
from vegafusion.proto.datafusion_pb2 import LogicalExprNode
import pyarrow as pa
import ibis


class IbisDataset(DataFrameDataset):
    def __init__(self, ibis_tbl: ibis.expr.types.relations.Table):
        self.tbl = ibis_tbl

    def schema(self) -> pa.Schema:
        """DataFrame's pyarrow schema"""
        return self.tbl.schema().to_pyarrow()

    def __dataframe__(self, **kwargs):
        # With ibis dev, pass kwargs through
        return self.tbl.__dataframe__()

    def collect(self) -> pa.Table:
        """Return DataFrame's value as a pyarrow Table"""
        return self.tbl.to_pyarrow()

    def select(
            self,
            exprs: List[LogicalExprNode]
    ) -> DataFrameDataset:
        """
        Select columns from Dataset. Selection expressions may include column names,
        column expressions, or window expressions

        :param exprs: Selection expressions
        :return: DataFrameDataset
        """
        print(f"select -\n    {exprs}")
        breakpoint()
        raise DataFrameOperationNotSupportedError()

    def sort(
            self,
            exprs: List[LogicalExprNode],
            limit: Optional[int]
    ) -> DataFrameDataset:
        """
        Sort and optionally limit dataset

        :param exprs: Sort expressions
        :param limit: Max number of rows to return
        :return: DataFrameDataset
        """
        print(f"sort -\n    {exprs}\n    {limit}")
        raise DataFrameOperationNotSupportedError()

    def aggregate(
            self,
            group_exprs: List[LogicalExprNode],
            agg_exprs: List[LogicalExprNode]
    ) -> DataFrameDataset:
        """
        Perform dataset aggregation. Resulting dataset includes grouping
        columns and aggregate expressions

        :param group_exprs: Expressions to group by
        :param agg_exprs: Aggregate expressions
        :return: DataFrameDataset
        """
        print(f"aggregate -\n.   {group_exprs}\n.   {agg_exprs}")
        raise DataFrameOperationNotSupportedError()


    def joinaggregate(
            self,
            group_exprs: List[LogicalExprNode],
            agg_exprs: List[LogicalExprNode]
    ) -> "DataFrameDataset":
        """
        Perform joinaggregate dataset operation.

        See: https://vega.github.io/vega/docs/transforms/joinaggregate/

        :param group_exprs: Expressions to group by
        :param agg_exprs: Aggregate expressions
        :return: DataFrameDataset
        """
        print(f"joinaggregate -\n.   {group_exprs}\n.   {agg_exprs}")
        raise DataFrameOperationNotSupportedError()

    def filter(self, predicate: LogicalExprNode) -> "DataFrameDataset":
        """
        Filter dataset by predicate expression

        :param predicate: Predicate expression
        :return: DataFrameDataset
        """
        print(f"filter -\n.   {predicate}")
        raise DataFrameOperationNotSupportedError()

    def limit(self, limit: int) -> "DataFrameDataset":
        """
        Limit dataset to max number of rows

        :param limit: Max number of rows
        :return: DataFrameDataset
        """
        print(f"limit -\n.   {limit}")
        raise DataFrameOperationNotSupportedError()

    def fold(
            self,
            fields: List[str],
            value_col: str,
            key_col: str,
            order_field: Optional[str],
    ):
        """
        See: https://vega.github.io/vega/docs/transforms/fold/

        :param fields: List of fields to fold
        :param value_col: Name of output value column
        :param key_col: Name of output key column
        :param order_field: Name of input ordering column or
            None if input ordering is not defined
        :return: DataFrameDataset
        """
        print(f"fold -\n    {fields}\n   {value_col}\n   {key_col}\n    {order_field}")
        raise DataFrameOperationNotSupportedError()

    def stack(
            self,
            field: str,
            orderby: List[LogicalExprNode],
            groupby: List[str],
            start_field: str,
            stop_field: str,
            mode: Literal["zero", "center", "normalize"],
    ) -> "DataFrameDataset":
        """
        Computes a layout of stacking groups of values

        See: https://vega.github.io/vega/docs/transforms/stack/

        :param field: Column that determines stack height
        :param orderby: Criteria for sorting values within each stack
        :param groupby: List of columns by which to partition data into separate stacks
        :param start_field: Name of output stack start column
        :param stop_field: Name of output stack stop column
        :param mode: Stack mode. One of: "zero", "center", "normalize"
        :return:
        """
        print(f"stack -\n   {field}\n    {orderby}\n    {groupby}\n    {start_field}\n   {stop_field}\n    {mode}")
        raise DataFrameOperationNotSupportedError()

    def impute(
            self,
            field: str,
            value: Union[str, int, float],
            key: str,
            groupby: List[str],
            order_field: Optional[str],
    ) -> "DataFrameDataset":
        """
        Performs imputation of missing data objects.

        See: https://vega.github.io/vega/docs/transforms/impute/

        :param field: Column for which missing values should be imputed
        :param value: Value to impute with
        :param key: Key column that uniquely identifies data objects within a group.
            Missing key values (those occurring in the data but not in the current group)
            will be imputed.
        :param groupby: Optional list of columns to group by
        :param order_field: Optional input ordering field. If not provided, input is
            assumed to have arbitrary ordering
        :return:
        """
        print(f"impute -\n   {field}\n    {value}\n    {key}\n    {groupby}\n   {order_field}")
        raise DataFrameOperationNotSupportedError()


if __name__ == "__main__":
    import altair as alt
    import pandas as pd

    # Load movies dataset using pandas
    movies_df = pd.read_json('https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json')
    movies_df["Title"] = movies_df["Title"].astype(str)

    # Create an Ibis memtable
    ibis_tbl = ibis.memtable(movies_df)

    # Wrap Ibis memtable in IbisDataset
    ibis_dataset = IbisDataset(ibis_tbl)

    # Enable the VegaFusion data transformer!
    alt.data_transformers.enable("vegafusion")

    # Pass the IbisDataset instance to the altair Chart constructor
    # (as if it were a pandas DataFrame)
    chart = alt.Chart(ibis_dataset).mark_bar().encode(
        alt.X("IMDB_Rating:Q", bin=True),
        y='count()',
    )

    # Display the chart
    print(chart.to_json(format="vega"))

% python try_ibis_dataset.py

...
> /Users/jonmmease/scratch/ibis/try_ibis_dataset.py(40)select()
-> raise DataFrameOperationNotSupportedError()
(Pdb)

I usually develop in PyCharm, and I was also able to set a breakpoint inside the select method and use the PyCharm debugger REPL to play around with expressions.

jcrist Aug 3, 2023

Excellent, thanks. I can confirm that works for me.

Last question - is there docs or a reference implementation somewhere I can use for determining the meaning behind the various types of LogicalExprNode? I've looked through the proto file - some of them I can guess at their meaning, but some I'm not 100% sure what the intention is.

jonmmease Aug 3, 2023
Collaborator Author

Great!

In terms of interpreting the LogicalExprNode objects: these correspond to the DataFusion Expr type, which is part of the datafusion-expr crate. So the first thing to do is look at the docs for this crate at https://docs.rs/datafusion-expr/28.0.0/datafusion_expr/. There you will at least find documentation on the intended meaning of the fields of each type.

For example, the CaseNode corresponds to the datafusion_expr::expr::Case struct, which is documented at https://docs.rs/datafusion-expr/28.0.0/datafusion_expr/expr/struct.Case.html.
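To make the meaning of those fields concrete, here is a toy Python mirror of the Case struct with a small evaluator. The three field names (`expr`, `when_then_expr`, `else_expr`) follow the DataFusion struct; everything else is an assumption for illustration: expressions are modeled as plain Python callables over a row dict, `eval_case` is a hypothetical helper, and SQL NULL comparison rules are ignored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Expr = Callable[[Row], object]  # toy expression: a function of one row


@dataclass
class Case:
    expr: Optional[Expr]  # present for "CASE expr WHEN value THEN ..."
    when_then_expr: List[Tuple[Expr, Expr]]  # (when, then) pairs, in order
    else_expr: Optional[Expr]  # result when nothing matches (None -> NULL)


def eval_case(case: Case, row: Row) -> object:
    """Evaluate the toy CASE expression against one row."""
    for when, then in case.when_then_expr:
        if case.expr is not None:
            # Simple form: compare the base expression to each WHEN value
            matched = case.expr(row) == when(row)
        else:
            # Searched form: each WHEN is itself a boolean predicate
            matched = bool(when(row))
        if matched:
            return then(row)
    return case.else_expr(row) if case.else_expr is not None else None


# CASE WHEN rating >= 8 THEN 'high' WHEN rating >= 5 THEN 'mid' ELSE 'low' END
bucket = Case(
    expr=None,
    when_then_expr=[
        (lambda r: r["rating"] >= 8, lambda r: "high"),
        (lambda r: r["rating"] >= 5, lambda r: "mid"),
    ],
    else_expr=lambda r: "low",
)
labels = [eval_case(bucket, {"rating": value}) for value in (9.1, 6.0, 2.5)]
print(labels)
```

A translator for a real backend would build the backend's own case expression from the same three fields instead of evaluating rows in Python.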

DataFusion's implementation of these expressions is spread through a few other crates in the DataFusion project, but it's all in this repo (https://github.com/apache/arrow-datafusion). Happy to dig into particular questions as they come up.

jcrist Aug 3, 2023

Perfect, that should be enough to get me going. Thanks!

jonmmease Aug 24, 2023
Collaborator Author

FYI, the DataFrameDataset base class is now released in VegaFusion 1.4.0 and Altair 5.1 should also be released relatively soon (probably next week).
