Support for type inference of dataframes using the DataFrame Interchange Protocol #3112
Comments
One thing to be careful about here: I don't want to force the evaluation of lazy dataframe-like objects just to get the schema. For example, we don't want to trigger a full Ibis query just to get the schema info out. For plain Altair, this isn't a big deal as long as we convert to Arrow at the same time as extracting the schema info. But for the "vegafusion" data transformer to have the chance to push computation down to the native data structure (e.g. into Ibis eventually), we don't want to convert the whole thing to Arrow up front. @jcrist, do you know if there's a way to get schema info from the DataFrame interchange protocol without triggering an Ibis query? If not, we might need to add some specialized schema extraction logic (which doesn't trigger full evaluation) for the backends that VegaFusion supports.
Not with the way we currently implement the If vegafusion has its own abstract layer though (as written about in vega/vegafusion#355), wouldn't you immediately convert to that wrapper class and use the generic apis described there instead? Or does altair still need access to the schema separately? |
VegaFusion will always be optional for Altair, so we do need a way to support this in Altair core. I think the ideal situation is that core Altair only knows about the I need to read the spec in more detail, do you know of any examples of using If there's a path to libraries providing the type through |
Oh, never mind. Just playing with pandas and pyarrow, this does look straightforward from the Altair side:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": ["A", "BB", "CCC"]})
dfi = df.__dataframe__()  # interchange-protocol view of the frame
dfi.column_names()
dt = dfi.get_column_by_name('b').dtype  # tuple whose first element is the dtype kind
dt[0].name
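Building on the snippet above, here is a minimal sketch of how the protocol's dtype kinds could be mapped to Vega-Lite encoding types using only schema metadata. The function name `infer_types_from_interchange` and the mapping table are assumptions made for illustration, not Altair's implementation. Only `__dataframe__()`, `column_names()`, `get_column_by_name()` and the `dtype` tuple come from the interchange protocol. Whether reading `dtype` stays lazy depends on the library that implements the protocol.

```python
# Hypothetical mapping from interchange-protocol DtypeKind names to
# Vega-Lite encoding types; these choices are assumptions.
KIND_TO_VEGALITE = {
    "INT": "quantitative",
    "UINT": "quantitative",
    "FLOAT": "quantitative",
    "BOOL": "nominal",
    "STRING": "nominal",
    "DATETIME": "temporal",
    "CATEGORICAL": "nominal",
}


def infer_types_from_interchange(data):
    """Return {column name: Vega-Lite type} from interchange metadata."""
    dfi = data.__dataframe__()
    inferred = {}
    for name in dfi.column_names():
        # dtype is a tuple; its first element is the DtypeKind enum member.
        kind = dfi.get_column_by_name(name).dtype[0]
        # Fall back to "nominal" for kinds missing from the table (an assumption).
        inferred[name] = KIND_TO_VEGALITE.get(kind.name, "nominal")
    return inferred
```

Called on the pandas frame from the snippet above, this should map `a` to quantitative and `b` to nominal.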
@mattijn, it looks like we don't need pyarrow to use the I'll give this a try soon. |
Nice! Good find 👍
The spec makes this possible (as you found in a later comment), but I don't know of any libraries currently consuming this in a way where making this lazy on ibis's side would be useful. The |
As was raised in #3109, the DataFrame Interchange Protocol is still experimental and currently lacks features such as type inference.
The current type inference for pandas dataframes cannot be used for all dataframes (that is, those parsed through the DataFrame Interchange Protocol).
Altair has adopted pyarrow for support of the DataFrame Interchange Protocol, so these pyarrow data types will need to be mapped to the encoding data types available in Altair.
The current implementation of type inference for columns in pandas DataFrames happens around here, which calls this infer_vegalite_type function. This function needs expansion. Initially it is probably best to do it side-by-side, so we keep the current implementation for pandas dataframes and add a new implementation for dataframes that are parsed through the DataFrame Interchange Protocol.
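The side-by-side approach could be sketched as a small dispatcher. Everything here is hypothetical: `infer_column_types` is an illustrative name, and the two callables stand in for the existing pandas inference and a new interchange-based inference. The module-name check is just one way to detect pandas objects without importing pandas.

```python
def infer_column_types(data, pandas_path, interchange_path):
    """Route pandas DataFrames to the existing inference and any other
    object exposing __dataframe__ to the new interchange-based inference."""
    # pandas DataFrames also expose __dataframe__, so check for pandas first.
    is_pandas = type(data).__module__.split(".")[0] == "pandas"
    if is_pandas:
        return pandas_path(data)
    if hasattr(data, "__dataframe__"):
        return interchange_path(data)
    raise TypeError(f"Unsupported data type: {type(data).__name__}")
```

Keeping the two paths behind one entry point would let the pandas behaviour stay unchanged while the interchange path is developed.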
Some example data of a pyarrow table that can be used during development of this feature request: