
Python: Automatically convert Pandas types to valid Delta Lake types in write_deltalake() #686

Closed
wjones127 opened this issue Jul 11, 2022 · 10 comments · Fixed by #1820

Labels: binding/python (Issues for the Python package), enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@wjones127 (Collaborator)

Description

Many Pandas types aren't automatically converted into valid Delta Lake types when converted into Arrow tables. For example, Pandas Timestamps are converted into timestamps with nanosecond precision by default, but Delta Lake only supports microsecond precision. This makes write_deltalake() difficult to use for Pandas users.

We should write a test that validates that all Pandas types can be written with write_deltalake() without manual conversion.

I'm not sure yet how to configure the conversion here:

```python
if _has_pandas and isinstance(data, pd.DataFrame):
    data = pa.Table.from_pandas(data)
```

It's possible that we can pass an adjusted schema to the schema parameter of pyarrow.Table.from_pandas(), and that will make the correct conversion.
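A minimal sketch of that idea (the column name and value are made up for illustration):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"ts": pd.to_datetime(["2022-07-11"])})

# By default, from_pandas keeps pandas' nanosecond precision, which Delta Lake rejects.
assert pa.Table.from_pandas(df).schema.field("ts").type == pa.timestamp("ns")

# Passing an adjusted schema makes from_pandas cast to microseconds.
schema = pa.schema([pa.field("ts", pa.timestamp("us"))])
table = pa.Table.from_pandas(df, schema=schema)
assert table.schema.field("ts").type == pa.timestamp("us")
```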

Use Case

Related Issue(s)

Based on #685

wjones127 added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Jul 11, 2022
wjones127 added a commit that referenced this issue Dec 1, 2022
# Description
As described in #686, some Pandas datatypes are not converted to a format
that is compatible with Delta Lake. This handles the case of
timestamps, which are stored with `ns` resolution in Pandas. Here, if a
schema is not provided, we convert the timestamps to `us` resolution.

We also update `python/tests/test_writer.py::test_write_pandas` to
reflect this change.

# Related Issue(s)
#685

Co-authored-by: Will Jones <willjones127@gmail.com>
@blaze225

Would appreciate it if this could be prioritized. Right now this is forcing us to use Spark over delta-rs.

@ion-elgreco (Collaborator)

This also happens when you write Delta tables from Polars with nanosecond-precision datetime columns. However, it's slightly easier to work around there: you just have to cast to microsecond precision first.
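For example, a minimal sketch of that cast in Polars (the column name and table path are illustrative):

```python
import polars as pl
from datetime import datetime

df = pl.DataFrame({"ts": [datetime(2023, 9, 24)]}).with_columns(
    pl.col("ts").cast(pl.Datetime("ns"))  # pretend the source column is nanosecond precision
)

# Cast down to microsecond precision before writing to Delta.
df = df.with_columns(pl.col("ts").cast(pl.Datetime("us")))
df.write_delta("path/to/table")
```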

@ion-elgreco (Collaborator) commented Sep 24, 2023

> Would appreciate it if this could be prioritized. Right now this is forcing us to use Spark over delta-rs.

@blaze225 You can also switch to Polars, which casts the dtypes correctly to a Delta-compatible schema: https://github.com/pola-rs/polars/pull/10165/files#diff-843e4fa7334b1cfcdf4ebe039377c0d724d0abb51bcde68c9aaae1b93868e20b

@thehappycheese

I made this as a stopgap solution. It's a dumb solution, but it helped me actually get writes working and test out the library.

import pandas as pd
import deltalake as dl
from typing import Union

def strip_categorical(df: pd.DataFrame):
    """Convert categorical columns back into integer codes,
    and return a dataframe of the categories.

    Example:

    ```python
    (original_df, categories) = strip_categorical(df)
    ```"""
    categories = {}
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_categorical_dtype(df[col]):
            print(f"Converting categorical column to integer: '{col}' - {dict(enumerate(df[col].cat.categories))}")
            categories[col] = df[col].cat.categories
            df[col] = df[col].cat.codes
    # Wrap each category index in a Series so columns of different lengths can share a frame.
    return df, pd.DataFrame({col: pd.Series(cats) for col, cats in categories.items()})

def strip_duration_to_int(df: pd.DataFrame, to_int_unit: Union[str, dict[str, str]] = "ms"):
    """Convert Timedelta columns to integer types with the given unit.
    to_int_unit should be a string or a dictionary of column names to units.

    Example:

    ```python
    df, time_delta_cols = strip_duration_to_int(df, to_int_unit="ms")
    ```"""
    df = df.copy()
    time_delta_cols = {}
    for col in df.columns:
        if pd.api.types.is_timedelta64_dtype(df[col].dtype):
            col_to_int_unit = to_int_unit
            if isinstance(to_int_unit, dict):
                col_to_int_unit = to_int_unit[col]
            print(f"Converting Timedelta column to integer using unit '{col_to_int_unit}': '{col}'")
            time_delta_cols[col] = col_to_int_unit
            df[col] = df[col] // pd.Timedelta(1, unit=col_to_int_unit)
    return df, time_delta_cols

def write_delta(path, data, timedelta_to_int_unit: Union[str, dict[str, str]] = "ms", **kwargs):
    data, categories = strip_categorical(data)
    data, time_delta_cols = strip_duration_to_int(data, timedelta_to_int_unit)
    dl.write_deltalake(path, data, **kwargs)
    if len(categories) > 0:
        dl.write_deltalake(path + "_categories", categories, **kwargs)
    if len(time_delta_cols) > 0:
        # time_delta_cols is a plain dict; wrap it in a one-row DataFrame so it can be written.
        dl.write_deltalake(path + "_time_delta_cols", pd.DataFrame([time_delta_cols]), **kwargs)

@kangshung

Are there any plans to implement this?

@ion-elgreco (Collaborator)

> Are there any plans to implement this?

You can use `from polars.io.delta import _convert_pa_schema_to_delta`.
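A minimal sketch of that approach; note that `_convert_pa_schema_to_delta` is a private Polars helper, so its exact signature may change without notice:

```python
import pyarrow as pa
from polars.io.delta import _convert_pa_schema_to_delta

table = pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns"))})

# Convert the schema to a Delta-compatible one, then cast the table to match.
delta_schema = _convert_pa_schema_to_delta(table.schema)
table = table.cast(delta_schema)
```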

@kangshung

> You can use `from polars.io.delta import _convert_pa_schema_to_delta`.

What about the _check_for_unsupported_types() method that lists Categorical as an unsupported type? Why would it work without Polars if it doesn't work with Polars?

@ion-elgreco (Collaborator)

> What about the _check_for_unsupported_types() method that lists Categorical as an unsupported type? Why would it work without Polars if it doesn't work with Polars?

I don't see any categorical primitive types in here: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types

@kangshung commented Oct 23, 2023

> I don't see any categorical primitive types in here: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types

And that's the issue. Delta returns `deltalake.PyDeltaTableError: Schema error: Invalid data type for Delta Lake: Dictionary(Int8, Utf8)` for Categorical fields.

Here you have a method that raises an exception on Categorical fields in Polars: https://github.com/pola-rs/polars/blob/main/py-polars/polars/io/delta.py#L323-L329
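Until that's handled natively, one workaround is to decode Categorical columns to plain strings before writing (or keep the integer codes, as in the stopgap above). A sketch with an illustrative column name:

```python
import pandas as pd

df = pd.DataFrame({"kind": pd.Categorical(["a", "b", "a"])})

# Dictionary-encoded (categorical) columns trip the schema check,
# so materialize them as plain strings first.
df["kind"] = df["kind"].astype(str)
```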

@ion-elgreco (Collaborator)

> Here you have a method that raises an exception on Categorical fields in Polars: https://github.com/pola-rs/polars/blob/main/py-polars/polars/io/delta.py#L323-L329

I see. We could possibly port these things from Polars into delta-rs; I'll check with the Polars contributors. I'm not super familiar with licenses and all.

ion-elgreco added the binding/python (Issues for the Python package) label Nov 22, 2023
ion-elgreco added a commit that referenced this issue Nov 24, 2023
…iter/merge (#1820)

# Description
This ports some functionality that @stinodego and I had worked on in
Polars, where we converted a pyarrow schema to a compatible Delta
schema. It converts the following (a sketch follows the list):

- uint -> int
- timestamp(any timeunit) -> timestamp(us) 
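Illustrative only: a sketch of the kind of mapping such a helper performs (the actual delta-rs implementation may differ):

```python
import pyarrow as pa

# Hypothetical helper: map a pyarrow schema onto Delta-compatible types.
def to_delta_compatible(schema: pa.Schema) -> pa.Schema:
    uint_to_int = {
        pa.uint8(): pa.int8(),
        pa.uint16(): pa.int16(),
        pa.uint32(): pa.int32(),
        pa.uint64(): pa.int64(),
    }
    fields = []
    for field in schema:
        t = field.type
        if t in uint_to_int:
            t = uint_to_int[t]  # uint -> int
        elif pa.types.is_timestamp(t):
            t = pa.timestamp("us", tz=t.tz)  # any timeunit -> us
        fields.append(pa.field(field.name, t, nullable=field.nullable))
    return pa.schema(fields)
```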

I adjusted the functionality to do schema conversion from large to
normal types when necessary, which is still needed in MERGE as a
workaround for #1753.

Additional things I've added:

- Schema conversion for every input in write_deltalake/merge
- Add Pandas dataframe conversion
- Add Pandas dataframe as input in merge


# Related Issue(s)
- closes #686
- closes #1467

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
ion-elgreco added a commit to ion-elgreco/delta-rs that referenced this issue Nov 25, 2023
…iter/merge (delta-io#1820)
