
Convert df to pyspark DataFrame if it is pandas before writing #469

Merged: 8 commits, Sep 20, 2022

Conversation

@dbeatty10 (Contributor) commented Sep 16, 2022

resolves #468

Description

Copies the solution by @chamini2 in dbt-labs/dbt-bigquery#301

Checklist

@cla-bot (bot) added the cla:yes label Sep 16, 2022
@github-actions (Contributor) commented:

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-spark contributing guide.

@dbeatty10 (Contributor, Author) commented:

Overview

I manually verified that the following didn't work before for the dbt-databricks adapter (which inherits from dbt-spark), and confirmed that it works using @chamini2's fix 👍

Code example
import pandas as pd

def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["pandas"]
    )

    df = pd.DataFrame(
        {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
        'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
        'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
        'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]}
        )

    return df

Details

dbt-databricks has its own implementation of py_write_table, which overrides the implementation provided by dbt-spark.

⚠️ As a result, Databricks users won't have support for Pandas DataFrames until one of the following happens:

  1. dbt-databricks deletes the custom implementation of py_write_table
  2. This implementation is copied into dbt-databricks also
  3. Some other update is added to dbt-databricks that enables Pandas DataFrames

We want to provide an equivalent implementation here so that dbt-databricks has the option to fully drop its implementation of py_write_table. So this PR adopts two changes from dbt-databricks's version of py_write_table (sketched after this list):

  1. .option("overwriteSchema", "true")
  2. # --- Autogenerated dbt materialization code. --- #
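For illustration, here is roughly how those two changes might show up in the generated write call (a sketch only, not the exact macro output; the file format and relation name are placeholders):

# --- Autogenerated dbt materialization code. --- #

# write out the Spark DataFrame, allowing the schema to change on overwrite
df.write \
    .mode("overwrite") \
    .format("delta") \
    .option("overwriteSchema", "true") \
    .saveAsTable("my_schema.my_python_model")  # placeholder relation name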

Follow-up

I will open an issue in dbt-databricks for this.

@dbeatty10 marked this pull request as ready for review September 17, 2022 22:57
@dbeatty10 added the ready_for_review label (Externally contributed PR has functional approval, ready for code review from Core engineering) Sep 17, 2022
@b-per (Contributor) left a comment

I think we can slightly update the code (but I could be wrong as well)

import importlib.util
package_name = 'pandas'
if importlib.util.find_spec(package_name):
    import pandas
@b-per (Contributor) commented Sep 19, 2022

Is this line required? (the import pandas one)

From what I see, in Databricks we might usually want to load pyspark.pandas rather than pandas (and I don't think it is required here)

Contributor Author

@b-per I believe it is required in order to test the type: isinstance(df, pandas.core.frame.DataFrame). I tried commenting out the single line with the import, and it did not work.

However, digging into your question did uncover something else...

Goal

For the py_write_table() macro, we want the return type to be pyspark.sql.dataframe.DataFrame.

Potential input types

There are three different data types we expect it to be [1]:

  1. pyspark.sql.dataframe.DataFrame (PySpark DataFrame)
  2. pandas.core.frame.DataFrame (Pandas DataFrame)
  3. pyspark.pandas.frame.DataFrame (Pandas-on-Spark DataFrame)

Are we handling each of the three cases?

  1. ✅ For the first case, it is already a pyspark.sql.dataframe.DataFrame, so no conversion necessary.
  2. ✅ For the second case, this PR infers if it is a pandas.core.frame.DataFrame and converts it using spark.createDataFrame() if so.
  3. ❌ However, the third case isn't handled yet! We can handle it by calling to_spark() on it [2].

I'll update this PR to include the third case; see the consolidated sketch after the footnotes below.


  [1] We could proactively raise an exception if it falls outside of these expected types (rather than sending it off to the database, where it will fall on its face and emit an unhelpful error message).
  [2] Thank you to @Adricarpin for this handy resource: https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45#64b7
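Putting the three cases together (plus the proactive exception from footnote [1]), a rough, illustrative sketch of the conversion logic, assuming df is the model's return value and spark is the active SparkSession:

import pandas
import pyspark.pandas
from pyspark.sql import DataFrame as SparkDataFrame

# convert df to pyspark.sql.dataframe.DataFrame before writing
if isinstance(df, SparkDataFrame):
    pass  # case 1: already a Spark DataFrame, nothing to do
elif isinstance(df, pandas.core.frame.DataFrame):
    df = spark.createDataFrame(df)  # case 2: pandas -> Spark
elif isinstance(df, pyspark.pandas.frame.DataFrame):
    df = df.to_spark()  # case 3: pandas-on-Spark -> Spark
else:
    raise Exception(f"{type(df)} is not a supported type for dbt Python materialization")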

Contributor

Yep, my bad, I thought isinstance was comparing with a string and not an object type.

Contributor Author

@b-per Very glad you asked, because now we have a more robust implementation as a result!

# convert to pyspark.sql.dataframe.DataFrame
if isinstance(df, pandas.core.frame.DataFrame):
    df = spark.createDataFrame(df)
elif isinstance(df, pyspark.pandas.frame.DataFrame):
@chamini2 commented Sep 19, 2022

I think this can happen outside the `if importlib.util.find_spec(package_name):` block, since it would run only with the pandas package installed.

Contributor Author

Thanks @chamini2 -- I manually tested that suggestion and pushed a commit 👍

@dbeatty10 (Contributor, Author) commented:

@ChenyuLInx see below for the Python model files that I used for manual testing. Example data is from the GeoPandas documentation.

Pandas DataFrame

# models/pandas_df.py

import pandas as pd


def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["pandas"]
    )

    df = pd.DataFrame(
        {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
        'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
        'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
        'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]}
        )

    return df

PySpark DataFrame

# models/pyspark_df.py

def model(dbt, session):
    dbt.config(
        materialized="table",
    )

    df = spark.createDataFrame(
        [
            ("Buenos Aires", "Argentina", -34.58, -58.66),
            ("Brasilia", "Brazil", -15.78, -47.91),
            ("Santiago", "Chile", -33.45, -70.66),
            ("Bogota", "Colombia", 4.60, -74.08),
            ("Caracas", "Venezuela", 10.48, -66.86),
        ],
        ["City", "Country", "Latitude", "Longitude"]
    )

    return df

Pandas-on-Spark DataFrame

# models/pandas_on_spark_df.py

import pyspark.pandas as ps


def model(dbt, session):
    dbt.config(
        materialized="table",
    )

    df = ps.DataFrame(
        {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
        'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
        'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
        'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]}
        )

    return df

Pandas-on-Spark DataFrame to Pandas DataFrame

# models/pandas_on_spark_df_to_pandas.py

import pyspark.pandas as ps


def model(dbt, session):
    dbt.config(
        materialized="table",
    )

    df = ps.DataFrame(
        {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
        'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
        'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
        'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]}
        )
    pdf = df.to_pandas()

    return pdf

import pandas

# convert to pyspark.sql.dataframe.DataFrame
if isinstance(df, pandas.core.frame.DataFrame):

But this has to happen inside of the `importlib.util.find_spec(package_name):` if, right? If not, you will be accessing the pandas module without it actually being imported when it is not present.

Comment on lines 46 to 56
import importlib.util
import pyspark.pandas
package_name = 'pandas'
if importlib.util.find_spec(package_name):
    import pandas

    # convert to pyspark.sql.dataframe.DataFrame
    if isinstance(df, pandas.core.frame.DataFrame):
        df = spark.createDataFrame(df)
    elif isinstance(df, pyspark.pandas.frame.DataFrame):
        df = df.to_spark()

Suggested change
-import importlib.util
-import pyspark.pandas
-package_name = 'pandas'
-if importlib.util.find_spec(package_name):
-    import pandas
-    # convert to pyspark.sql.dataframe.DataFrame
-    if isinstance(df, pandas.core.frame.DataFrame):
-        df = spark.createDataFrame(df)
-    elif isinstance(df, pyspark.pandas.frame.DataFrame):
-        df = df.to_spark()
+# convert to pyspark.sql.dataframe.DataFrame
+import importlib.util
+if importlib.util.find_spec('pandas'):
+    import pandas
+    if isinstance(df, pandas.core.frame.DataFrame):
+        df = spark.createDataFrame(df)
+import pyspark.pandas
+if isinstance(df, pyspark.pandas.frame.DataFrame):
+    df = df.to_spark()

This is how I would add the pyspark.pandas check

@dbeatty10 (Contributor, Author) commented Sep 19, 2022

Incorporated feedback from @chamini2, @ueshin, and @ChenyuLInx.

Added these features:

  • Also check the availability of pyspark.pandas, since it was introduced in Spark 3.2 and Databricks still supports DBR 9.1, which is based on Spark 3.1.2
  • Raise an exception if unable to convert to a Spark DataFrame (pyspark.sql.dataframe.DataFrame)

The net effect is:

  • No need to convert Spark DataFrames -- already the type we want
  • Convert pandas DataFrames to Spark DataFrame
  • Convert pandas-on-Spark DataFrames to Spark DataFrame
  • Raise an exception for all other cases

Example when there is an error:
[screenshot of the error output]
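For reference, the net effect might look roughly like this (a sketch, not the exact merged macro code; df and spark are assumed to already be in scope inside the materialization):

import importlib.util
from pyspark.sql import DataFrame as SparkDataFrame

# only import pandas / pyspark.pandas when they are actually available
pandas_available = importlib.util.find_spec("pandas") is not None
pyspark_pandas_available = importlib.util.find_spec("pyspark.pandas") is not None

if pandas_available:
    import pandas
    if isinstance(df, pandas.core.frame.DataFrame):
        df = spark.createDataFrame(df)  # pandas -> Spark

if pyspark_pandas_available:
    import pyspark.pandas
    if isinstance(df, pyspark.pandas.frame.DataFrame):
        df = df.to_spark()  # pandas-on-Spark -> Spark

if not isinstance(df, SparkDataFrame):
    raise Exception(f"{type(df)} is not a supported type for dbt Python materialization")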

@dbeatty10 (Contributor, Author) commented:

Incorporated some more feedback from @ueshin given on databricks/dbt-databricks#180

Specifically, preferentially convert pandas DataFrames to pandas-on-Spark DataFrames first (sketched below), since @ueshin shared that:

  • they know how to convert pandas DataFrames better than spark.createDataFrame(df)
  • and converting from pandas-on-Spark to Spark DataFrame has no overhead
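In code terms, the conversion path becomes pandas -> pandas-on-Spark -> Spark DataFrame, roughly like this (a sketch; the pandas_available/pyspark_available flags stand for the availability checks discussed above):

# prefer pandas-on-Spark as the intermediate step
if pyspark_available and pandas_available and isinstance(df, pandas.core.frame.DataFrame):
    df = pyspark.pandas.frame.DataFrame(df)  # pandas -> pandas-on-Spark

if pyspark_available and isinstance(df, pyspark.pandas.frame.DataFrame):
    df = df.to_spark()  # pandas-on-Spark -> Spark, no copy overhead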

@chamini2 left a comment

Looks great!

# since they know how to convert pandas DataFrames better than `spark.createDataFrame(df)`
# and converting from pandas-on-Spark to Spark DataFrame has no overhead
if pyspark_available and pandas_available and isinstance(df, pandas.core.frame.DataFrame):
    df = pyspark.pandas.frame.DataFrame(df)
@jtcohen6 (Contributor) commented Sep 20, 2022

At the risk of making this even more complex than it needs to be — I believe pyspark.pandas was introduced in Spark v3.2: https://www.databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

It won't be available in earlier versions. (The same functionality was available via the koalas package, which was the old codename for pandas-on-PySpark.)

I don't want us to get too-too clever with this logic, though! Could just look like a try/except here
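One way the try/except @jtcohen6 mentions could look (just a sketch of the idea, not code from this PR):

# degrade gracefully on Spark versions older than 3.2, which lack pyspark.pandas
try:
    import pyspark.pandas
    pyspark_pandas_available = True
except ImportError:
    pyspark_pandas_available = False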

Contributor Author

I think what will happen currently if the input is a pandas-on-Spark DataFrame but pyspark.pandas is not available:

  • msg = f"{type(df)} is not a supported type for dbt Python materialization"

What I believe will happen if the input is a pandas DataFrame but pyspark.pandas is not available:

  • df = spark.createDataFrame(df)

We can add in an attempt to import databricks.koalas so we are covering as many bases as possible. If we go that route, is there an environment we could test it out on?
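A hypothetical sketch of that koalas fallback (not part of this PR; the flag name is illustrative):

# hypothetical fallback for clusters where pyspark.pandas is unavailable
try:
    import databricks.koalas
    koalas_available = True
except ImportError:
    koalas_available = False

if koalas_available and isinstance(df, databricks.koalas.DataFrame):
    df = df.to_spark()  # koalas -> Spark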

Contributor

If we go that route, is there an environment we could test it out on?

We could spin up a Databricks cluster running an older Spark version (v3.1). This is also what will be running inside Dataproc — the latest Apache Spark release it supports is v3.1.

ueshin pushed a commit to databricks/dbt-databricks that referenced this pull request Sep 20, 2022
…fore writing (#181)

resolves #179

### Description

Per #180 (comment) removing `py_write_table` macro since dbt-labs/dbt-spark#469 is merged.
Successfully merging this pull request may close these issues.

[CT-1198] [Feature] support python model return pandas dataframe