Appending a local Pandas dataframe to a Delta table in Azure is slow #450

Open
@beefupinetree

Description

I followed the instructions on this page to create a SQLAlchemy engine and used it with the Pandas to_sql() method. Appending a single row to a Delta table in Azure Databricks takes around 2 seconds, and the time seems to scale linearly with row count: a dataframe with 1 column and 10 rows takes ~20 seconds to push to Azure.

Is there a way to make writing back to Databricks from a local machine faster?

Example code:

from sqlalchemy import create_engine
import pandas as pd
import os

server = 'my_server.azuredatabricks.net'
http_path = "/sql/1.0/warehouses/my_warehouse"
access_token = "MY_TOKEN"
catalog = "my_catalog"
schema = "my_schema"

if "NO_PROXY" in os.environ:
    os.environ["NO_PROXY"] = os.environ["NO_PROXY"] + "," + server
else:
    os.environ["NO_PROXY"] = server

if "no_proxy" in os.environ:
    os.environ["no_proxy"] = os.environ["no_proxy"] + "," + server
else:
    os.environ["no_proxy"] = server

engine = create_engine(f"databricks://token:{access_token}@{server}?" +
        f"http_path={http_path}&catalog={catalog}&schema={schema}"
)

df = pd.DataFrame({'name': ['User 1', 'User 2', 'User 3', 'User 4', 'User 5',
                            'User 6', 'User 7', 'User 8', 'User 9', 'User 10']})

# The next command takes around 20 seconds to complete
df.to_sql(name='my_test_table', con=engine, if_exists='append', index=False)
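
Note: to_sql() also accepts method='multi' and chunksize, which bundle many rows into a single multi-row INSERT ... VALUES statement instead of issuing one statement per row. A minimal sketch, assuming the Databricks SQLAlchemy dialect accepts multi-row VALUES clauses (the chunk size of 1000 is an arbitrary starting point, not a recommendation):

# Batched variant of the append above. Assumes the Databricks dialect
# supports multi-row INSERT ... VALUES statements.
df.to_sql(
    name='my_test_table',
    con=engine,
    if_exists='append',
    index=False,
    method='multi',   # build one INSERT per chunk instead of per row
    chunksize=1000,   # number of rows bundled into each INSERT
)

This trades many small round trips for a few larger statements, so any benefit should grow with row count.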

Local specs:
databricks-sql-connector==3.4.0
pandas==2.2.2
Python 3.10.14

Azure specs:
Databricks SQL warehouse cluster
Runtime==13.3 LTS
