Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

databricks.sql.exc.RequestError when inserting more than 255 cells via pandas.to_sql. #300

Open
mbelak-dtml opened this issue Dec 11, 2023 · 12 comments

Comments

@mbelak-dtml
Copy link

mbelak-dtml commented Dec 11, 2023

The following example works with databricks-sql-connector version 2.9.3, but fails with version 3.0.1:

import numpy as np
import pandas as pd 
from sqlalchemy import create_engine

sqlalchemy_connection_string = f"databricks://token:{access_token}@{host}?http_path={http_path}&catalog={catalog}&schema={schema}"
engine = create_engine(sqlalchemy_connection_string)
shape = (50, 16)
random_df = pd.DataFrame(
    data=np.random.standard_normal(size=shape),
    columns=[f"col{i}" for i in range(shape[1])]
)
table_name = "pandas_sqlalchemy_insert"
random_df.to_sql(table_name, engine, if_exists="replace", index=False, method="multi")

The error is: OperationalError: (databricks.sql.exc.RequestError) Error during request to server

With setting chunksize in random_df.to_sql to any value which is 15 or lower (to make the number of table cells inserted less than 256), the insert runs without issue.

What I have tried

  • Instead of random_df.to_sql(table_name, engine, if_exists="replace", index=False, method="multi") used:
with engine.connect() as conn:
    random_df.to_sql(table_name, conn, if_exists="replace", index=False, method="multi")

-- the observed behavior is the same in all cases.

  • Various other DataFrames and chunksizes. Anything under 256 cells works, anything with 256 or more cells fails with the described exception.

Versions

Not working: databricks-sql-connector==3.0.1, sqlalchemy==2.0.23, pandas==2.1.4
Working: databricks-sql-connector==2.9.3, sqlalchemy==1.4.50, pandas==2.1.4

@mbelak-dtml
Copy link
Author

mbelak-dtml commented Dec 11, 2023

Possibly related to how the Databricks SQLAlchemy dialect constructs queries, where databricks-sql-connector = ~3 uses parameters instead of directly injecting values into the INSERT statement, which version 2.9.3 did?

@susodapop
Copy link
Contributor

Thanks for your question. As of databricks-sql-connector==3.0.0 there is a major difference to how parameters are handled. This change (and the limitation of 255 parameters per query) is documented extensively here.

Possibly related to how the Databricks SQLAlchemy dialect constructs queries, where databricks-sql-connector = ~3 uses parameters instead of directly injecting values into the INSERT statement, which version 2.9.3 did?

This is pretty close. SQLAlchemy hasn't changed its approach, but the connector has. There are now two ways that the connector can send parameters: native and inline. Native is enabled by default and currently there is a limit at the server of no more than 255 parameters per query. While the older inline approach doesn't have this limitation, as documented here, the SQLAlchemy dialect in connector version 3 and above will only work with the native approach.

Unfortunately, the only workaround for this is to modify your to_sql call with a chunksize that guarantees that any individual request will not exceed 255 parameters. We are working with the server team to increase the limit of 255 parameters per request.

@mbelak-dtml
Copy link
Author

Thank you for the answer.

We are working with the server team to increase the limit of 255 parameters per request.

Is the limit on the number of parameters a limitation on the Databricks SQL endpoint side? Do you happen to know of any public information or information you can share (e.g. ETA) on this?

@susodapop
Copy link
Contributor

Yes the limitation is at the SQL warehouse. Still waiting to hear internally when this will be increased.

@newlandj
Copy link

newlandj commented Jan 3, 2024

Hi Team. I work at Procore Technologies and am coming across the same issue. I found that I had to set my particular query to a chunksize of 28 rows, which now makes sense due to the 255 cell limit.

This dramatically slows down the process of writing to Databricks, to the point where our intended use case may not work.

I also haven't seen anywhere in the docs where it shows examples of creating a SQL Alchemy Engine and using pd.to_sql / read_sql. Our code looks just like the code at the beginning of this thread, and it would be great if it was clearly documented somewhere how to use pd.to_sql / read_sql.

Alternately, it would be great if Databricks provided an optimized version of these functions with the connector. As an example, Snowflake offers the following functions with their connector: fetch_pandas_all, fetch_bandas_batches, write_pandas.

We'd like to use SQLAlchemy 2.0+, which I believe requires that we use the Databricks-sql-connector 3.0+, so here's a +1 to sorting this out quickly. Thanks!

@susodapop
Copy link
Contributor

I also haven't seen anywhere in the docs where it shows examples of creating a SQL Alchemy Engine and using pd.to_sql / read_sql.

It's part of the sqlalchemy documentation in this repository: https://github.com/databricks/databricks-sql-python/blob/main/src/databricks/sqlalchemy/README.sqlalchemy.md#usage-with-pandas

Alternately, it would be great if Databricks provided an optimized version of these functions with the connector.

Thanks for registering your support for this. Can you send an email to databricks-sql-connector-maintainers@databricks.com so we can follow-up with you?

@newlandj
Copy link

newlandj commented Jan 3, 2024

Thanks for the link to the docs! I would recommend you link to that page on the main docs page here, as I assumed they would include everything I needed to know.

I'll shoot an email to that address.

@susodapop
Copy link
Contributor

Those docs are being updated as we speak :)

@susodapop
Copy link
Contributor

Update for you all: we're re-classifying this as a regression in the latest version of databricks-sql-connector and working to implement a fix that doesn't rely on the server supporting > 255 params. More details soon.

@newlandj
Copy link

Hi team, any progress on this?

@akshay-s-ciq
Copy link

Hi, we're also facing the same issue. Could someone please provide an update on this? @susodapop

@susodapop
Copy link
Contributor

Hi @akshay-s-ciq, since I'm no longer maintaining this repository I'd recommend contacting Databricks support directly to get the current status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants