
In MsSqlHook, SQLAlchemy engine scheme is overridden by the change in #40669 #42664

Closed
2 tasks done
emredjan opened this issue Oct 2, 2024 · 15 comments · Fixed by #42705
Comments

@emredjan

emredjan commented Oct 2, 2024

Apache Airflow Provider(s)

common-sql

Versions of Apache Airflow Providers

apache-airflow-providers-common-sql==1.16.0
apache-airflow-providers-microsoft-mssql==3.9.0
apache-airflow-providers-odbc==4.7.0

Apache Airflow version

2.10.2

Operating System

RHEL 8.10

Deployment

Virtualenv installation

Deployment details

No response

What happened

We're using mssql+pyodbc connections in Airflow to connect to our MS SQL Server. We create a hook using MsSqlHook and call .get_sqlalchemy_engine() to get a SQLAlchemy engine object to pass along to our database tasks.
Prior to apache-airflow-providers-common-sql 1.15.0 this worked as expected (the last known working version is 1.14.1; see below for a working example).
With a change introduced in 1.15.0 (#40669), .get_sqlalchemy_engine() passes the creator argument with the value self.get_conn to the create_engine() function. This overrides engine creation with the default connection scheme, which uses pymssql, even though we specifically define the scheme as pyodbc both in the connection and in the hook.

The resulting engine object becomes a weird combination of the two: it has an mssql+pyodbc scheme in the connection URI but internally tries to connect using the pymssql dialect, which results in fatal errors:

[screenshot: error traceback from the mixed pyodbc/pymssql engine]

It's as if it's trying to make a pymssql connection, but with a pyodbc URL.

What you think should happen instead

The connection should work, as it does prior to the change.

[screenshot: successful connection output prior to the change]

If we comment out or delete the line engine_kwargs["creator"] = self.get_conn in ‎airflow/providers/common/sql/hooks/sql.py (f6c7388#diff-6e1b2f961cb951d05d66d2d814ef5f6d8f8bf8f43c40fb5d40e27a031fed8dd7R246), the connection works as expected.

How to reproduce

This is the connection that's used (values changed for privacy)

Basically use any MSSQL connection with a mssql+pyodbc scheme

{
    "conn_type": "mssql",
    "host": "<redacted>",
    "login": "user",
    "password": "pass",
    "schema": "master",
    "port": 1433,
    "extra": {
        "driver": "ODBC Driver 18 for SQL Server",
        "encrypt": "yes",
        "sqlalchemy_scheme": "mssql+pyodbc"
    }
}
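For a quick local test, this connection can also be registered via an environment variable (Airflow 2.3+ accepts JSON-format connection values; the id conn1 and the host are placeholders for illustration):

```shell
# Hypothetical setup: registers the connection above under the id "conn1".
export AIRFLOW_CONN_CONN1='{
    "conn_type": "mssql",
    "host": "mssql.example.com",
    "login": "user",
    "password": "pass",
    "schema": "master",
    "port": 1433,
    "extra": {
        "driver": "ODBC Driver 18 for SQL Server",
        "encrypt": "yes",
        "sqlalchemy_scheme": "mssql+pyodbc"
    }
}'
```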

Try to connect through a hook and sqlalchemy engine:

from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook
hook = MsSqlHook(mssql_conn_id='conn1')  # conn1: the id for above connection
#hook = MsSqlHook(mssql_conn_id='conn1', sqlalchemy_scheme='mssql+pyodbc') # optional, same result
engine = hook.get_sqlalchemy_engine()
engine.connect()

This fails on common-sql >= 1.15.0.
It works on <= 1.14.1.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@emredjan emredjan added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Oct 2, 2024
@potiuk potiuk added good first issue and removed needs-triage label for new issues that we didn't triage yet labels Oct 2, 2024
@dabla
Contributor

dabla commented Oct 3, 2024

Weird, because we are also using MSSQL with ODBC and are using the get_sqlalchemy_engine() method as well and having the same connection config as you posted without issues.

@emredjan
Author

emredjan commented Oct 3, 2024

Weird, because we are also using MSSQL with ODBC and are using the get_sqlalchemy_engine() method as well and having the same connection config as you posted without issues.

Are you using "sqlalchemy_scheme": "mssql+pyodbc" in the connection or the hook? Or do you just use the ODBC driver with the default SQLAlchemy scheme (mssql+pymssql)?

@dabla
Contributor

dabla commented Oct 3, 2024

This error is weird, but now I understand why: what you are doing is wrong. In the past it wouldn't have caused any issues, but now it does.

Don't do what you show above:

from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook
hook = MsSqlHook(mssql_conn_id='conn1')  # conn1: the id for above connection
#hook = MsSqlHook(mssql_conn_id='conn1', sqlalchemy_scheme='mssql+pyodbc') # optional, same result
engine = hook.get_sqlalchemy_engine()
engine.connect()

But do this instead, as it will work and is also a better way of doing it:

from airflow.providers.common.sql.hooks.sql import DbApiHook

hook = DbApiHook.get_hook(conn_id='conn1')
engine = hook.get_sqlalchemy_engine()
engine.connect()

I will explain why your example now fails. As you stated, you access MSSQL through an ODBC connection (and thus the ODBC driver), but in your example you use MsSqlHook instead of OdbcHook, so you are trying to use an ODBC connection with a pymssql-based hook (i.e. MsSqlHook), which of course won't work. It may have worked in the past, but then you were lucky; now it won't, because get_sqlalchemy_engine is a generic method in DbApiHook that is no longer overridden in MsSqlHook.

Important: it is good practice not to instantiate the specialized hook yourself; it's better to use the DbApiHook.get_hook(conn_id="conn_id") classmethod, as it returns the specialized hook for the connection type of the connection id you pass. Depending on the connection type defined in the connection, Airflow is smart enough to return the correct hook.

So please try my suggestion and let me know.

@dabla
Contributor

dabla commented Oct 3, 2024

Weird, because we are also using MSSQL with ODBC and are using the get_sqlalchemy_engine() method as well and having the same connection config as you posted without issues.

Are you using "sqlalchemy_scheme": "mssql+pyodbc" in the connection or the hook? Or do you just use the ODBC driver with the default SQLAlchemy scheme (mssql+pymssql)?

Yes, I also defined "sqlalchemy_scheme": "mssql+pyodbc" in the extra of the connection as you showed above. The configuration of your connection is correct, but the instantiation of the hook isn't; please check my answer above.

@emredjan
Author

emredjan commented Oct 3, 2024

Tried this:

from airflow.providers.common.sql.hooks.sql import DbApiHook
hook = DbApiHook.get_hook(conn_id='conn1')
engine = hook.get_sqlalchemy_engine()
engine.connect()

It results in the same error as above: AttributeError: 'pymssql._pymssql.Connection' object has no attribute 'add_output_converter'. (This is an error from SQLAlchemy when it internally passes a pymssql connection object into pyodbc dialect code.)

While I understand DbApiHook.get_hook() is the proper way to instantiate a database hook, it just creates an MsSqlHook with the same parameters as above. In practice, nothing changes on the engine creation side with .get_sqlalchemy_engine().

As I don't use any other databases, I'm not entirely sure about the benefits of passing the connection to the engine with the creator argument (engine_kwargs["creator"] = self.get_conn), but it causes all other connection parameters to be ignored. Since get_conn just returns the default MS SQL connection using pymssql, it creates a basic connection, and any other scheme or parameter in the connection is bypassed. See the get_conn definition:

def get_conn(self):
    """Return a connection object."""
    db = self.get_connection(self.get_conn_id())
    return self.connector.connect(host=db.host, port=db.port, username=db.login, schema=db.schema)

From SQLAlchemy create_engine() docs:

[screenshot: SQLAlchemy create_engine() documentation for the creator argument]

The resulting engine is a weird object with a pyodbc driver and dialect but an underlying pymssql connection.
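The effect of creator can be reproduced outside Airflow. The following sketch uses SQLite in place of MSSQL so it stays self-contained: once creator is passed, create_engine keeps the URL's dialect but takes every actual DBAPI connection from the callable, bypassing the URL's connection arguments.

```python
import sqlite3

from sqlalchemy import create_engine, text

# The URL selects the dialect, but `creator` supplies every DBAPI
# connection, so the database path in the URL is never actually used.
engine = create_engine(
    "sqlite:///this_file_is_never_opened.db",
    creator=lambda: sqlite3.connect(":memory:"),
)

with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()
print(result)
```

With matching SQLite pieces this works, but in the issue above the URL dialect (pyodbc) and the creator's connection (pymssql) disagree, which is exactly the mismatch that blows up.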

@dabla
Contributor

dabla commented Oct 3, 2024

(quoting the previous comment in full)

Ok, now I'm confused... Are you trying to access MSSQL through ODBC or pymssql? In your initial post you stated ODBC, but now you're referring to pymssql. If you want to connect through ODBC, make sure the connection type is odbc, not mssql, which uses pymssql underneath.

@dabla
Contributor

dabla commented Oct 3, 2024

This config is also contradictory:

{
    "conn_type": "mssql",
    "host": "<redacted>",
    "login": "user",
    "password": "pass",
    "schema": "master",
    "port": 1433,
    "extra": {
        "driver": "ODBC Driver 18 for SQL Server",
        "encrypt": "yes",
        "sqlalchemy_scheme": "mssql+pyodbc"
    }
}

The connection type is mssql (so pymssql), but in the extras you specify the ODBC driver and the "mssql+pyodbc" sqlalchemy_scheme, while the mssql connection type actually uses pymssql, not ODBC. Try changing the connection type to odbc.

@emredjan
Author

emredjan commented Oct 3, 2024

I'm trying to access MSSQL through the Microsoft ODBC driver, using the mssql+pyodbc SQLAlchemy driver/dialect. I'm aware that the MSSQL connection uses the pymssql driver/dialect underneath, but you were always able to override that when getting SQLAlchemy engine objects via the sqlalchemy_scheme argument, which as far as I'm aware is the exact reason this argument exists.

It should be possible to change the driver in the MSSQL connections without changing the connection type to ODBC. See MSSQL provider docs:

https://airflow.apache.org/docs/apache-airflow-providers-microsoft-mssql/stable/_api/airflow/providers/microsoft/mssql/hooks/mssql/index.html

This was always possible, and according to the documentation it should still be an available option, but this change breaks existing connections that use a different driver.

@dabla
Contributor

dabla commented Oct 3, 2024

I'm trying to access MSSQL through the Microsoft ODBC driver, using the mssql+pyodbc SQLAlchemy driver/dialect. I'm aware that the MSSQL connection uses the pymssql driver/dialect underneath, but you were always able to override that when getting SQLAlchemy engine objects via the sqlalchemy_scheme argument, which as far as I'm aware is the exact reason this argument exists.

It should be possible to change the driver in the MSSQL connections without changing the connection type to ODBC. See MSSQL provider docs:

https://airflow.apache.org/docs/apache-airflow-providers-microsoft-mssql/stable/_api/airflow/providers/microsoft/mssql/hooks/mssql/index.html

As far as I know, you should always use odbc as the connection type when using ODBC connections, not a native connection type that you then override to ODBC. @potiuk what do you think? Or am I misunderstanding this?

This is also the reason I'm working on this PR: it introduces a notion of dialects (used by the insert_rows method in DbApiHook) that can be shared across different connection types pointing to the same database. The dialect that generates the replace statement (e.g. MERGE INTO for MSSQL) would then work for both pymssql and ODBC with MSSQL, because the dialect would no longer be tied to the connection type. Tying it to the connection type is not feasible for a generic type like odbc anyway: OdbcHook isn't database-aware (the driver is), while MsSqlHook is specific to MSSQL. At the moment the replace statement is only supported in the native MsSqlHook, but this PR will make it work independently of which connection type you use (native/odbc).

@emredjan
Author

emredjan commented Oct 3, 2024

Think about this:

I modified the connection to remove the "contradiction":

{
    "conn_type": "mssql",
    "host": "<redacted>",
    "login": "user",
    "password": "pass",
    "schema": "master",
    "port": 1433,
    "extra": {
        "driver": "ODBC Driver 18 for SQL Server",
        "encrypt": "yes"
    }
}

I then created an MsSqlHook, overriding sqlalchemy_scheme in the hook instantiation instead, which is an allowed option according to the documentation:

[screenshot: MsSqlHook documentation for the sqlalchemy_scheme parameter]

from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook
hook = MsSqlHook(mssql_conn_id='conn1', sqlalchemy_scheme='mssql+pyodbc')
engine = hook.get_sqlalchemy_engine()

engine.connect()

^ This does not work either, which means this change is a breaking one.

Using pymssql as the internal connection is fine, but the sqlalchemy_scheme argument in the hook should be respected by the get_sqlalchemy_engine and get_sqlalchemy_connection methods. Changing the connection type to odbc should not be needed, in my opinion, as the abstraction is done by SQLAlchemy anyway, which allows different connection dialects with MSSQL (mssql+pyodbc, mssql+pymssql, etc.).
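That SQLAlchemy-level abstraction can be seen directly: the backend/driver pair is parsed purely from the URL scheme, so the same host, port, and database can be addressed through either dialect. A minimal sketch (the URL values are placeholders):

```python
from sqlalchemy.engine import make_url

# The scheme alone determines the backend/driver pair; everything after
# "://" is shared, which is why one connection definition can serve
# either dialect. make_url does not import the DBAPI, so neither
# pyodbc nor pymssql needs to be installed to run this.
pairs = []
for scheme in ("mssql+pyodbc", "mssql+pymssql"):
    url = make_url(f"{scheme}://user:pass@host:1433/master")
    pairs.append((url.get_backend_name(), url.get_driver_name()))
print(pairs)
```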

If this is the way MSSQL connections in Airflow will move forward, allowing no driver other than pymssql, then this was never communicated as such in the changelogs and documentation. More changes would be needed (like removing the sqlalchemy_scheme override option in the mssql provider), and it would be better communicated as a breaking change for these use cases.

@dabla
Contributor

dabla commented Oct 3, 2024

(quoting the previous comment in full)

It should actually be like this:

{
    "conn_type": "odbc",
    "host": "<redacted>",
    "login": "user",
    "password": "pass",
    "schema": "master",
    "port": 1433,
    "extra": {
        "driver": "ODBC Driver 18 for SQL Server",
        "encrypt": "yes",
        "sqlalchemy_scheme": "mssql+pyodbc"
    }
}

This is what we are using and works.

@dabla
Contributor

dabla commented Oct 3, 2024

I've dug a bit deeper into the following line of code:

engine_kwargs["creator"] = self.get_conn

This was added to support JDBC connections with SQLAlchemy. I could remove it from DbApiHook and override it in JdbcHook; then you would still be able to use it as you intended.
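The proposed split could look roughly like this. This is a hypothetical sketch, not the actual provider code: the class names, the sqlalchemy_url attribute, and the SQLite stand-ins are all invented so the example is runnable.

```python
import sqlite3

from sqlalchemy import create_engine, text


class DbApiHookSketch:
    # Assumption: one URL attribute stands in for Airflow's connection lookup.
    sqlalchemy_url = "sqlite:///:memory:"

    def get_sqlalchemy_engine(self, engine_kwargs=None):
        engine_kwargs = dict(engine_kwargs or {})
        # Base class: no `creator`, so the URL's dialect/driver are respected.
        return create_engine(self.sqlalchemy_url, **engine_kwargs)


class JdbcHookSketch(DbApiHookSketch):
    def get_conn(self):
        # Stand-in for a real JDBC DBAPI connection.
        return sqlite3.connect(":memory:")

    def get_sqlalchemy_engine(self, engine_kwargs=None):
        engine_kwargs = dict(engine_kwargs or {})
        engine_kwargs["creator"] = self.get_conn  # only the JDBC hook injects creator
        return super().get_sqlalchemy_engine(engine_kwargs)


with JdbcHookSketch().get_sqlalchemy_engine().connect() as conn:
    print(conn.execute(text("select 1")).scalar())
```

The point of the shape is that only the hook that genuinely needs a custom connection factory pays for it; every other hook keeps the URL-driven engine creation.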

@emredjan
Author

emredjan commented Oct 3, 2024

(quoting the odbc connection configuration suggested above)

This works, yes (I didn't test beyond connecting the engine, but at least there's no failure, and it uses the correct driver and connection type). But it doesn't change the fact that it should work for the mssql connection type as well, and currently it doesn't.

I've dug a bit deeper into the following line of code:

engine_kwargs["creator"] = self.get_conn

This was added to support JDBC connections with SQLAlchemy. I could remove it from DbApiHook and override it in JdbcHook; then you would still be able to use it as you intended.

I believe that would be the correct approach. This JDBC specific change should not break existing connection types.

@dabla
Contributor

dabla commented Oct 3, 2024

I believe that would be the correct approach. This JDBC specific change should not break existing connection types.

Indeed, I fully agree that it should only be passed where needed (e.g. JdbcHook).

@dabla
Contributor

dabla commented Oct 3, 2024

The PR succeeded, so it can be merged.
