DbApiHook.insert_rows unnecessarily restarting connections #40609

Comments
@dabla -> can you please take a look? I am not sure, but I think this is a side effect of:

```python
@contextmanager
def _create_autocommit_connection(self, autocommit: bool = False):
    """Context manager that closes the connection after use and detects if autocommit is supported."""
    with closing(self.get_conn()) as conn:
        if self.supports_autocommit:
            self.set_autocommit(conn, autocommit)
        yield conn
```
Actually it's a side effect of #38528

@dabla -> the root cause of the problem is that the Connection is now retrieved every time the "placeholder" property is accessed. Would you like to take a stab at fixing it?

Will check it today

I could make this a @cached_property, but I don't know if that will fix the issue

Created a PR for this issue that should fix it.
Yeah. I thought about it, and cached property is "good enough". To @plutaniano and @dabla -> this did not actually cause restarting of the connection that often (so it was not that bad). What it did was retrieve the Connection object from the secrets backend and read its placeholder. That had the side effect of making everything slower by a) printing logs and b) accessing the secrets manager / DB to read the connection extra. Turning placeholder into a cached_property fixes both.
No, the message you see has indeed nothing to do with database connections; it is just retrieving the connection details from Airflow each time, which allow you to create a database connection. But anyway, it will be a good improvement nonetheless.
Correct. No new connection. But it is much slower now because:
- it prints logs on every access
- it reads the Connection (and its extra) from the secrets manager / DB every time

So yeah - caching the property solves both problems: it speeds things up and makes them far less costly
Ah, also logging might be expensive (money) as well :D depends on whether you use a remote logging solution and whether it charges "per message".

Completely agree on that, it will cost both in performance and in money (disk). Hopefully, in the future, AIP-59 will help us detect such regressions/side-effects ;)

Indeed ... Cases like that are very difficult to spot with regular unit-testing/code reviews - this one was a side effect going three levels deep + it's not obvious that accessing the placeholder property retrieves the Connection behind the scenes.

cc: @bjankie1 :D ^^

Thanks a lot, guys. Really appreciate the attention put into this.
For anyone who has the same problem, this should work as a temporary fix while 2.9.3 is not out. Just import these hooks instead of the ones from the providers:

```python
from functools import cached_property

from airflow.providers.mysql.hooks.mysql import MySqlHook as _MySqlHook
from airflow.providers.postgres.hooks.postgres import PostgresHook as _PostgresHook


class MySqlHook(_MySqlHook):
    @cached_property
    def placeholder(self):
        return super().placeholder


class PostgresHook(_PostgresHook):
    @cached_property
    def placeholder(self):
        return super().placeholder
```
You can also downgrade the common.sql provider to 1.11.1, which did not have placeholder configurable (it was added in 1.12.0), or upgrade to a new common.sql provider that will be released soon. (I am thinking, @eladkal - maybe we should release an ad-hoc common.sql because of that before the 2.9.3 release?)
I plan to cut a provider wave tomorrow
@potiuk @eladkal
Maybe we should cache the connection within the Hook instance so it can be reused without having to worry about which property is using it? The problem is that get_connection is a classmethod, and I would not want to cache the result of the lookup in a static class variable - that isn't a good idea. It would be better if it were cached at the instance level of the Hook, but that would mean we would need to change the signature of the get_connection method in BaseHook, roughly as sketched below.
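For illustration, a minimal sketch of the signature change being described - moving get_connection from a classmethod to an instance method so that its result could be cached per hook instance. The class names here are made up, and this is not the actual proposed diff:

```python
from airflow.models.connection import Connection


class BaseHookBefore:
    # Current signature: a classmethod, so the looked-up Connection
    # cannot be cached on the hook instance that calls it.
    @classmethod
    def get_connection(cls, conn_id: str) -> Connection: ...


class BaseHookAfter:
    # Proposed signature: an instance method, which would let the
    # Connection be stored on (and reused by) the hook instance.
    def get_connection(self, conn_id: str) -> Connection: ...
```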
Ah .. bummer. But we can fix it in the next round - it's very localized, and as we know it's just lower performance of the "insert_rows" call.

What do you think of the proposed solution above? Or is this too invasive?

A bit too invasive, I think. This actually changes the semantics of the methods - someone could rely on the fact that they are returning a new connection object every time. I think maybe a variation of that - add an optional flag that defaults to False.

Or even better - add a "get_connection_extra" method that will set that flag - this way anyone who wants to just retrieve the extra will use that method - then we will not have to remember to set the flag to True.

Good idea, I think I saw something similar in JdbcHook already, will do that instead.
Something like that in DbApiHook maybe (see the sketch below):
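A hypothetical sketch of what such a helper could look like; the method name get_connection_extra comes from the suggestion above, but the caching strategy and class shape are assumptions:

```python
from typing import Optional

from airflow.hooks.base import BaseHook
from airflow.models.connection import Connection


class DbApiHookSketch(BaseHook):
    """Illustration only - not the actual DbApiHook code."""

    def __init__(self, conn_id: str):
        super().__init__()
        self.conn_id = conn_id
        self._cached_connection: Optional[Connection] = None

    def get_connection_extra(self) -> dict:
        # Resolve the Airflow Connection once per hook instance and
        # reuse it, instead of hitting the secrets backend on every
        # access to a field such as "placeholder".
        if self._cached_connection is None:
            self._cached_connection = self.get_connection(self.conn_id)
        return self._cached_connection.extra_dejson
```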
Will wait until #40665 is merged, as then I can also use the get_conn_id method, which is cleaner.
PR #40751 will even go further and cache the connection on the DbApiHook instance; as some hooks were already doing this, it has now become a property in DbApiHook.
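A minimal sketch of what caching the connection as a property on the hook instance could look like, assuming a cached_property-style approach; this is an illustration, not the code from PR #40751:

```python
from functools import cached_property

from airflow.hooks.base import BaseHook
from airflow.models.connection import Connection


class CachedConnectionHookSketch(BaseHook):
    """Illustration only."""

    def __init__(self, conn_id: str):
        super().__init__()
        self.conn_id = conn_id

    @cached_property
    def connection(self) -> Connection:
        # Resolved once per hook instance; later reads of placeholder,
        # schema, etc. can reuse the same Connection object.
        return self.get_connection(self.conn_id)
```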
Some help/hints/tips on the randomly failing tests in PR #40751 would be handy, so we can merge the PR and also close this issue with it.

They look like real compatibility issues with Airflow 2.7 - 2.9
Discussed in #40608
Originally posted by @plutaniano on July 4, 2024
Apache Airflow Provider(s)

common-sql

Versions of Apache Airflow Providers

Apache Airflow version

2.9.2

Operating System

MacOS Sonoma 14.5 (docker host)

Deployment

Docker-Compose

Deployment details

I'm using the official Airflow docker-compose.yaml + a MySQL database; details in the reproduction steps.

What happened

The database connection is restarted multiple times during a single DbApiHook.insert_rows call.

What you think should happen instead

DbApiHook.insert_rows should create and maintain a single db connection.

How to reproduce
Create a temporary test project.
Add the following mysql db to the docker-compose file.
Run the docker compose.
Add the following connections to Airflow using docker exec -it airflow-airflow-triggerer-1 bash.
Then open a python shell and execute the following scripts:
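The reproduction script itself is only outlined above, so here is a minimal sketch of an equivalent one for Postgres; the connection id, table name, and row count are assumptions:

```python
# Hypothetical reproduction script - watch the database logs while it
# runs to see the connection being re-created during the inserts.
from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id="postgres_test")  # assumed conn id
rows = [(i, f"name_{i}") for i in range(5000)]
hook.insert_rows(table="test_table", rows=rows, target_fields=["id", "name"])
```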
And for MySQL:
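Again a sketch rather than the original snippet, with the same assumptions:

```python
from airflow.providers.mysql.hooks.mysql import MySqlHook

hook = MySqlHook(mysql_conn_id="mysql_test")  # assumed conn id
rows = [(i, f"name_{i}") for i in range(5000)]
hook.insert_rows(table="test_table", rows=rows, target_fields=["id", "name"])
```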
Both scripts will open up multiple connections to the database while inserting, instead of maintaining just one. Postgres seems to recreate the connection every 1000 inserts; mysql does it after every insert.
Postgres:
MySQL:
Anything else
No response
Are you willing to submit PR?