-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove db call from DatabricksHook.__init__()
#20180
Remove db call from DatabricksHook.__init__()
#20180
Conversation
CC @kaxil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I wonder how many of those we will find in other operators..
BTW. for the multi-tenancy work we are planning to add testing the DB method calls and we are planning to run all the tests we have for all the hooks and operators through a test harness tthat will be catching basically any DB operations and flagging those that are "not expected".
I think that might be great opportunity to catch and fix all similar cases too. Would be as easy as to check if there is a db call in any of the initts
of any of the operators instantiated during tests. Sounds pretty doable.
The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease. |
After seeing this I poked around. There are a good number of instances where, at the very least, |
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
Yep. I suspected just that. Until we have an auto-detection of such cases this is a bit of an "uphill battle" (if it slipped through before, it will slip through in the future as well), but yeah i think we could do a one time effort to cleanup it long before we automate it. Or maybe you could automate that not by running but by what you've done via - essentially - static analysis (either our own or plugin to one of the existing checkers). That would be even better. We could have done that with pylint plugin but since we all (including myself) hate it, we should figure another way. |
Yeah agree, we do instantiate all provider classes somewhere in our test no?? I can’t remember where but that might be a good place to check this |
Unfortunately, we only import all.the classes :(. Not really instantiate them (that would be impossible as we do not know which parameters to pass to constructors). We (hopefully) instantiate all operators in the unit tests - so there we could (and we plan to for multi-tenancy/db isolation) tap into the db session created and warn if an operator performs an unexpected DB operation. But that is a bit involved because we have to pass through all the 'expected' calls as well to the real db because otherwise many tests will fail. For example if a test performs get_connection() in the execute() - it should be passed through during unit tests. We have the description coming in AIP-44 describing the test harness and what it is going to give us as this is a part of the solution. We will be be able to verify that community managed operators are all compliant with the DB isolation mode (and fix those that aren't). And we can also (i believe) include that check for db operations in init() for the operators as part of the test harness. We can basically check the stack trace and see if any BaseOperator's derived class is calling any db operation in their init. Static checks might also be helpful but they are limited - they can only (reliable) go as far as to check if certain methods (like get_connectiion) are called directly in the _init but any transitive calls will not be possible to track reliably. Only actually instantiating the operators (which can only be done in unit tests IMHO) can give us complete answer. It's also not 100% complete with tests as there might be some combination of parameters in constructors that will choose different,.untested paths, but hopefully unit tests are comprehensive enough to cover those in most cases. Also when we have the harness in place, it will be easy to reproduce and fix any problems with db isolation by adding more unit tests with those paths covered. I am super excited for this part especially - as i wanted to do that check on init'() for quite a while and with multi-tenancy we have finally a good reason to do it. Before that was not justified enough to extend our test harness. |
This PR aligns both the original PR to move the db call to the metadata database out of the
DatabricksHook.__init__()
method (#18339) and a refactoring PR for the same hook (#19835).^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.