-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Introducing prek hook to detect airflow-core imports in task-sdk #60676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introducing prek hook to detect airflow-core imports in task-sdk #60676
Conversation
ashb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few nits
This also won't catch function/method scoped imports. Should it be looking there too?
At one point ruff had an anayze or graph subcommand that nought give us the imports without needing to parse the full ast ourselves. I think another ci script uses that too. Might be worth doing it the same?
|
@ashb thanks for your review, I handled the Ran it by an example: (apache-airflow) ➜ airflow git:(prek-hook-to-check-core-imports) ✗ cat /tmp/examples.py
from airflow.sdk import Connection
def my_function():
from airflow.models import DagRun
import airflow.triggers.base
from airflow.sdk.types import RuntimeTaskInstance
def another_function():
if True:
from airflow.serialization import serialize
class MyClass:
def method(self):
import airflow.dag_processing
(apache-airflow) ➜ airflow git:(prek-hook-to-check-core-imports) ✗ python scripts/ci/prek/check_core_imports_in_sdk.py /tmp/examples.py
/tmp/examples.py:
Line 4: from airflow.models import DagRun
Line 5: import airflow.triggers.base
Line 11: from airflow.serialization import serialize
Line 15: import airflow.dag_processing
Found 4 core import(s) in task-sdk files
|
|
Ah cool |
Not too sure about that but we already have prek hooks that follow pattern like the one in this PR, I could explore if needed but the result would be the same 😅 |
|
It might be worth adding explicit line based ignores rather than whole files. Lgtm as it is though |
|
Let me consider that too, this one should be good as an immediate prevention |
We already got rid of it because it did not work well with how prek works. The big difference of what ruff does versus what we can do here is that ruff does not generate dependencies to files it did not analyse - and we know the convemtion (from airflow.* from airlfow.sdk*) we are looking for - so we can analyse individual files that are passed to prek rather than having to pre-load all source code - this is what ruff has to do in order to generate "complete" dependency graph. And this means that to check single file, you need to analyze all of them. See this output to know what I am talking about (a little trimmed down. From that you will see that you need to load all task-sdk + all airflow-core + all providers - so basically all files. to see if you have not used some pendencies from "other" files" - this means (on my machine) 0.47s overhead for just "airflow-core + task-sdk" folders, 2s for "airflow-core + task-sdk + providers", 3s for "." (i.e including all tests, dev and other potentially included files). And it means we have to pay this overhead even if we change one file - i.e. every single commit where a .py file changes, will take 2s longer if we use it. Also the code to use it would also be a little complex - we would have to keep exclusion rules inside the code rather than in .pre-commit-config.yaml file - becuase we could only exclude files from the output json if after all of it is generated. AST + our prek script is way faster for incremental checks. When I removed the decorators/init.py from the config the time to analyse it was 0.8s for all files, and 0.15s when only one file is checked - so it beats the "full check" for incremental check a lot - and this is what really matters for pre-comit - 0.4s for "all files" in CI is nothing, while 1.8s for every commit locally matters a lot. |
Was generative AI tooling used to co-author this PR?
Generated-by: [Cursor IDE] following the guidelines
As we are approaching client server separation soon, we want to prevent working on a moving target. ie: if contributors continue adding airflow-core imports in sdk, it will make our lives harder trying to remove them. Adding a prevention layer for it through a prek hook. The excluded files contain such imports and need to be refactored to remove them
{pr_number}.significant.rstor{issue_number}.significant.rst, in airflow-core/newsfragments.