Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AirflowClusterPolicyViolation Overhaul - Pause DAG, Make it so that DAG cannot be unpaused, still shows in DAG list view but has tooltip saying why can't be unpaused #18410

Open
2 tasks done
alex-astronomer opened this issue Sep 21, 2021 · 6 comments
Labels
kind:feature Feature Requests

Comments

@alex-astronomer
Copy link
Contributor

Description

Overview

AirflowClusterPolicyViolation behaves in very strange ways right now. I think that this would be an important feature to overhaul for users that have complex DAG definition requirements for tagging, owners, naming conventions, etc. Very useful for Airflow administrators that have many application teams using the same deployment of Airflow.

Current behavior

The AirflowClusterPolicyViolation exception when called from airflow_local_settings.py is that the DAG shows up still in the DAG list view (different from other import errors, where sometimes they will not show in this list any more). The DAG remains paused or unpaused depending on what state it was in before the DAG cluster policy was deployed. The DAG can still be scheduled and run, but all of the tasks within that DAG will fail with no errors and no logs. This silent failure is very confusing for developers that don't see the import error on their DAG.

New Expected Behavior and Overhaul

The behavior that I expect to see when the AirflowClusterPolicyViolation is thrown is that the DAG will become paused, and cannot be unpaused until the DAG adheres to the cluster policy. There will be a tooltip on the pause button that explains why the DAG cannot be unpaused. The import error will show in the DAG view as well as the DAG list view, solved by #17818. No tasks will be scheduled or run, and no DAGRuns will be scheduled or run, until the cluster policy is adhered to.

Use case/motivation

I want to offer more support to users that have many application teams working on the same deployment of Airflow. Part of data pipeline quality is making policies that teams are unable to violate. If the teams adhere to the policy, they can run their DAG.

Right now the behavior is very confusing and it's challenging to pause a DAG that violates a cluster policy. It is possible to pause a DAG within the cluster policy function before the exception is raised, but there exists a race condition if the DAG is unpaused again after that. The DAG will try to run its tasks between the time that the DAG is unpaused and the DAG policy function has time to pause it again. This creates even more confusing behavior.

I believe that a DAG should not be able to be run unless it adheres to the cluster policy.

Very open to discussion about the best way to solve this issue.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@alex-astronomer alex-astronomer added the kind:feature Feature Requests label Sep 21, 2021
@mik-laj
Copy link
Member

mik-laj commented Sep 21, 2021

CC: @jaketf

@jaketf
Copy link
Contributor

jaketf commented Sep 21, 2021

@alex-astronomer this sounds like a great improvement to me, and the use case you describe is the intention of the original AirflowClusterPolicyViolation PR. At the time the scope was just to make tasks not run but I agree these DAGs shouldn't even get scheduled and like the idea of having a tool tip explaining why you cannot unpause.
My only additional comment is that CLI / REST API PATCH requests trying to unpause a dag that does not obey the policy should get an error message about the policy violation as well (not just UI).

@eladkal
Copy link
Contributor

eladkal commented Sep 21, 2021

I believe that a DAG should not be able to be run unless it adheres to the cluster policy.

I agree but it seems a bit odd to me to mix paused/active to this.

Say I add a new policy that violated 50 DAGs. According to this all of them will be paused.
What if the action I decided to take is to remove the policy rather than fixing the DAGs? It will be very hard and frustrating to locating all the 50 DAGs that were automated paused.

@alex-astronomer
Copy link
Contributor Author

@eladkal Great point. I still do believe that the DAGs should be paused and unable to run in this situation. Here's my reason for that. If suddenly there is an AirflowClusterPolicy that is implemented that would disable 50 DAGs shouldn't that be done, just because if the policy is violated then we don't want to be able to run the DAGs anyway? I understand what you're saying is that it would be hard to know what DAGs were paused and it would be hard to revert. I think that a way that we could get around this shortcoming is audit logging for the DAGs that were paused and putting some logs saying what happened in the Logs section of Airlfow. What do you think about that?

@eladkal
Copy link
Contributor

eladkal commented Oct 4, 2021

@alex-astronomer I can argue that the DAGs should not be paused.
As cluster admin I add/remove/update policies. I don't always know what DAGs are going to be effected by a policy I just add it and then all effected users change their code accordingly.

If you prevent DAGs from running you have 2 edge cases:

  1. If a policy is removed - who would know to set the DAG back to active? How will they be notified?
  2. If a policy is changed (lets say first policy had a horrible exception message and the change made the message better) the user will never get the updated message because the DAG is paused and can not run.

I think the easiest way to overcome this is just having a configuration in airflow.cfg that lets users decide if they want to enable this feature or not.

@bmoon4
Copy link

bmoon4 commented Mar 28, 2022

@alex-astronomer @eladkal I agree with I think the easiest way to overcome this is just having a configuration in airflow.cfg that lets users decide if they want to enable this feature or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:feature Feature Requests
Projects
None yet
Development

No branches or pull requests

5 participants