Skip to content

Conversation

@jason810496
Copy link
Member

related: #42360

Why

Since #47062 was reverted in #47402, this PR will also address the full test issue here.

@boring-cyborg boring-cyborg bot added the area:API Airflow's REST/HTTP API label Mar 6, 2025
@Lee-W Lee-W added the full tests needed We need to run full set of tests for this PR to merge label Mar 6, 2025
@Lee-W Lee-W self-requested a review March 6, 2025 02:24
@Lee-W Lee-W force-pushed the feature/AIP-84/auth/dags branch from 5482c3e to e7c9e29 Compare March 6, 2025 09:36
@pierrejeambrun
Copy link
Member

pierrejeambrun commented Mar 6, 2025

As expected full tests break and need fixing. Also while testing the UI I realized that the server is not happy with this change (can't load dags internal error 500) and we will need to wait for @vincbeck fix. FabAuthManager is crashing on get_permitted_dag_ids. This is a bug in the auth manager implementation and not detected by our tests at the moment because unit tests use the SimpleAuthManager where the dev setup use FabAuthManager by default.

@jason810496
Copy link
Member Author

As expected full tests break and need fixing. Also while testing the UI I realized that the server is not happy with this change (can't load dags internal error 500) and we will need to wait for @vincbeck fix. FabAuthManager is crashing on get_permitted_dag_ids. This is a bug in the auth manager implementation and not detected by our tests at the moment because unit tests use the SimpleAuthManager where the dev setup use FabAuthManager by default.

Just to double-check, do you mean that for this PR, I don’t need to make any changes but should wait for FabAuthManager to be fixed and then rebase to verify everything works correctly?

If that’s the case, should #47388 also be reverted? ( Since you mentioned that the dev setup use FabAuthManager by default, I initially thought the issue was caused by not adding a JWT token to the API client. )

@pierrejeambrun
Copy link
Member

pierrejeambrun commented Mar 6, 2025

There are two problems:

  • this PR is breaking the k8s KubeExec integration tests, this need to be fixed
  • this PR uses get_permitted_dag_ids which breaks the front-end at the moment, Vincent will fix this I believe. Other auth PRs do not use get_permitted_dag_ids, this is only introduced here, so other PRs don't need to be reverted.

@jason810496
Copy link
Member Author

There are two problems:

  • this PR is breaking the k8s KubeExec integration tests, this need to be fixed
  • this PR uses get_permitted_dag_ids which breaks the front-end at the moment, Vincent will fix this I believe. Other auth PRs do not use get_permitted_dag_ids, this is only introduced here, so other PRs don't need to be reverted.

Thanks for confirmation, I will start fixing k8s integration tests.

@vincbeck
Copy link
Contributor

vincbeck commented Mar 6, 2025

I'll have a PR for FAB today

@vincbeck
Copy link
Contributor

vincbeck commented Mar 6, 2025

#47458

@Lee-W Lee-W mentioned this pull request Mar 6, 2025
47 tasks
@Lee-W Lee-W force-pushed the feature/AIP-84/auth/dags branch from e7c9e29 to 3621398 Compare March 7, 2025 08:04
@Lee-W Lee-W force-pushed the feature/AIP-84/auth/dags branch 5 times, most recently from 5148e1a to c327620 Compare March 7, 2025 15:26
@jason810496 jason810496 force-pushed the feature/AIP-84/auth/dags branch from c327620 to 0813087 Compare March 7, 2025 18:43
@jason810496
Copy link
Member Author

jason810496 commented Mar 7, 2025

Finally fixed the Kubernetes tests locally!

Some notes:

  1. The API server still uses FAB Auth Manager, so when retrieving a JWT, the same session must be used to handle the CSRF token session issue. Otherwise, it results in:

    <html lang=en>
    <title>400 Bad Request</title>
    <h1>Bad Request</h1>
    <p>The CSRF session token is missing.</p>
    
  2. The current redirect_url from FAB is still http://localhost:8080/?token=..., but KUBERNETES_HOST_PORT is not localhost:8080.
    This can be fixed in a future PR.
    Update: Fixed in Fix Breeze K8s: Patch ConfigMap [api/base_url] with Port-Forwarded Host Port #47544

@jason810496 jason810496 force-pushed the feature/AIP-84/auth/dags branch from 0813087 to 7ca6fd4 Compare March 8, 2025 02:30
@jason810496
Copy link
Member Author

jason810496 commented Mar 8, 2025

I ran the Kubernetes tests locally yesterday, and they all passed. However, after rebasing today, they consistently fail.
It seems that triggering a dagRun results in a 404 error:

post_string = f"http://{host}/public/dags/{dag_id}/dagRuns"
print(f"Calling [start_dag]#2 {post_string}")
logical_date = datetime.now(timezone.utc).isoformat()
# Trigger a new dagrun
result = self.session.post(post_string, json={"logical_date": logical_date})

self = <requests.adapters.HTTPAdapter object at 0x119d835b0>, request = <PreparedRequest [POST]>, stream = False, timeout = Timeout(connect=None, read=None, total=None), verify = True, cert = None, proxies = OrderedDict()

# ...
    
        except MaxRetryError as e:
            if isinstance(e.reason, ConnectTimeoutError):
                # TODO: Remove this in 3.0.0: see #2811
                if not isinstance(e.reason, NewConnectionError):
                    raise ConnectTimeout(e, request=request)
    
            if isinstance(e.reason, ResponseError):
>               raise RetryError(e, request=request)
E               requests.exceptions.RetryError: HTTPConnectionPool(host='localhost', port=14476): Max retries exceeded with url: /public/dags/example_kubernetes_executor/dagRuns (Caused by ResponseError('too many 404 error responses'))

Tracing the issue on the API side, it looks like the DAG must be active to be triggered:

dm = session.scalar(select(DagModel).where(DagModel.is_active, DagModel.dag_id == dag_id).limit(1))
if not dm:
raise HTTPException(status.HTTP_404_NOT_FOUND, f"DAG with dag_id: '{dag_id}' not found")

However, in the patch DAG endpoint, only the is_paused field can be modified, while the is_active field cannot be changed:

dag = session.get(DagModel, dag_id)
if dag is None:
raise HTTPException(status.HTTP_404_NOT_FOUND, f"Dag with id: {dag_id} was not found")
fields_to_update = patch_body.model_fields_set
if update_mask:
if update_mask != ["is_paused"]:
raise HTTPException(
status.HTTP_400_BAD_REQUEST, "Only `is_paused` field can be updated through the REST API"
)


Update: This can be resolved by adding airflow dags reserialize before the test run.

@jason810496
Copy link
Member Author

jason810496 commented Mar 9, 2025

No error with Kubernetes tests / K8S System:KubernetesExecutor-3.9-v1.29.12-true
but Kubernetes tests / K8S System:KubernetesExecutor-3.9-v1.29.12-false always fail due to timeout.

=========================== short test summary info ============================
FAILED kubernetes_tests/test_kubernetes_executor.py::TestKubernetesExecutor::test_integration_run_dag_with_scheduler_failure - Failed: Timeout >300.0s
=================== 1 failed, 1 passed in 376.40s (0:06:16) ====================

Update:
I just looked into the Kubernetes test with the -false suffix, which indicates that useStandardNaming=false. I will try deploying with useStandardNaming=false and observe the differences when running the Kubernetes test against a cluster with this setting.

@Lee-W Lee-W force-pushed the feature/AIP-84/auth/dags branch from 7724236 to e62d885 Compare March 10, 2025 00:15
@Lee-W
Copy link
Member

Lee-W commented Mar 11, 2025

Yep, seems to be something starts to happen this morning and can be fixed by retriggering

@jason810496
Copy link
Member Author

jason810496 commented Mar 11, 2025

@rawwar also ran the k8s test locally and found that it only fails when running all tests but passes when running a specific test.

I'm looking into the reason now.


Update: Not found the root cause, but add the following mechanism

  • Add JWTRefreshAdapter to handle 401, 403 status code.
  • Only rollout restart airflow-api-server deployment if needed.

@jason810496
Copy link
Member Author

Only 1 k8s test fail flaky 🎉 , I will rebase main and run again.

kubernetes_tests/test_kubernetes_pod_operator.py::TestKubernetesPodOperator::test_kubernetes_pod_operator_active_deadline_seconds[10] - requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

@jason810496
Copy link
Member Author

Good news! All k8s tests are success in this run, I will rebase later ( there are conflicts with latest main ).

@Lee-W
Copy link
Member

Lee-W commented Mar 11, 2025

all green. merging

@jason810496
Copy link
Member Author

jason810496 commented Mar 11, 2025

All green!
I believe the Kubernetes tests have become more stable after this PR by adding additional HTTP retries at each stage of the API calls.

Lesson learned:
Retries are essential in distributed systems since components or connections can become unstable at any time.

@Lee-W Lee-W merged commit 41a8a9a into apache:main Mar 11, 2025
89 checks passed
@rawwar
Copy link
Contributor

rawwar commented Mar 11, 2025

Lesson learned: Retries are essential in distributed systems since components or connections can become unstable at any time.

How many retries?
Answer: I'll take all of them

@Lee-W
Copy link
Member

Lee-W commented Mar 11, 2025

#protm

nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025
* AIP-84 | Add Auth for Dag

Refactor conftest for api_fastapi and test_dags

Add unauthorized 403 test cases

Remove PATCH in requires_access

Fix unauthorized_test_client, requires_access_dag

Add EditableDagsFilterDep, ReadableDagsFilterDep

Add permitted_dag_filter for dags API

Fix test_security

Add OrmFilterClause

Fix mypy error

* fix(api_fastapi): rename methods argument to method

* Fix kubernetes_tests

* Fix api_fastapi/test_dags

* Add dags_reserialize for k8s tests

Refactor _get_jwt_token

* Increase threshold of test_integration_run_dag_with_scheduler_failure

* test: raise if we cannot get jwt_token not due to connection error

* Fix _get_jwt_token after dynamic patching k8s configMap

* Remove dags_reserialize setup in BaseK8STest

* Fix test_docker_compose_quick_start

* Ensure scheduler health in test_integration_run_dag_with_scheduler_failure

* Increase timeout threshold

* Add HTTP retry for _get_jwt_token

* Add JWTRefreshAdapter and restart api-server if needed

---------

Co-authored-by: Wei Lee <weilee.rx@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:API Airflow's REST/HTTP API full tests needed We need to run full set of tests for this PR to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants