
DebugExecutor use ti.run() instead of ti._run_raw_task #24357

Merged

Conversation

o-nikolas
Contributor

The DebugExecutor previously executed tasks by calling the "private" ti._run_raw_task(...) method instead of ti.run(...). Only the latter contains the logic to increment a task instance's try_number when running, so tasks executed with the DebugExecutor never had their try_numbers increased. For rescheduled tasks this led to off-by-one errors, since the logic that decrements try_number on reschedule still ran while the increment did not.

This off-by-one error manifests as a KeyError in _update_counters (seen in #13322), since the try_number in the ti.keys of the in-memory ti_status.running map doesn't match the try_number in the ti.keys in the DB. #13322 was marked as resolved because two issues were being conflated there: one issue was fixed (and the ticket closed), but users were still seeing this failure due to the issue fixed in this PR.

This unblocks system tests (which are executed with the DebugExecutor): any system test that includes a Sensor doing more than one poke hits the aforementioned off-by-one error, because such a sensor must be rescheduled at least once.

NOTE: The main fix here is to use ti.run() for the debug executor, if anyone has the historical context for that design decision please weigh in, thanks! (CC @turbaszek)
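The bookkeeping difference can be sketched with a minimal mock. This is an illustrative model only, not Airflow's actual TaskInstance implementation; the class and method bodies below are hypothetical stand-ins for the behavior the description attributes to run() and _run_raw_task():

```python
# Illustrative mock (NOT Airflow's real code): run() bumps try_number
# before executing, _run_raw_task() does not, and the reschedule path
# decrements it. Calling _run_raw_task() directly therefore leaves
# try_number one lower than the backfill's in-memory map expects.

class MockTaskInstance:
    def __init__(self):
        self.try_number = 1

    def _run_raw_task(self, reschedule=False):
        # Raw execution: no try_number bookkeeping on entry.
        if reschedule:
            # On reschedule, roll try_number back so the re-run is not
            # counted against the task's retries.
            self.try_number -= 1

    def run(self, reschedule=False):
        # Public entry point: increment try_number first...
        self.try_number += 1
        # ...then delegate to the raw execution path.
        self._run_raw_task(reschedule=reschedule)


# Via _run_raw_task() alone, a rescheduled poke drifts off by one:
buggy = MockTaskInstance()
buggy._run_raw_task(reschedule=True)
print(buggy.try_number)  # 0: one lower than the DB expects

# Via run(), the increment and the reschedule decrement cancel out:
fixed = MockTaskInstance()
fixed.run(reschedule=True)
print(fixed.try_number)  # 1
```

In this simplified model, each additional reschedule through the raw path pushes the counter one further out of sync, which is why a multi-poke sensor reliably triggers the mismatch.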



@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jun 9, 2022
@@ -1599,7 +1615,7 @@ def test_mapped_dag(self, dag_id, executor_name, session):
         self.dagbag.process_file(str(TEST_DAGS_FOLDER / f'{dag_id}.py'))
         dag = self.dagbag.get_dag(dag_id)
 
-        when = datetime.datetime(2022, 1, 1)
+        when = timezone.datetime(2022, 1, 1)
Contributor Author

Note: DB errors are thrown if a non-timezone-aware date is used, which makes me think these long running tests aren't being run in our CI, since otherwise they would be failing. @potiuk do you have any context around that?
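The distinction behind the diff and the DB errors mentioned here can be sketched with the stdlib. Assumption (based on its documented behavior): airflow.utils.timezone.datetime returns a UTC-aware datetime, which the stdlib construction below stands in for:

```python
from datetime import datetime, timezone as dt_tz

# Naive: what plain datetime.datetime(2022, 1, 1) produces (no tzinfo).
naive = datetime(2022, 1, 1)
# Aware: a stand-in for Airflow's timezone.datetime(2022, 1, 1).
aware = datetime(2022, 1, 1, tzinfo=dt_tz.utc)

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC

# Mixing naive and aware values fails fast, mirroring the failures
# a timezone-aware timestamp column produces for naive input:
try:
    _ = naive < aware
except TypeError as err:
    print(f"comparison failed: {err}")
```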

Member

Yep. Long running tests are NOT run on our CI at all :).

There is no more context than that :D

Contributor Author

😆 I meant more context on why that's the case. I assume folks just don't want to slow down the regular builds, but maybe there were timeouts on the test runners? Either way, what do you think about a scheduled CI run on main (once daily, maybe?) that includes the long running tests, so that we at least get some coverage for them?

@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 10, 2022
The DebugExecutor previously executed tasks by calling the "private"
ti._run_raw_task(...) method instead of ti.run(...). But the latter
contains the logic to increase task instance try_numbers when running,
thus tasks executed with the DebugExecutor were never getting their
try_numbers increased and for rescheduled tasks this led to off-by-one
errors (as the logic to reduce the try_number for the reschedule was
still working while the increase was not).
@o-nikolas o-nikolas force-pushed the onikolas/sensor_try_number_debug_executor branch from 1acafa6 to 86b94e2 Compare June 10, 2022 17:07
Member

@turbaszek turbaszek left a comment

Looks reasonable to me. IIRC sensors were previously executed properly, but in a blocking manner:

if isinstance(task_copy, BaseSensorOperator) and \
        conf.get('core', 'executor') == "DebugExecutor":
    self.log.warning("DebugExecutor changes sensor mode to 'reschedule'.")
    task_copy.mode = 'reschedule'

However, this code is no longer present.
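The two sensor modes this comment contrasts can be modeled with a short sketch. This is a hypothetical simplification, not Airflow's actual BaseSensorOperator; the Reschedule class below stands in for airflow.exceptions.AirflowRescheduleException:

```python
import time

class Reschedule(Exception):
    """Stand-in for airflow.exceptions.AirflowRescheduleException."""

def run_sensor(poke, mode, poke_interval=0.01):
    """Simplified model of the two sensor execution strategies."""
    if mode == "poke":
        # Blocking mode: occupy the worker, looping until the
        # condition holds.
        while not poke():
            time.sleep(poke_interval)
        return "success"
    # Reschedule mode: one poke per invocation; if the condition is
    # not yet met, give up the slot and ask to be re-run later.
    if poke():
        return "success"
    raise Reschedule

# Blocking mode loops through failed pokes within a single run:
attempts = iter([False, False, True])
print(run_sensor(lambda: next(attempts), mode="poke"))  # success

# Reschedule mode defers instead, which is the path whose try_number
# accounting this PR fixes for the DebugExecutor:
try:
    run_sensor(lambda: False, mode="reschedule")
except Reschedule:
    print("up_for_reschedule")
```

In this model it is the reschedule branch, raising instead of returning, that exercises the try_number decrement discussed above.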

@potiuk
Member

potiuk commented Jun 13, 2022

Merging. The failing tests are due to "full tests needed" and timeouts on the K8S tests (which should be fixed by #24408).

@potiuk potiuk merged commit da7b22b into apache:main Jun 13, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.3.3 milestone Jul 5, 2022
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Jul 5, 2022
ephraimbuddy pushed a commit that referenced this pull request Jul 5, 2022
The DebugExecutor previously executed tasks by calling the "private"
ti._run_raw_task(...) method instead of ti.run(...). But the latter
contains the logic to increase task instance try_numbers when running,
thus tasks executed with the DebugExecutor were never getting their
try_numbers increased and for rescheduled tasks this led to off-by-one
errors (as the logic to reduce the try_number for the reschedule was
still working while the increase was not).

(cherry picked from commit da7b22b)
ephraimbuddy pushed a commit that referenced this pull request Jul 5, 2022