[WIP] Add a public interface for custom weight_rule implementation #36029
Conversation
airflow/migrations/versions/0132_2_8_0_add_priority_weight_strategy_to_task_.py
…ntation (apache#35210)" (apache#36066)" This reverts commit f60d458.
…35210)

* Add a public interface for custom weight_rule implementation
* Remove _weight_strategy attribute
* Move priority weight calculation to TI to support advanced strategies
* Fix loading the var from mapped operators and simplify loading it from task
* Update default value and deprecated the other one
* Update task endpoint API spec
* fix tests
* Update docs and add dag example
* Fix serialization test
* revert change in spark provider
* Update unit tests
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
I generally like the idea of custom weight rules, and I have had a couple of cases where I thought it would be cool to inject this function. The PR looks mostly good, but I'd like to understand if there is really a need for:
- Adding a DB column per task instance and serializing (new) context information for every task instance. That sounds like a lot of overhead for potentially millions of tuples in the DB for a very special use case. I'd prefer to assume that in 95% of cases the task context is sufficient for the calculation.
- Renaming the parameter on the task. If it serves the same purpose and has the same config property as before, renaming forces all users to change their DAGs, whereas for the vast majority of users the rename has no functional effect.
- The docs are very brief. For the complexity added, especially in terms of the serialization needed, I'd expect more details and a pointer to the example provided.
Please don't take this negatively, but I assume we need a proper review, and I feel the complexity added needs a rationale, as the DB schema changes might have a negative side impact on large, high-throughput setups. Other opinions welcome. To me this feels like a lot of overhead for a niche feature.
```python
def upgrade():
    """Apply add priority_weight_strategy to task_instance"""
    with op.batch_alter_table("task_instance") as batch_op:
        batch_op.add_column(sa.Column("_priority_weight_strategy", sa.JSON()))
```
I understand we need to store some context information about the selected priority weight strategy, but do we really need to add this to the DB? TaskInstance is the largest table in the DB schema and potentially contains millions of rows. Do we really want to store the same values in what will mostly be millions of cases? Or can we leave it NULL and store a value only when this special rule is used and data actually needs to be stored?
I like the approach of this PR in general but fear it will create a lot of overhead in the DB, especially as it is a JSON field.
```
@@ -575,6 +579,11 @@ class derived from this one results in the creation of a task object,
    significantly speeding up the task creation process as for very large
    DAGs. Options can be set as string or using the constants defined in
    the static class ``airflow.utils.WeightRule``
:param priority_weight_strategy: weighting method used for the effective total priority weight
```
I do not fully understand why we need to rename the field. Even if we are using a custom implementation, does the old name no longer fit? Especially as the same parameters apply.
This way we force all users to change their DAG definitions as we migrate the parameter name, while in 95% of cases the parameter and its function do not change.
```python
importable_string = qualname(priority_weight_strategy_class)
if _get_registered_priority_weight_strategy(importable_string) is None:
    raise _PriorityWeightStrategyNotRegistered(importable_string)
return {Encoding.TYPE: importable_string, Encoding.VAR: var.serialize()}
```
Do we assume we need to store state for the strategy? Would it not be enough to just use the task context information and the Python code? Do we expect that real "data" needs to be serialized, stored and retrieved for a custom strategy? This feels like a lot of overhead for very special use cases where strategy state information needs to be persisted.
Do you have a use case in mind where per-task data (other than the context) needs to be persisted in a custom strategy? I feel this is a niche use case, but maybe I just don't have the use case in mind.
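To illustrate the point above: most strategies can likely be written statelessly, deriving the weight purely from TaskInstance fields with nothing to serialize. The following is a minimal sketch of such a stateless strategy; the base class and the `TaskInstance` fields below are stand-ins for the interface proposed in this PR, not the actual Airflow implementation, and the retry-escalation logic is a hypothetical example.

```python
# Hedged sketch: a stateless custom strategy that derives the weight purely
# from TaskInstance fields, with no serialized state. The classes below are
# illustrative stand-ins for the interface proposed in this PR.
from dataclasses import dataclass


@dataclass
class TaskInstance:  # stand-in for airflow.models.TaskInstance
    priority_weight: int
    try_number: int


class PriorityWeightStrategy:  # stand-in for the proposed base class
    def get_weight(self, ti: TaskInstance) -> int:
        raise NotImplementedError

    def serialize(self) -> dict:
        # No state: nothing to persist in the DB.
        return {}


class EscalateOnRetryStrategy(PriorityWeightStrategy):
    """Hypothetical example: bump the weight on every retry so retried tasks run sooner."""

    def get_weight(self, ti: TaskInstance) -> int:
        return ti.priority_weight + 10 * (ti.try_number - 1)


strategy = EscalateOnRetryStrategy()
print(strategy.get_weight(TaskInstance(priority_weight=5, try_number=1)))  # 5
print(strategy.get_weight(TaskInstance(priority_weight=5, try_number=3)))  # 25
print(strategy.serialize())  # {}
```

Nothing here needs a per-row JSON blob: the class's importable path alone would be enough to reconstruct it.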
```python
def get_weight(self, ti: TaskInstance):
    """Get the priority weight of a task."""
```
Two suggestions:
- Return type missing? What is returned?
- Is it a relative or absolute weight?
```suggestion
def get_weight(self, ti: TaskInstance) -> int:
    """Get the absolute priority weight of a task."""
```
```python
``deserialize`` when the DAG is deserialized. The default implementation returns
an empty dict.
"""
return {}
```
If there is a real use case where state data needs to be persisted (see my other comments: do we really need this?), can we optimize so that None is returned when no data needs to be persisted? That would at least leave the DB column empty in 95% of cases, I feel.
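A minimal sketch of that suggestion: let `serialize()` default to `None` so that only genuinely stateful strategies produce a non-NULL column value. The class names and the `column_value` helper below are illustrative assumptions, not the actual Airflow code.

```python
# Hedged sketch of the suggestion above: serialize() returns None when a
# strategy carries no state, so the DB column can stay NULL in the common
# case. All names here are illustrative, not the real Airflow implementation.
from typing import Optional


class PriorityWeightStrategy:  # illustrative stand-in for the proposed base class
    def serialize(self) -> Optional[dict]:
        # Default: no state, nothing to persist -> column stays NULL.
        return None


class StatefulStrategy(PriorityWeightStrategy):
    """Hypothetical strategy that actually carries state worth persisting."""

    def __init__(self, boost: int) -> None:
        self.boost = boost

    def serialize(self) -> Optional[dict]:
        return {"boost": self.boost}


def column_value(strategy: PriorityWeightStrategy) -> Optional[dict]:
    # Only stateful strategies would write a non-NULL value to the column.
    return strategy.serialize()


print(column_value(PriorityWeightStrategy()))   # None
print(column_value(StatefulStrategy(boost=7)))  # {'boost': 7}
```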
```python
class AbsolutePriorityWeightStrategy(PriorityWeightStrategy):
    """Priority weight strategy that uses the task's priority weight directly."""

    def get_weight(self, ti: TaskInstance):
```
```suggestion
def get_weight(self, ti: TaskInstance) -> int:
```
```python
class DownstreamPriorityWeightStrategy(PriorityWeightStrategy):
    """Priority weight strategy that uses the sum of the priority weights of all downstream tasks."""

    def get_weight(self, ti: TaskInstance):
```
```suggestion
def get_weight(self, ti: TaskInstance) -> int:
```
```python
class UpstreamPriorityWeightStrategy(PriorityWeightStrategy):
    """Priority weight strategy that uses the sum of the priority weights of all upstream tasks."""

    def get_weight(self, ti: TaskInstance):
```
```suggestion
def get_weight(self, ti: TaskInstance) -> int:
```
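For readers following along, the three built-in strategies quoted above (absolute, downstream, upstream) differ only in which neighbours' weights they sum. A rough sketch on a toy task graph, with plain dicts standing in for the DAG's dependency resolution (the graph, weights, and flat — non-transitive — neighbour lists are simplifying assumptions for illustration):

```python
# Hedged sketch of what the three built-in strategies compute, on a toy
# task graph. Real Airflow walks the DAG's task relations; here plain dicts
# stand in, and downstream/upstream lists are given directly.
weights = {"a": 1, "b": 2, "c": 3}
downstream = {"a": ["b", "c"], "b": ["c"], "c": []}
upstream = {"a": [], "b": ["a"], "c": ["a", "b"]}


def absolute(task: str) -> int:
    # Absolute: the task's own priority weight, unchanged.
    return weights[task]


def downstream_weight(task: str) -> int:
    # Downstream: own weight plus the weights of all downstream tasks.
    return weights[task] + sum(weights[t] for t in downstream[task])


def upstream_weight(task: str) -> int:
    # Upstream: own weight plus the weights of all upstream tasks.
    return weights[task] + sum(weights[t] for t in upstream[task])


print(absolute("a"))           # 1
print(downstream_weight("a"))  # 1 + 2 + 3 = 6
print(upstream_weight("c"))    # 3 + 1 + 2 = 6
```

None of these three needs any serialized state, which is the crux of the persistence debate in this thread.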
```python
# def test_deserialization_across_process(self):
#     """A serialized DAG can be deserialized in another process."""
#
#     # Since we need to parse the dags twice here (once in the subprocess,
#     # and once here to get a DAG to compare to) we don't want to load all
#     # dags.
#     queue = multiprocessing.Queue()
#     proc = multiprocessing.Process(target=serialize_subprocess, args=(queue, "airflow/example_dags"))
#     proc.daemon = True
#     proc.start()
#
#     stringified_dags = {}
#     while True:
#         v = queue.get()
#         if v is None:
#             break
#         dag = SerializedDAG.from_json(v)
#         assert isinstance(dag, DAG)
#         stringified_dags[dag.dag_id] = dag
#
#     dags = collect_dags("airflow/example_dags")
#     assert set(stringified_dags.keys()) == set(dags.keys())
#
#     # Verify deserialized DAGs.
#     for dag_id in stringified_dags:
#         self.validate_deserialized_dag(stringified_dags[dag_id], dags[dag_id])
```
Commented-out code: does this need to be fixed, or should it be deleted?
I don't think this is a niche feature. This is the heart and core of what an orchestrator needs to have.
"niche" in context: we are at version 2.9.0 now and so far most of the demand is covered, in contrast to adding a persistent blob to every task instance, which we need a million times over in the DB and which has a high performance impact. Even if it is experimental, we are extending the DB model at the core, where we need to think carefully about every byte we spend, because it will be multiplied a million times across instances. Even if experimental (which I think is rather about the interface/programming model), I doubt that once established we will roll back the DB schema changes. Even the example supplied does not require persistence of any information, and the use cases I have in mind also do not need a blob/JSON persisted. On top of that: how is the persisted data filled initially when the "class" is attached to the task? At parse time?

On the term "niche": assume 10% of installations will implement a custom weight strategy (if it is more, we might consider adding common strategies to core anyway), and in these setups 10% of tasks need this strategy (today, priority is typically considered at all in less than 10% of tasks). I can imagine only a few use cases where state for the weight strategy needs to be persisted, say 10% of those (others might rather be based on context or other criteria). Then in 99.9% of cases we have an empty persisted state column for the strategy.

Also consider that the overall DAG flow we implemented today is in general "state-free" during execution, except for DAG params, context parameters (static at parse time except Jinja templating) and upstream XCom... there might equally be a very high demand for storing state of an execution for a retry or similar. Today the task model does not allow storing additional state for retries either, other than misusing Variables or external systems (temp files on S3). Why are we then adding persistence across executions just for the weight strategy?
(I understand that the Scheduler has no access to DAGs, and besides Python code this might be the only way to bring parameters into the Scheduler.) Which are the cases that need persistence? Is it that a weight rule just needs a parameter with 10 options? Then you could work around that by adding 10 weight strategies instead.

I'd favor, if the need is not known, making a first step without additional persistence and just supplying the option of a Python-code-only extension. If there turns out to be strong demand for persistence, think about how it can be implemented without impacting all other use cases. See XCom for example: DB tuples and storage are only consumed if XCom is produced, not on every task. But I'd be OK if you tell me the use case that urgently needs persistence across task schedules; then the DB overhead might be reasonable. Then the example should also include this.
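The "10 weight strategies instead of one parameter" workaround mentioned above can be sketched roughly as follows: generate one parameterless, importable strategy class per option, so only a class path (not serialized state) identifies the choice. Everything here — the base class, the factory, the registry — is a hypothetical illustration, not Airflow's actual registration mechanism.

```python
# Hedged sketch of the workaround: one registered, parameterless strategy
# class per option, so only an importable class name needs to be stored
# instead of a serialized per-task parameter. All names are illustrative.
class PriorityWeightStrategy:  # illustrative stand-in for the proposed base class
    def get_weight(self, base_weight: int) -> int:
        raise NotImplementedError


def make_boost_strategy(boost: int) -> type:
    """Create a parameterless strategy subclass with the boost baked in."""

    class BoostStrategy(PriorityWeightStrategy):
        def get_weight(self, base_weight: int) -> int:
            return base_weight + boost

    BoostStrategy.__name__ = f"Boost{boost}Strategy"
    return BoostStrategy


# A registry keyed by class name: nothing to persist per task instance.
REGISTRY = {cls.__name__: cls for cls in (make_boost_strategy(b) for b in (10, 20, 30))}

print(sorted(REGISTRY))  # ['Boost10Strategy', 'Boost20Strategy', 'Boost30Strategy']
print(REGISTRY["Boost20Strategy"]().get_weight(5))  # 25
```

The trade-off is an enumerated option set instead of a free parameter, which is exactly the limitation the serialization support in this PR was meant to avoid.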
The feature was actually merged for 2.8 and then reverted due to failing tests, so it missed the 2.8 cut. It's not a recent new change. I'd rather we make an attempt to solve the pain if we can than wait for 2.10. We can find solutions and warn about possible performance issues, like having the feature off by default so users must set it explicitly in the settings... I am not so worried about this part. We always find ways to make everyone happy :)
I was thinking of use cases where persistence is needed and still have doubts. The use cases that came to my mind:
I understand that it was merged and reverted in 2.8, but that was regarding security. The performance concerns remain, and they are a valid reason for this to be experimental. My concerns here are mainly performance: adding millions of blobs to the biggest table in the DB schema needs some consideration. Otherwise all Airflow users need to throw additional hardware at their environments to compensate for the overhead of this feature. Extending the DB schema is not optional for users; it might impact all of them. Meaning: I am totally fine with everything except the persistence (plus the other comment about the parameter rename).
@jscheffl The goal was to make it generic and flexible, allowing users to implement any type of strategy; even if an implemented strategy impacts performance, we can simply recommend against it rather than block it. Anyway, I like @eladkal's suggestion of making it experimental. In that case we can implement it in stages: for example, in this release (2.9.0) we can support instantiating the class without parameters (without the need for serialization), and in the next release we can decide whether to support the fully serialized instance or not.
related: #35210