Prevent large objects from being stored in RenderedTaskInstanceFields #28199
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Hmm, I’m not sure how it’d be actionable. Perhaps we need to add more hooks to the XCom backend interface for this.
Maybe we should simply not display the args at all if they come from XCom and we have a custom backend? Or use `orm_deserialize_value` in this case to display them? I guess it is possible to determine where the op_args/kwargs come from?
Using `orm_deserialize_value`?
Yeah, not very friendly. I'd even say the current behaviour is a bug, because the `orm_deserialize_value` method was clearly meant to handle exactly this kind of case, according to its description:
Hi. I have the same problem and it's kind of frustrating, as I think it completely ruins the TaskFlow idea of presenting DAGs as a composition of tasks. I am also wondering what possible solutions there could be:
Also, for people with the same problem: I'm currently using a workaround of explicitly pulling the desired XCom inside a task. It's not ideal, but it's clean enough and it works. Thanks for your time, and sorry if this sounds harsh, I'm just a little bit frustrated 😞
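A minimal sketch of that workaround (task names and data are illustrative, not from the original report): instead of passing the object between TaskFlow tasks, pull it explicitly inside the consuming task so it never appears in the templated arguments.

```python
# Sketch of the workaround: pull the XCom inside the task body instead of
# passing the object through TaskFlow arguments. Names are illustrative.
import pandas as pd
import pendulum
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def xcom_pull_workaround():
    @task
    def produce_dataframe() -> pd.DataFrame:
        return pd.DataFrame({"value": range(1_000_000)})

    @task
    def consume_dataframe() -> int:
        # Pulled at run time, so the DataFrame never shows up in the rendered
        # op_args/op_kwargs saved to rendered_task_instance_fields.
        ti = get_current_context()["ti"]
        df = ti.xcom_pull(task_ids="produce_dataframe")
        return len(df)

    # Only an ordering dependency; no data flows through templated arguments.
    produce_dataframe() >> consume_dataframe()


xcom_pull_workaround()
```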
Why not make a PR and try to fix it? That is what we usually do. When you create a PR and try to fix something, you usually find a good way of doing it, and by doing it and showing what you propose, it becomes clear what you are proposing. Discussing a concrete proposed improvement is always a good idea. Airflow is created by almost 2,500 people, mostly people who felt frustrated with something and then implemented a fix or feature. This is how it works here.
Hello, I am having a similar issue and I'm willing to make a PR. Same story as @PatrickfBraz: giving "full taskflow" and Airflow native features a try, instead of doing everything with KubernetesPodOperator / PythonOperator. I'm using Airflow 2.7.0 with a custom XCom backend. My DAG consists of:

```
{%- set data = ti.xcom_pull(task_ids='pull_data_from_api') -%}
{%- for entry in data -%}
INSERT INTO "SCHEMA_NAME"."TABLE_NAME" VALUES ('{{entry|tojson}}');
{% endfor %}
```

This results in the full payload showing up in the rendered template tab (the example uses mockup data).

So, if I understand this discussion correctly, a PR to begin with could render the template two times, once for execution and once for display?
Instead of checking for a custom backend, we can just check whether the backend actually re-implements `orm_deserialize_value`.
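A rough sketch of what such a check could look like (the helper name is made up; this assumes the configured backend class is what `airflow.models.xcom.XCom` resolves to):

```python
# Sketch: detect whether the configured XCom backend overrides
# orm_deserialize_value. The helper name is hypothetical.
from airflow.models.xcom import BaseXCom, XCom


def backend_overrides_orm_deserialize() -> bool:
    # XCom is resolved to the configured backend class when the module is
    # imported, so comparing the functions tells us whether it was overridden.
    return XCom.orm_deserialize_value is not BaseXCom.orm_deserialize_value
```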
Hi @uranusjr @potiuk, I'm looking at this issue, but it doesn't seem like using it would work here:

airflow/airflow/models/renderedtifields.py, lines 118 to 120 at 30f7b2a
The context is that when we render the templates (which are saved to RTIF later), the rendering process also implicitly resolves XCom, causing big values to be loaded (and thus saved to RTIF). We do want the values to be loaded for execution, so one possible solution would be to render the templates twice if a custom XCom backend is detected (by checking whether `orm_deserialize_value` is overridden).

But thinking of this again, maybe that is fine…? Custom XCom is just one special case (where the problem is more significant). It could be argued we should reduce all large values anyway, regardless of where they come from, and handle that in RTIF instead.
There's no control over the size of objects stored in the rendered task instance fields. This PR adds that control and enables users to customize the size of the data that can be stored in this field.

closes: apache#28199
* Prevent large objects from being stored in the RTIF

  There's no control over the size of objects stored in the rendered task instance fields. This PR adds that control and enables users to customize the size of data that can be stored in this field.

  closes: #28199

* fixup! Prevent large objects from being stored in the RTIF
* Use len and check the size of the serialized
* Add test to already existing test
* Remove xcom db clearing
* Update airflow/config_templates/config.yml (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* Apply review suggestions
* Add test for redacting values and add significant item
* Prefix with Truncated line
* Ensure secrets are masked
* fixup! Ensure secrets are masked
* Check the template field length in separate branches for jsonable and nonjsonable
* add tests for rendered dataframe
* Apply suggestions from code review
* update code and tests
* fixup! update code and tests

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
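For readers landing here from the issue: the merged change caps how much of a rendered template field is stored. Below is a rough illustration of that idea only; the config key name, fallback value, and message wording are assumptions for the sketch, not the merged code.

```python
# Rough sketch of the approach described in the PR text above: serialize the
# rendered field, check its length, and store a truncated preview if it is
# too large. Config key, fallback, and message wording are assumptions.
import json

from airflow.configuration import conf

MAX_LENGTH = conf.getint("core", "max_templated_field_length", fallback=4096)


def redact_large_rendered_field(rendered):
    try:
        serialized = json.dumps(rendered)   # JSON-able fields
    except TypeError:
        serialized = str(rendered)          # non-JSON-able fields, e.g. DataFrames
    if len(serialized) > MAX_LENGTH:
        # Store only a truncated preview instead of the full object.
        return "Truncated. " + serialized[:MAX_LENGTH] + "..."
    return rendered
```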
Apache Airflow version
Airflow v2.4.1
What happened
In order to provide greater flexibility and ease of implementation for DAGs and tasks in our Airflow instance, we decided to implement a custom XCom backend. This way, we store in the database only a reference to the objects, which are serialized with pickle and saved to Google Cloud Storage (GCS).
All the recommendations found in this documentation were followed, including the implementation of the orm_deserialize_value method to create a short and lightweight representation of the objects saved in GCS.
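For context, a minimal sketch of this kind of backend (the class name, bucket name, path layout, and pickling details below are illustrative assumptions, not the actual implementation described in this report):

```python
# Illustrative sketch of a custom XCom backend that stores pickled objects in
# GCS and keeps only a reference string in the metadata database.
# Bucket name, prefix, and path layout are made up for the example.
import pickle
import uuid

from airflow.models.xcom import BaseXCom
from google.cloud import storage


class GCSXComBackend(BaseXCom):
    BUCKET = "my-xcom-bucket"   # hypothetical bucket
    PREFIX = "gcs_xcom://"      # hypothetical reference prefix

    @staticmethod
    def serialize_value(value, **kwargs):
        # Pickle the object to GCS and store only the reference in the DB.
        blob_name = f"xcom/{uuid.uuid4()}.pkl"
        bucket = storage.Client().bucket(GCSXComBackend.BUCKET)
        bucket.blob(blob_name).upload_from_string(pickle.dumps(value))
        return BaseXCom.serialize_value(GCSXComBackend.PREFIX + blob_name)

    @staticmethod
    def deserialize_value(result):
        # Fetch the pickled object back from GCS when a task pulls the XCom.
        reference = BaseXCom.deserialize_value(result)
        blob_name = reference.replace(GCSXComBackend.PREFIX, "", 1)
        bucket = storage.Client().bucket(GCSXComBackend.BUCKET)
        return pickle.loads(bucket.blob(blob_name).download_as_bytes())

    def orm_deserialize_value(self):
        # Lightweight representation for the UI: return just the stored
        # reference instead of downloading and unpickling the object.
        return BaseXCom.deserialize_value(self)
```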
The custom backend works perfectly and has been in production for a few months. Recently there has also been a strong push on the team to implement new DAGs using the TaskFlow API and slowly refactor the old ones. However, while implementing a new DAG that works with DataFrames from the pandas library, we noticed that the execution failed not in the task that generated the DataFrame but in the tasks that consumed it. During debugging we discovered that the problems were caused by the DataFrames being saved (or the attempt to save them) into the rendered_task_instance_fields table.
We believed that only arguments provided as templates were actually rendered and saved in this table, but apparently TaskFlow shares information between tasks through templates (I don't know exactly how it works). One would also expect, as with the tab that renders the XComs, that the orm_deserialize_value method would be called, but that doesn't seem to be the case.
Example of sample code:
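The original snippet was not preserved in this report; a hypothetical DAG along the lines described above would look like this (names and data are illustrative):

```python
# Hypothetical reproduction: one TaskFlow task returns a pandas DataFrame and
# another consumes it. With the custom XCom backend only a reference lands in
# the XCom table, but the full DataFrame is rendered into the consuming task's
# op_args/op_kwargs and stored in rendered_task_instance_fields.
import pandas as pd
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2022, 12, 1), catchup=False)
def dataframe_xcom_example():
    @task
    def build_dataframe() -> pd.DataFrame:
        return pd.DataFrame({"value": range(1_000_000)})

    @task
    def consume_dataframe(df: pd.DataFrame) -> int:
        return len(df)

    # Passing the DataFrame directly is what triggers the rendering problem.
    consume_dataframe(build_dataframe())


dataframe_xcom_example()
```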
There is no problem executing the DAG:
The following XCom information is rendered in the UI:
Checking the XCom table in the local database used by Airflow:
Everything is OK so far. However, when checking the rendered template tab:
Confirming that this was really the object saved in the database:
What you think should happen instead
I expected that, even if the object is deserialized, the method that returns the lightweight representation of the saved object would be called. The representation it creates is enough for debugging purposes and doesn't burden the database with bulky objects.
How to reproduce
Operating System
NAME="Ubuntu" VERSION="20.04.5 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 20.04.5 LTS" VERSION_ID="20.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=focal UBUNTU_CODENAME=focal
Versions of Apache Airflow Providers
In the production environment, we are using:
apache-airflow-providers-postgres>=4.0.0
apache-airflow-providers-apache-beam>=4.0.0
apache-airflow-providers-cncf-kubernetes>=4.1.0
apache-airflow-providers-datadog>=3.0.0
apache-airflow-providers-google>=8.0.0
apache-airflow-providers-http>=3.0.0
apache-airflow-providers-microsoft-mssql>=3.0.0
apache-airflow-providers-mongo>=3.0.0
apache-airflow-providers-mysql>=3.0.0
apache-airflow-providers-odbc>=3.0.0
apache-airflow-providers-sftp>=3.0.0
apache-airflow-providers-ssh>=3.0.0
apache-airflow-providers-airbyte>=3.0.0
apache-airflow-upgrade-check==1.4.0
Deployment
Other 3rd-party Helm chart
Deployment details
We manage a fork of the official Airflow chart, which we customize for our use case. Deployment is done on a v1.21 Kubernetes cluster hosted on Google Kubernetes Engine (GKE).
Anything else
This is not a critical problem, since using the operators in the classic way already allows us to use the custom XCom backend without issues. However, TaskFlow presents a much more readable and friendly way of producing code. Since we want to democratize and facilitate the implementation of DAGs across different teams in the same Airflow instance, using TaskFlow would be of great use to us.
Are you willing to submit PR?
Code of Conduct