Prevent large objects from being stored in RenderedTaskInstanceFields #28199
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Hmm, I’m not sure how it’d be actionable. Perhaps we need to add more hooks to the XCom backend interface for this.
Maybe we should simply not display the args at all if they come from XCom and we have a custom backend? Or use `orm_deserialize_value` in this case to display them? I guess it is possible to determine where the op_args/kwargs come from?
Using `orm_deserialize_value`?
Yeah, not very friendly. I'd even say the current behaviour is a bug, because the `orm_deserialize_value` method was clearly meant to handle exactly this kind of case, according to its description:
Hi. I have the same problem and it's kind of frustrating, as I think it completely ruins the TaskFlow idea of presenting DAGs as a composition of tasks. I am also wondering what possible solutions there could be:
Also, for people with the same problem: I'm currently using a workaround of explicitly pulling the desired XCom inside a task. It's not ideal, but it's clean enough and it works. Thanks for your time, and sorry if this sounds harsh, I'm just a little bit frustrated 😞
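A minimal sketch of that workaround (task names and data are illustrative, not from the original report): instead of passing the object between TaskFlow tasks, pull it explicitly inside the consuming task so it never appears in the templated arguments.

```python
# Sketch of the workaround: pull the XCom inside the task body instead of
# passing the object through TaskFlow arguments. Names are illustrative.
import pandas as pd
import pendulum
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def xcom_pull_workaround():
    @task
    def produce_dataframe() -> pd.DataFrame:
        return pd.DataFrame({"value": range(1_000_000)})

    @task
    def consume_dataframe() -> int:
        # Pulled at run time, so the DataFrame never shows up in the rendered
        # op_args/op_kwargs saved to rendered_task_instance_fields.
        ti = get_current_context()["ti"]
        df = ti.xcom_pull(task_ids="produce_dataframe")
        return len(df)

    # Only an ordering dependency; no data flows through templated arguments.
    produce_dataframe() >> consume_dataframe()


xcom_pull_workaround()
```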
Why not make a PR and try to fix it? That is what we usually do. When you create a PR and try to fix something, you usually find a good way of doing it, and by doing it and showing what you propose, it becomes clear what you are proposing. Discussing a concrete proposed improvement is always a good idea. Airflow is created by almost 2,500 people, mostly people who felt frustrated with something and then implemented a fix or feature. This is how it works here.
Hello, I am having a similar issue and I'm willing to make a PR. Same story as @PatrickfBraz: giving "full taskflow" and Airflow native features a try, instead of doing everything with KubernetesPodOperator / PythonOperator. I'm using Airflow 2.7.0 with a custom XCom backend. My DAG consists of:

```
{%- set data = ti.xcom_pull(task_ids='pull_data_from_api') -%}
{%- for entry in data -%}
INSERT INTO "SCHEMA_NAME"."TABLE_NAME" VALUES ('{{entry|tojson}}');
{% endfor %}
```

This results in the full payload showing up in the rendered template tab (the example uses mockup data).

So, if I understand this discussion correctly, a PR to begin with could render the template two times, once for execution and once for display?
Instead of checking for a custom backend, we can just check whether the backend actually re-implements `orm_deserialize_value`.
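A rough sketch of what such a check could look like (the helper name is made up; this assumes the configured backend class is what `airflow.models.xcom.XCom` resolves to):

```python
# Sketch: detect whether the configured XCom backend overrides
# orm_deserialize_value. The helper name is hypothetical.
from airflow.models.xcom import BaseXCom, XCom


def backend_overrides_orm_deserialize() -> bool:
    # XCom is resolved to the configured backend class when the module is
    # imported, so comparing the functions tells us whether it was overridden.
    return XCom.orm_deserialize_value is not BaseXCom.orm_deserialize_value
```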
Hi @uranusjr @potiuk, I'm looking at this issue, but it doesn't seem like using it would work here:

airflow/airflow/models/renderedtifields.py, lines 118 to 120 at 30f7b2a
The context is that when we render the templates (which are saved to RTIF later), the rendering process also implicitly resolves XCom, causing big values to be loaded (and thus saved to RTIF). We do want the values to be loaded for execution, so one possible solution would be to render the templates twice if a custom XCom backend is detected (by checking whether `orm_deserialize_value` is overridden).

But thinking of this again, maybe that is fine…? Custom XCom is just one special case (where the problem is more significant). It could be argued we should reduce all large values anyway, regardless of where they come from, and handle that in RTIF instead.
There's no control over the size of objects stored in the rendered task instance fields. This PR adds that control and enables users to customize the size of the data that can be stored in this field.

closes: apache#28199
* Prevent large objects from being stored in the RTIF

  There's no control over the size of objects stored in the rendered task instance fields. This PR adds that control and enables users to customize the size of data that can be stored in this field.

  closes: #28199

* fixup! Prevent large objects from being stored in the RTIF
* Use len and check the size of the serialized
* Add test to already existing test
* Remove xcom db clearing
* Update airflow/config_templates/config.yml (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* Apply review suggestions
* Add test for redacting values and add significant item
* Prefix with Truncated line
* Ensure secrets are masked
* fixup! Ensure secrets are masked
* Check the template field length in separate branches for jsonable and nonjsonable
* add tests for rendered dataframe
* Apply suggestions from code review
* update code and tests
* fixup! update code and tests

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
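For readers landing here from the issue: the merged change caps how much of a rendered template field is stored. Below is a rough illustration of that idea only; the config key name, fallback value, and message wording are assumptions for the sketch, not the merged code.

```python
# Rough sketch of the approach described in the PR text above: serialize the
# rendered field, check its length, and store a truncated preview if it is
# too large. Config key, fallback, and message wording are assumptions.
import json

from airflow.configuration import conf

MAX_LENGTH = conf.getint("core", "max_templated_field_length", fallback=4096)


def redact_large_rendered_field(rendered):
    try:
        serialized = json.dumps(rendered)   # JSON-able fields
    except TypeError:
        serialized = str(rendered)          # non-JSON-able fields, e.g. DataFrames
    if len(serialized) > MAX_LENGTH:
        # Store only a truncated preview instead of the full object.
        return "Truncated. " + serialized[:MAX_LENGTH] + "..."
    return rendered
```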
Apache Airflow version
Airflow v2.4.1
What happened
In order to provide greater flexibility and ease of implementation for DAGs and tasks in our Airflow instance, we decided to implement a custom XCom backend. This way, we store in the database only a reference to the objects, which are serialized with pickle and saved to Google Cloud Storage (GCS).
All the recommendations found in this documentation were followed, including the implementation of the orm_deserialize_value method to create a short and lightweight representation of the objects saved in GCS.
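For context, a minimal sketch of this kind of backend (the class name, bucket name, path layout, and pickling details below are illustrative assumptions, not the actual implementation described in this report):

```python
# Illustrative sketch of a custom XCom backend that stores pickled objects in
# GCS and keeps only a reference string in the metadata database.
# Bucket name, prefix, and path layout are made up for the example.
import pickle
import uuid

from airflow.models.xcom import BaseXCom
from google.cloud import storage


class GCSXComBackend(BaseXCom):
    BUCKET = "my-xcom-bucket"   # hypothetical bucket
    PREFIX = "gcs_xcom://"      # hypothetical reference prefix

    @staticmethod
    def serialize_value(value, **kwargs):
        # Pickle the object to GCS and store only the reference in the DB.
        blob_name = f"xcom/{uuid.uuid4()}.pkl"
        bucket = storage.Client().bucket(GCSXComBackend.BUCKET)
        bucket.blob(blob_name).upload_from_string(pickle.dumps(value))
        return BaseXCom.serialize_value(GCSXComBackend.PREFIX + blob_name)

    @staticmethod
    def deserialize_value(result):
        # Fetch the pickled object back from GCS when a task pulls the XCom.
        reference = BaseXCom.deserialize_value(result)
        blob_name = reference.replace(GCSXComBackend.PREFIX, "", 1)
        bucket = storage.Client().bucket(GCSXComBackend.BUCKET)
        return pickle.loads(bucket.blob(blob_name).download_as_bytes())

    def orm_deserialize_value(self):
        # Lightweight representation for the UI: return just the stored
        # reference instead of downloading and unpickling the object.
        return BaseXCom.deserialize_value(self)
```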
The custom backend works perfectly and has been in production for a few months. Recently there has also been a strong push on the team to implement new DAGs using the TaskFlow API and slowly refactor the old ones. However, while implementing a new DAG that works with DataFrames from the pandas library, we noticed that the execution failed not in the task that generated the DataFrame but in the tasks that consumed it. During debugging we discovered that the problems were caused by the DataFrames being saved (or the attempt to save them) into the rendered_task_instance_fields table.
We believed that only arguments provided as templates were actually rendered and saved in this table, but apparently TaskFlow shares information between tasks through templates (I don't know exactly how it works). One would also expect, as with the tab that renders the XComs, that the orm_deserialize_value method would be called, but that doesn't seem to be the case.
Example of sample code:
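The original snippet was not preserved in this report; a hypothetical DAG along the lines described above would look like this (names and data are illustrative):

```python
# Hypothetical reproduction: one TaskFlow task returns a pandas DataFrame and
# another consumes it. With the custom XCom backend only a reference lands in
# the XCom table, but the full DataFrame is rendered into the consuming task's
# op_args/op_kwargs and stored in rendered_task_instance_fields.
import pandas as pd
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2022, 12, 1), catchup=False)
def dataframe_xcom_example():
    @task
    def build_dataframe() -> pd.DataFrame:
        return pd.DataFrame({"value": range(1_000_000)})

    @task
    def consume_dataframe(df: pd.DataFrame) -> int:
        return len(df)

    # Passing the DataFrame directly is what triggers the rendering problem.
    consume_dataframe(build_dataframe())


dataframe_xcom_example()
```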
There is no problem executing the DAG:
The following XCom information is rendered in the UI:
Checking the XCom table in the local database used by Airflow:
Everything is OK so far. However, when checking the rendered template tab:
Confirming that this was really the object saved in the database:
What you think should happen instead
I expected that, even if the object is deserialized, the method that returns the lightweight representation of the saved object would be called. The representation it creates is enough for debugging purposes and doesn't burden the database with bulky objects.
How to reproduce
Operating System
NAME="Ubuntu" VERSION="20.04.5 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 20.04.5 LTS" VERSION_ID="20.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=focal UBUNTU_CODENAME=focal
Versions of Apache Airflow Providers
In the production environment, we are using:
apache-airflow-providers-postgres>=4.0.0
apache-airflow-providers-apache-beam>=4.0.0
apache-airflow-providers-cncf-kubernetes>=4.1.0
apache-airflow-providers-datadog>=3.0.0
apache-airflow-providers-google>=8.0.0
apache-airflow-providers-http>=3.0.0
apache-airflow-providers-microsoft-mssql>=3.0.0
apache-airflow-providers-mongo>=3.0.0
apache-airflow-providers-mysql>=3.0.0
apache-airflow-providers-odbc>=3.0.0
apache-airflow-providers-sftp>=3.0.0
apache-airflow-providers-ssh>=3.0.0
apache-airflow-providers-airbyte>=3.0.0
apache-airflow-upgrade-check==1.4.0
Deployment
Other 3rd-party Helm chart
Deployment details
We manage a fork of the official Airflow chart, which we customize for our use case. Deployment is done on a v1.21 Kubernetes cluster hosted on Google Kubernetes Engine (GKE).
Anything else
This is not a critical problem, since using the operators in the classic way already allows us to use the custom XCom backend without issues. However, TaskFlow presents a much more readable and friendly way of producing code. Since we want to democratize and facilitate the implementation of DAGs across different teams in the same Airflow instance, using TaskFlow would be of great use to us.
Are you willing to submit PR?
Code of Conduct