Remove select_column option in TaskInstance.get_task_instance #38571

Contributor

@dstandish dstandish commented Mar 27, 2024

Fundamentally, what's going on here is that we need a TaskInstance object instead of a Row object when sending it over the wire in an RPC call. But the full story on this one is actually somewhat complicated.
It was back in 2.2.0, in #25312, that we switched to querying the column attrs instead of the TI object (#28900 only refactored this logic into a function). The reason was to avoid locking the dag_run table, since TI had newly gained a dag_run relationship attr. Now this causes a problem with AIP-44, because the RPC API does not know how to serialize a Row object.
This PR switches back to querying a TaskInstance object, but avoids locking dag_run by using a lazy-load option for the dag_run relationship. Meanwhile, since try_number is a horrible attribute (it gives you a different answer depending on the state), we have to go back to reading the underlying private attr instead of the public accessor.
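
To illustrate the serialization side of this, here is a minimal sketch in plain Python, not Airflow's actual RPC code. `ToyTaskInstance`, `ToyRow`, `SERIALIZERS`, and `serialize` are hypothetical names made up for the example. The only point is that a serializer which dispatches on known types can handle the model object but has nothing registered for a row coming back from a column-only query.

```python
from collections import namedtuple
from dataclasses import asdict, dataclass


@dataclass
class ToyTaskInstance:
    """Hypothetical stand-in for a model object the serializer knows about."""

    dag_id: str
    task_id: str
    run_id: str
    state: str


# Hypothetical stand-in for the row returned when only columns are selected.
ToyRow = namedtuple("ToyRow", ["dag_id", "task_id", "run_id", "state"])

# Registry of encoders keyed by exact type (an assumption of this sketch).
SERIALIZERS = {
    ToyTaskInstance: lambda ti: {"__type": "ToyTaskInstance", **asdict(ti)},
}


def serialize(obj):
    """Encode obj with its registered encoder, or fail loudly if none exists."""
    try:
        encoder = SERIALIZERS[type(obj)]
    except KeyError:
        raise TypeError(f"no serializer registered for {type(obj).__name__}") from None
    return encoder(obj)


# The model object goes through; the row object would raise TypeError.
payload = serialize(ToyTaskInstance("my_dag", "my_task", "run_1", "running"))
print(payload["__type"], payload["task_id"])
```

Both objects carry the same four values, so the failure is about the container type rather than the data, which is why returning the model object is enough to get it across the wire.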

Older description:
This was originally added in #28900, presumably for compatibility with serialization. Maybe things have changed since then, because it's actually the Row object that does not serialize properly (and that is what's returned when this option is True), while the TaskInstance object does serialize properly. This is the only usage of this param, and it's not needed here, so I'm just removing it. You could argue that it's public and can't be removed, but I think it's pretty safe.

@dstandish
Contributor Author

I think there were still some tests failing @jedcunningham

@jedcunningham
Member

> I think there were still some tests failing @jedcunningham

The one I looked at didn't seem obviously related, in the 10 seconds I looked at it. So I figured a fresh run wouldn't hurt. But maybe it is related :)

@potiuk
Member

potiuk commented Mar 29, 2024

Those test failures looked like an abrupt failure of the docker engine in the middle of testing. I re-ran just the tests; if it appears again, then we have something interesting here.

@dstandish
Contributor Author

Yeah, there was some weird error related to a sensor test. I don't think it's really an issue, probably something with the test, but I was still working through it.

@dstandish
Contributor Author

oh -- I thought you merged it @jedcunningham -- sorry i was confused 🙃

@dstandish dstandish force-pushed the remove-select_column-option-in-taskinstance_get_task_instance branch from a7b15cb to c9585c5 Compare March 30, 2024 17:45
@dstandish
Contributor Author

Figured out what was going on here. It was a very confusing one, not obvious. Basically, switching back to just querying the TI had the effect of incrementing try_number every time ti.refresh_from_db is called (since try_number is bananas). This non-obviously caused a failure in the test of the reschedule poke mode sensor, because it could not find the right reschedule db obj, so the start date was advancing when it shouldn't have. The reason was just try_number shenanigans. But then I found that the real reason we were querying attrs directly (instead of the ORM obj) was to avoid locking dag_run! We should be able to do that by simply lazy loading the attr. So I added two changes: (1) lazy load the dag run attr in this func, and (2) go back to setting try_number from the private attr _try_number (which is what it was before the deadlock fix was originally added). Well, that was a mouthful...
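
A rough sketch of the try_number behaviour described above, assuming the state-dependent accessor returns the stored value while the task is running and the stored value plus one otherwise. `ToyTaskInstance`, `refresh_via_public`, `refresh_via_private`, and `simulate` are hypothetical stand-ins written for this example, not the real `TaskInstance` or `refresh_from_db`.

```python
class ToyTaskInstance:
    """Hypothetical stand-in with a state-dependent try_number accessor."""

    def __init__(self, stored_try_number=0, state=None):
        self._try_number = stored_try_number
        self.state = state

    @property
    def try_number(self):
        # Assumed rule: the stored value while running, stored value + 1 otherwise.
        if self.state == "running":
            return self._try_number
        return self._try_number + 1

    @try_number.setter
    def try_number(self, value):
        self._try_number = value


def refresh_via_public(target, fresh):
    """Copy fields using the public accessor: picks up the +1 when not running."""
    target.state = fresh.state
    target.try_number = fresh.try_number


def refresh_via_private(target, fresh):
    """Copy fields using the private attr: the stored value is preserved."""
    target.state = fresh.state
    target.try_number = fresh._try_number


def simulate(refresh, rounds):
    """Refresh an in-memory copy from a stored copy and persist it back, `rounds` times."""
    stored = ToyTaskInstance(stored_try_number=1, state="up_for_reschedule")
    for _ in range(rounds):
        in_memory = ToyTaskInstance()
        refresh(in_memory, stored)
        # Assumption for the sketch: the refreshed object is written back as-is.
        stored._try_number = in_memory._try_number
    return stored._try_number


print(simulate(refresh_via_public, 3))   # 4: drifts by one per refresh
print(simulate(refresh_via_private, 3))  # 1: stays at the stored value
```

In this toy model, anything keyed on the try number (such as a reschedule record) would no longer match after a refresh through the public accessor, which is consistent with the sensor test failure described in the comment.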

@uranusjr
Member

uranusjr commented Apr 2, 2024

I hope we’ll be able to clean up the try_number bs when we implement AIP-64.

@dstandish dstandish merged commit 583fa2d into apache:main Apr 2, 2024
41 checks passed
@dstandish dstandish deleted the remove-select_column-option-in-taskinstance_get_task_instance branch April 2, 2024 16:16
@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Jun 3, 2024
@dstandish dstandish added this to the Airflow 2.10.0 milestone Jul 18, 2024