Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support deleting the local log files when using remote logging #29772

Merged
merged 24 commits into from
Mar 7, 2023

Conversation

hussein-awala
Copy link
Member

closes: #16142


In this PR, I add a new parameter delete_local_copy for S3TaskHandler, GCSTaskHandler and OSSTaskHandler (WasbTaskHandler already has it), to support deleting the local copy of logs after uploading it to remote logging location.

To avoid adding unnecessary extra dependency between these providers and the Airflow core, I didn't create a new Airflow config as discussed in the issue, so users can still upgrade providers versions and use this feature without the need to upgrade the Airflow version.

@boring-cyborg boring-cyborg bot added area:logging area:providers provider:amazon-aws AWS/Amazon - related issues provider:microsoft-azure Azure-related issues provider:google Google (including GCP) related issues labels Feb 26, 2023
@eladkal
Copy link
Contributor

eladkal commented Feb 26, 2023

To avoid adding unnecessary extra dependency between these providers and the Airflow core, I didn't create a new Airflow config as discussed in the issue, so users can still upgrade providers versions and use this feature without the need to upgrade the Airflow version.

We should mention it in the docs (probably in remote logging part) otherwise users won't be aware they can set this

@potiuk
Copy link
Member

potiuk commented Feb 26, 2023

We should mention it in the docs (probably in remote logging part) otherwise users won't be aware they can set this

Yes. I agree right now it is not at all obvious how to do it (you really need to copy local settings and add new keyword parameters and somehow know that they are there (many of our users are kinda scared to look at the source code to find out how thing really are and expect documentation to tell them what they can do 🤷 ),

I would say it would be worth to actually add an option as well to airflow configuration (and update default local settins to read the configuration) - it will still allow to use the feature without upgrading airflow. But in the future it will be as easy as modifying the configuration.

Also you will be able to copy the local_settings from latest airflow version, add configuration (even if you do not upgrade) and new providers and - surprisingly - it will also work, because you do not have to have new airflow ot read new value from the config. I think for the future - compatibility, this could be even described in the docs how to approach it - copy the relevant part of settings, add configuration, and then when you migrate in the future, it will just work (TM).

WDYT?

@hussein-awala
Copy link
Member Author

We should mention it in the docs (probably in remote logging part) otherwise users won't be aware they can set this

I agree, I'll add this to the doc

@hussein-awala
Copy link
Member Author

hussein-awala commented Feb 26, 2023

I would say it would be worth to actually add an option as well to airflow configuration (and update default local settins to read the configuration) - it will still allow to use the feature without upgrading airflow. But in the future it will be as easy as modifying the configuration.
Also you will be able to copy the local_settings from latest airflow version, add configuration (even if you do not upgrade) and new providers and - surprisingly - it will also work, because you do not have to have new airflow ot read new value from the config. I think for the future - compatibility, this could be even described in the docs how to approach it - copy the relevant part of settings, add configuration, and then when you migrate in the future, it will just work (TM).

Yes, I thought about this, and I had two different options to add the config:

  • Add the config and use it as default value for delete_local_copy -> the users of Airflow < 2.6.0 should add the config manually because to avoid AirflowConfigException
  • Add the config and use it in a try except with a second default value False defined in each provider -> All the users can configure the RemoteTaskHandler using Airflow config or keyword config and it is fully bc

@eladkal
Copy link
Contributor

eladkal commented Feb 26, 2023

I think it's OK if this config will be Airflow>=2.6.0 only.
This is a nice to have feature, not something we should invest time in backporting as no one really ever complained about this.

As for providers, we can check for Airflow version in providers... for example:

if Version(version) < Version("2.4"):

@hussein-awala
Copy link
Member Author

I just realized that there is not conf to pass kwargs to TaskHandler class, and that the delete_local_copy of wasb handler is fixed to False, so the config is necessary to activate the option.

I'm wondering if we add a new conf remote_task_handler_kwargs to let old Airflow version users use the future version without the need to upgrade it.

Yes, I thought about this, and I had two different options to add the config:

  • Add the config and use it as default value for delete_local_copy -> the users of Airflow < 2.6.0 should add the config manually because to avoid AirflowConfigException
  • Add the config and use it in a try except with a second default value False defined in each provider -> All the users can configure the RemoteTaskHandler using Airflow config or keyword config and it is fully bc

My initial idea was based on this conf, but now the only solution to override the conf delete_local_copy for Airflow < 2.6.0 users is extending the TaskHandler class and override the value of this conf.

@hussein-awala hussein-awala marked this pull request as draft February 26, 2023 13:42
@o-nikolas
Copy link
Contributor

Looks like there are still some failing tests @hussein-awala

@hussein-awala hussein-awala marked this pull request as ready for review March 3, 2023 20:07
@hussein-awala hussein-awala requested a review from potiuk as a code owner March 3, 2023 20:07
@eladkal eladkal added this to the Airflow 2.6.0 milestone Mar 3, 2023
@hussein-awala
Copy link
Member Author

@o-nikolas I fixed the failing tests, can you take a look?

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@o-nikolas o-nikolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never loved how much copy/paste is involved with changes to the remote loggers. We should really create a nice base class/interface for that stuff. But that shouldn't block this useful contribution. Thanks for the work!
(left just a couple nits on wording)

airflow/config_templates/config.yml Outdated Show resolved Hide resolved
airflow/config_templates/default_airflow.cfg Outdated Show resolved Hide resolved
Co-authored-by: Niko Oliveira <onikolas@amazon.com>
@hussein-awala
Copy link
Member Author

We should really create a nice base class/interface for that stuff

I think it's difficult with the decoupling policy we have for providers.

@o-nikolas
Copy link
Contributor

We should really create a nice base class/interface for that stuff

I think it's difficult with the decoupling policy we have for providers.

Ehh, ya, but I think it's possible. We've done this with executors and they will soon live in entirely different provider packages. But either way, not a blocker for this PR

@eladkal
Copy link
Contributor

eladkal commented Mar 7, 2023

Ehh, ya, but I think it's possible. We've done this with executors and they will soon live in entirely different provider packages. But either way, not a blocker for this PR

Can you open a followup issue and describe the task?

@o-nikolas o-nikolas merged commit b6392ae into apache:main Mar 7, 2023
@o-nikolas
Copy link
Contributor

Ehh, ya, but I think it's possible. We've done this with executors and they will soon live in entirely different provider packages. But either way, not a blocker for this PR

Can you open a followup issue and describe the task?

Sure, will do 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:logging area:providers provider:amazon-aws AWS/Amazon - related issues provider:google Google (including GCP) related issues provider:microsoft-azure Azure-related issues type:new-feature Changelog: New Features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delete local logs when choosing remote logging
5 participants