AWS GlueCrawlerOperator deletes existing tags on run #32330

@luos-fc

Apache Airflow version

2.6.2

What happened

We are currently on AWS Provider 6.0.0 and want to upgrade to the latest version, 8.2.0. However, the GlueCrawlerOperator makes the upgrade difficult: it attempts to update the crawler's tags on every run. Because we manage our resource tagging through Terraform, we do not pass any tags to the operator, so all of the existing tags are deleted. Running the crawler at all also requires adding glue:GetTags and glue:UntagResource permissions to the relevant IAM roles.

It seems strange that the operator's default behaviour has been changed to modify infrastructure, especially as this differs from the GlueJobOperator, which only performs updates when certain parameters are set. Something similar could be done here: if no Tags key is present in the config dict, the tags are not modified at all. I'm not sure what the best approach is.
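A minimal sketch of that idea, to make the proposal concrete. This is not the provider's actual code: `sync_tags_if_requested` and `FakeGlueClient` are hypothetical names used only for illustration, and the fake client just mimics the keyword arguments of the Glue client's `get_tags`, `tag_resource`, and `untag_resource` calls so the example runs without AWS access.

```python
from __future__ import annotations

from typing import Any


class FakeGlueClient:
    """In-memory stand-in for a Glue client (hypothetical test double)."""

    def __init__(self, tags: dict[str, str]) -> None:
        self.tags = dict(tags)
        self.calls: list[str] = []  # records which tag APIs were invoked

    def get_tags(self, ResourceArn: str) -> dict[str, Any]:
        self.calls.append("get_tags")
        return {"Tags": dict(self.tags)}

    def tag_resource(self, ResourceArn: str, TagsToAdd: dict[str, str]) -> None:
        self.calls.append("tag_resource")
        self.tags.update(TagsToAdd)

    def untag_resource(self, ResourceArn: str, TagsToRemove: list[str]) -> None:
        self.calls.append("untag_resource")
        for key in TagsToRemove:
            self.tags.pop(key, None)


def sync_tags_if_requested(client: Any, crawler_arn: str, config: dict[str, Any]) -> bool:
    """Reconcile tags only when the caller supplied a "Tags" key in config.

    Returns True if tags were reconciled, False if they were left untouched.
    """
    if "Tags" not in config:
        # No "Tags" key: leave externally managed tags (e.g. Terraform) alone
        # and make no tag-related API calls at all.
        return False

    wanted = config["Tags"]
    current = client.get_tags(ResourceArn=crawler_arn)["Tags"]

    # Keys present on the crawler but absent from the requested tags.
    to_remove = [key for key in current if key not in wanted]
    # Keys that are new or whose value differs from the current one.
    to_add = {key: value for key, value in wanted.items() if current.get(key) != value}

    if to_remove:
        client.untag_resource(ResourceArn=crawler_arn, TagsToRemove=to_remove)
    if to_add:
        client.tag_resource(ResourceArn=crawler_arn, TagsToAdd=to_add)
    return True
```

With a guard like this, a config without a Tags key would trigger no tag calls, so the extra glue:GetTags and glue:UntagResource permissions would only be needed by users who actually pass tags. An explicit empty `Tags` dict would still remove every tag, which keeps an opt-in way to clear them.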

What you think should happen instead

The crawler should run without altering the existing infrastructure.

How to reproduce

Run a GlueCrawlerOperator whose config contains no tags against a crawler that already has tags.

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

Amazon 8.2.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
