Migrate ADLSListOperator from ADLS Gen1 to Gen2 #61188

Merged
shahar1 merged 7 commits into apache:main from nailo2c:bug-44228-ADLSListOperator_use_deprecated_domain
Feb 10, 2026

@nailo2c nailo2c commented Jan 28, 2026

closes: #44228

Why

The existing ADLSListOperator uses AzureDataLakeHook, which relies on the Gen1 SDK, and Gen1 has already been retired.

self._conn = core.AzureDLFileSystem(credential, store_name=self.account_name)

How

Replace it with AzureDataLakeStorageV2Hook, which uses the Gen2 SDK.

Given Gen1 is retired, the impact should be limited, but this is a breaking change.
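
As a rough illustration of the swap, the sketch below shows listing logic that delegates to a Gen2-style hook. It is not the provider's actual code: FakeGen2Hook and list_adls_paths are hypothetical names, and the list_files_directory method name and its parameters are an assumption modeled on the Gen2 hook, chosen so the example runs without Azure credentials.

```python
class FakeGen2Hook:
    """In-memory stand-in for a Gen2-style hook (hypothetical, for illustration only)."""

    def __init__(self, file_systems: dict) -> None:
        # Maps a file system (container) name to the paths stored in it.
        self._file_systems = file_systems

    def list_files_directory(self, file_system_name: str, directory_name: str) -> list:
        # Assumed method name and parameters; the real hook may differ.
        paths = self._file_systems[file_system_name]
        prefix = directory_name.strip("/")
        if not prefix:
            # An empty directory name means "list from the root".
            return list(paths)
        return [p for p in paths if p.startswith(prefix + "/")]


def list_adls_paths(hook, file_system_name: str, path: str = "") -> list:
    """Return the paths under `path` inside the given file system."""
    return hook.list_files_directory(
        file_system_name=file_system_name,
        directory_name=path,
    )


hook = FakeGen2Hook({"testcontainer": ["a.txt", "dir/b.txt"]})
print(list_adls_paths(hook, "testcontainer"))
print(list_adls_paths(hook, "testcontainer", "dir"))
```

Gen2 addresses data by file system (container) plus path, so the caller has to supply a file system name in addition to the path.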

What

I created an object (blob) in an Azure Storage account.

[screenshot: issue-44228-azure]

And I used this DAG to test whether I could fetch it.

from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.adls import ADLSListOperator
from airflow.providers.standard.operators.python import PythonOperator

with DAG(
    dag_id="test_adls_issue_44228_fixed",
    start_date=datetime(2026, 1, 1),
    schedule=None,
    catchup=False,
) as dag:

    list_files = ADLSListOperator(
        task_id="list_adls_files",
        file_system_name="testcontainer",
        path="",
        azure_data_lake_conn_id="adls_gen2_default",
    )

    def print_files(ti):
        files = ti.xcom_pull(task_ids="list_adls_files")
        print("=" * 50)
        print(f"Files found: {files}")
        print(f"Total count: {len(files) if files else 0}")
        print("=" * 50)

    print_result = PythonOperator(
        task_id="print_result",
        python_callable=print_files,
    )

    list_files >> print_result

It works pretty well.
[screenshot: issue-44228-airflow-dag]

Discussion

It seems AzureDataLakeHook uses the Gen 1 SDK. Perhaps we need to add @deprecated(...) to it?


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    Claude Opus 4.5


nailo2c commented Jan 29, 2026

Hmmm… it looks like ADLSToGCSOperator inherits from ADLSListOperator. Let me take a look.

@nailo2c nailo2c requested a review from shahar1 as a code owner January 29, 2026 21:45

nailo2c commented Jan 29, 2026

Okay, fixed! I added file_system_name to ADLSToGCSOperator and updated the related tests. If there's anything else I can improve, please don't hesitate to let me know :)

@shahar1 shahar1 left a comment

Discussion

It seems AzureDataLakeHook uses the Gen 1 SDK. Perhaps we need to add @deprecated(...) to it?

If the Gen 1 SDK is already retired, there's no point in retrospectively deprecating the hook, as that wouldn't serve a practical purpose (in general, it's better to deprecate properly well before retirement, but someone needs to track these changes on time 🙂).
However, as file_system_name is now mandatory, maybe instead of raising an unexplained TypeError when it is missing, we should explain its "sudden" introduction to the user (use case: people who used this operator in the past and don't understand why the parameter is now required):

    def __init__(
        self,
        *,
        src_adls: str,
        dest_gcs: str,
        azure_data_lake_conn_id: str,
        gcp_conn_id: str = "google_cloud_default",
        replace: bool = False,
        gzip: bool = False,
        google_impersonation_chain: str | Sequence[str] | None = None,
        **kwargs,
    ) -> None:
        file_system_name = kwargs.get('file_system_name')
        if not file_system_name:
            raise TypeError(
                "The 'file_system_name' parameter is required. "
                "ADLSListOperator has been migrated from Azure Data Lake Storage Gen1 (retired) "
                "to Gen2, which requires specifying a file system name. "
                "Please add file_system_name='your-container-name' to your operator instantiation."
            )

WDYT?

CC: @VladaZakharova @MaksYermak (for the GCP transfer operator)
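
A self-contained sketch of the suggestion above, assuming stand-in classes rather than the real operators: ListOperatorSketch and TransferOperatorSketch are hypothetical names, and the message text is shortened. It shows the subclass checking for file_system_name itself, so a user upgrading from the Gen1-based version gets an explanation instead of a generic missing-argument TypeError.

```python
class ListOperatorSketch:
    """Stand-in for the list operator; requires a file system name (hypothetical)."""

    def __init__(self, *, path: str, file_system_name: str, **kwargs) -> None:
        self.path = path
        self.file_system_name = file_system_name


class TransferOperatorSketch(ListOperatorSketch):
    """Stand-in for a transfer operator that inherits from the list operator."""

    def __init__(self, *, src_adls: str, dest_gcs: str, **kwargs) -> None:
        # Pull the value out of kwargs so it can be validated before the parent
        # constructor would fail with a less helpful message.
        file_system_name = kwargs.pop("file_system_name", None)
        if not file_system_name:
            raise TypeError(
                "The 'file_system_name' parameter is required. "
                "The operator now targets ADLS Gen2, which needs a file system name."
            )
        super().__init__(path=src_adls, file_system_name=file_system_name, **kwargs)
        self.dest_gcs = dest_gcs


try:
    TransferOperatorSketch(src_adls="raw/", dest_gcs="gs://bucket/")
except TypeError as err:
    print(f"Rejected: {err}")
```

Raising TypeError keeps the behavior consistent with what Python does for a missing required argument, while the message tells the user why the parameter appeared.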

@shahar1 shahar1 added the provider:google Google (including GCP) related issues label Jan 30, 2026

nailo2c commented Jan 31, 2026

Totally agree, that's much clearer. Let me add it!

@shahar1 shahar1 left a comment

LGTM, at least in the GCP aspects :)
@dabla could you please review the changes in the Azure provider?

@shahar1 shahar1 self-requested a review January 31, 2026 09:32

shahar1 commented Jan 31, 2026

@nailo2c Almost missed this one, my bad. We need to add the "# use next version" comment to the Azure provider entry in the Google provider's dependencies; otherwise it might break:

https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDERS.md#update-versions-of-dependent-providers-to-the-next-version


nailo2c commented Jan 31, 2026

Wow, I didn't know that, learned something new today! Thanks 😄


shahar1 commented Feb 7, 2026

@VladaZakharova @MaksYermak
As both CI and system tests passed, if no objections are made, this PR will be merged by the upcoming release on Tuesday.

@shahar1 shahar1 changed the title from "Migrate ADLSListOperator from ADLS Gen1 to Gen2 (#44228)" to "Migrate ADLSListOperator from ADLS Gen1 to Gen2" Feb 10, 2026
@shahar1 shahar1 merged commit c56b84c into apache:main Feb 10, 2026
102 checks passed
shahar1 added a commit to shahar1/airflow that referenced this pull request Feb 10, 2026
Ratasa143 pushed a commit to Ratasa143/airflow that referenced this pull request Feb 15, 2026

Labels

area:providers
provider:google (Google (including GCP) related issues)
provider:microsoft-azure (Azure-related issues)

Development

Successfully merging this pull request may close these issues.

Azure Data Lake connection will not work for blob.core.windows.net domain

4 participants