Change default .airflowignore syntax to glob (apache#42436)
Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>
2 people authored and ellisms committed Nov 13, 2024
1 parent 3645539 commit 161786c
Showing 9 changed files with 29 additions and 37 deletions.
2 changes: 1 addition & 1 deletion airflow/config_templates/config.yml
@@ -310,7 +310,7 @@ core:
version_added: 2.3.0
type: string
example: ~
- default: "regexp"
+ default: "glob"
default_task_retries:
description: |
The number of retries each task is going to have by default. Can be overridden at dag or task level.
6 changes: 3 additions & 3 deletions airflow/utils/file.py
@@ -221,7 +221,7 @@ def _find_path_from_directory(
def find_path_from_directory(
base_dir_path: str | os.PathLike[str],
ignore_file_name: str,
- ignore_file_syntax: str = conf.get_mandatory_value("core", "DAG_IGNORE_FILE_SYNTAX", fallback="regexp"),
+ ignore_file_syntax: str = conf.get_mandatory_value("core", "DAG_IGNORE_FILE_SYNTAX", fallback="glob"),
) -> Generator[str, None, None]:
"""
Recursively search the base path for a list of file paths that should not be ignored.
@@ -232,9 +232,9 @@ def find_path_from_directory(
:return: a generator of file paths.
"""
- if ignore_file_syntax == "glob":
+ if ignore_file_syntax == "glob" or not ignore_file_syntax:
return _find_path_from_directory(base_dir_path, ignore_file_name, _GlobIgnoreRule)
- elif ignore_file_syntax == "regexp" or not ignore_file_syntax:
+ elif ignore_file_syntax == "regexp":
return _find_path_from_directory(base_dir_path, ignore_file_name, _RegexpIgnoreRule)
else:
raise ValueError(f"Unsupported ignore_file_syntax: {ignore_file_syntax}")
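
The effect of the reordered branches can be sketched in isolation. This is a minimal, hypothetical stand-in and not the Airflow function itself: `pick_ignore_rule` is an invented name, and it returns a label rather than the private `_GlobIgnoreRule` / `_RegexpIgnoreRule` classes. It assumes only what the hunk above shows, namely that an empty or missing value now selects glob instead of regexp.

```python
def pick_ignore_rule(ignore_file_syntax):
    """Hypothetical helper mirroring the branch order shown in the hunk above."""
    # An empty string or None now falls through to glob (previously regexp).
    if ignore_file_syntax == "glob" or not ignore_file_syntax:
        return "glob"
    elif ignore_file_syntax == "regexp":
        return "regexp"
    else:
        # Any other value is rejected, as in the original function.
        raise ValueError(f"Unsupported ignore_file_syntax: {ignore_file_syntax}")
```

Callers that relied on an unset value meaning regexp would need to pass `"regexp"` explicitly, which is what the test change later in this commit does.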
@@ -125,14 +125,7 @@ for the paths that should be ignored. You do not need to have that file in any o
In the example above the DAGs are only in ``my_custom_dags`` folder, the ``common_package`` should not be
scanned by scheduler when searching for DAGS, so we should ignore ``common_package`` folder. You also
want to ignore the ``base_dag.py`` if you keep a base DAG there that ``my_dag1.py`` and ``my_dag2.py`` derives
- from. Your ``.airflowignore`` should look then like this:
-
- .. code-block:: none
- my_company/common_package/.*
- my_company/my_custom_dags/base_dag\.py
- If ``DAG_IGNORE_FILE_SYNTAX`` is set to ``glob``, the equivalent ``.airflowignore`` file would be:
+ from. Your ``.airflowignore`` should look then like this (using the default ``glob`` syntax):

.. code-block:: none
31 changes: 12 additions & 19 deletions docs/apache-airflow/core-concepts/dags.rst
@@ -712,19 +712,9 @@ configuration parameter (*added in Airflow 2.3*): ``regexp`` and ``glob``.

.. note::

- The default ``DAG_IGNORE_FILE_SYNTAX`` is ``regexp`` to ensure backwards compatibility.
+ The default ``DAG_IGNORE_FILE_SYNTAX`` is ``glob`` in Airflow 3 or later (in previous versions it was ``regexp``).

- For the ``regexp`` pattern syntax (the default), each line in ``.airflowignore``
- specifies a regular expression pattern, and directories or files whose names (not DAG id)
- match any of the patterns would be ignored (under the hood, ``Pattern.search()`` is used
- to match the pattern). Use the ``#`` character to indicate a comment; all characters
- on lines starting with ``#`` will be ignored.
-
- As with most regexp matching in Airflow, the regexp engine is ``re2``, which explicitly
- doesn't support many advanced features, please check its
- `documentation <https://github.com/google/re2/wiki/Syntax>`_ for more information.
-
- With the ``glob`` syntax, the patterns work just like those in a ``.gitignore`` file:
+ With the ``glob`` syntax (the default), the patterns work just like those in a ``.gitignore`` file:

* The ``*`` character will match any number of characters, except ``/``
* The ``?`` character will match any single character, except ``/``
@@ -738,15 +728,18 @@ With the ``glob`` syntax, the patterns work just like those in a ``.gitignore``
is relative to the directory level of the particular .airflowignore file itself. Otherwise the
pattern may also match at any level below the .airflowignore level.
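
The two character rules listed above (``*`` and ``?`` never match ``/``) can be illustrated with a small translation to a regular expression. This is a sketch under stated assumptions, not how Airflow implements glob matching: `simple_glob_to_regex` and `matches` are hypothetical names, and the sketch covers only ``*`` and ``?``, ignoring every other rule in the list.

```python
import re


def simple_glob_to_regex(pattern):
    """Translate only ``*`` and ``?`` (hypothetical helper, not Airflow code)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")  # any number of characters, except "/"
        elif ch == "?":
            parts.append("[^/]")  # exactly one character, except "/"
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts))


def matches(pattern, name):
    """Return True when the whole name matches the translated pattern."""
    return simple_glob_to_regex(pattern).fullmatch(name) is not None
```

With this sketch, `matches("*_ignore_this.py", "dag_ignore_this.py")` holds, while `matches("*_ignore_this.py", "sub/dag_ignore_this.py")` does not, because ``*`` stops at the path separator.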

- The ``.airflowignore`` file should be put in your ``DAG_FOLDER``. For example, you can prepare
- a ``.airflowignore`` file using the ``regexp`` syntax with content
-
- .. code-block::
+ For the ``regexp`` pattern syntax, each line in ``.airflowignore``
+ specifies a regular expression pattern, and directories or files whose names (not DAG id)
+ match any of the patterns would be ignored (under the hood, ``Pattern.search()`` is used
+ to match the pattern). Use the ``#`` character to indicate a comment; all characters
+ on lines starting with ``#`` will be ignored.

- project_a
- tenant_[\d]
+ As with most regexp matching in Airflow, the regexp engine is ``re2``, which explicitly
+ doesn't support many advanced features, please check its
+ `documentation <https://github.com/google/re2/wiki/Syntax>`_ for more information.
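
Because ``Pattern.search()`` is unanchored, a regexp line matches anywhere inside a path, which is why short patterns ignore more than they might appear to. The sketch below uses the two patterns from the regexp example removed in this hunk, with the standard-library ``re`` module as a stand-in for ``re2`` (an assumption; the two engines are expected to agree on patterns this simple). The name `is_ignored` and the sample file names are hypothetical.

```python
import re

# The two patterns from the removed regexp example in this hunk.
patterns = [re.compile(p) for p in ["project_a", r"tenant_[\d]"]]


def is_ignored(path):
    """Hypothetical helper: ignore a path if any pattern is found anywhere in it."""
    return any(p.search(path) for p in patterns)


for candidate in ["project_a_dag_1.py", "TESTING_project_a.py", "tenant_1.py", "project_b.py"]:
    print(candidate, is_ignored(candidate))
```

Only `project_b.py` survives here. Under the glob default, a bare `project_a` line would not match `TESTING_project_a.py`, so files migrated from regexp may need explicit wildcards.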

- Or, equivalently, in the ``glob`` syntax
+ The ``.airflowignore`` file should be put in your ``DAG_FOLDER``. For example, you can prepare
+ a ``.airflowignore`` file with the ``glob`` syntax

.. code-block::
2 changes: 1 addition & 1 deletion docs/apache-airflow/howto/dynamic-dag-generation.rst
@@ -91,7 +91,7 @@ Then you can import and use the ``ALL_TASKS`` constant in all your DAGs like tha
...
Don't forget that in this case you need to add empty ``__init__.py`` file in the ``my_company_utils`` folder
- and you should add the ``my_company_utils/.*`` line to ``.airflowignore`` file (if using the regexp ignore
+ and you should add the ``my_company_utils/*`` line to ``.airflowignore`` file (using the default glob
syntax), so that the whole folder is ignored by the scheduler when it looks for DAGs.


7 changes: 7 additions & 0 deletions newsfragments/42436.significant.rst
@@ -0,0 +1,7 @@
Default ``.airflowignore`` syntax changed to ``glob``

The default value to the configuration ``[core] dag_ignore_file_syntax`` has
been changed to ``glob``, which better matches the ignore file behavior of many
popular tools.

To revert to the previous behavior, set the configuration to ``regexp``.
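
One way to apply that setting without editing the configuration file is an environment variable, assuming Airflow's usual ``AIRFLOW__{SECTION}__{KEY}`` override convention. The helper name `airflow_env_var` is hypothetical and exists only to show how the variable name is derived.

```python
import os


def airflow_env_var(section, key):
    """Build an AIRFLOW__{SECTION}__{KEY} override name (hypothetical helper)."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


name = airflow_env_var("core", "dag_ignore_file_syntax")

# Only affects this process and any child processes it starts.
os.environ[name] = "regexp"
print(name)
```

In a real deployment the variable would normally be exported in the environment of the Airflow processes themselves rather than set from Python.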
5 changes: 2 additions & 3 deletions tests/dags/.airflowignore
@@ -1,3 +1,2 @@
- .*_invalid.* # Skip invalid files
- subdir3 # Skip the nested subdir3 directory
- # *badrule # This rule is an invalid regex. It would be warned about and skipped.
+ *_invalid_* # Skip invalid files
+ subdir3 # Skip the nested subdir3 directory
2 changes: 1 addition & 1 deletion tests/dags/subdir1/.airflowignore
@@ -1 +1 @@
- .*_ignore_this.py # Ignore files ending with "_ignore_this.py"
+ *_ignore_this.py # Ignore files ending with "_ignore_this.py"
2 changes: 1 addition & 1 deletion tests/plugins/test_plugin_ignore.py
@@ -77,7 +77,7 @@ def test_find_not_should_ignore_path_regexp(self, tmp_path):
"test_load_sub1.py",
}
ignore_list_file = ".airflowignore"
- for file_path in find_path_from_directory(plugin_folder_path, ignore_list_file):
+ for file_path in find_path_from_directory(plugin_folder_path, ignore_list_file, "regexp"):
file_path = Path(file_path)
if file_path.is_file() and file_path.suffix == ".py":
detected_files.add(file_path.name)